Going from audio to MIDI is like going from Photoshop to Illustrator, or from low-res graphics to high-res graphics: the target has more inherent structure and information depth than the source, so the conversion is harder than the reverse, and it is nondeterministic. With the right DSP knowledge you might get partway there (quite possibly NOT in real time), but even that is not easy and is generally to be avoided if you can.
If you manage to achieve the DSP (do you have an engineering degree? :-/) then you will probably have to write the AU from the MIDI AU base class. Because note scales are exponential (i.e. octave based, just like the human ear), you'll probably have to consider something like a Patterson-Holdsworth filter. A regular FFT has linearly spaced frequency bins instead, so it is prone to errors at all tones/frequencies other than one selected frequency. Perhaps not surprisingly, people at Apple have been doing R&D on this, and you'll notice the kind of math required. Each note in an octave gets its own filter, so that would be 12 "gammatone" filters per octave:
https://engineering.purdue.edu/~malcolm/apple/tr35/PattersonsEar.pdf
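To make the "exponential notes vs. linear FFT" point concrete, here is a rough sketch in Python. It is not a pitch detector and not anything from the Apple report; it only compares the gap between neighbouring semitones with the width of one FFT bin. The function names, the A4 = 440 Hz equal-temperament tuning, and the 44.1 kHz / 1024-point FFT settings are my own assumptions for illustration.

```python
import math

A4_HZ = 440.0   # assumed tuning reference (equal temperament)
A4_MIDI = 69    # MIDI note number of A4


def midi_to_hz(note: int) -> float:
    """Frequency of a MIDI note: doubles every 12 notes (one octave)."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def hz_to_midi(freq_hz: float) -> int:
    """Nearest MIDI note number for a frequency (logarithmic mapping)."""
    return round(A4_MIDI + 12.0 * math.log2(freq_hz / A4_HZ))


def semitone_gap_hz(note: int) -> float:
    """Distance in Hz from this note up to the next semitone."""
    return midi_to_hz(note + 1) - midi_to_hz(note)


def fft_bin_width_hz(sample_rate: float, fft_size: int) -> float:
    """Every FFT bin has the same width, regardless of frequency."""
    return sample_rate / fft_size


if __name__ == "__main__":
    width = fft_bin_width_hz(44100.0, 1024)  # assumed settings
    print(f"FFT bin width: {width:.2f} Hz")
    for note in (28, 69, 100):  # a low, a middle and a high note
        gap = semitone_gap_hz(note)
        print(f"note {note}: {midi_to_hz(note):8.2f} Hz, "
              f"gap to next semitone {gap:7.2f} Hz "
              f"({gap / width:.2f} bins)")
```

With these assumed settings, neighbouring low notes sit much closer together than one FFT bin, while high notes are spread over several bins. That mismatch is why a filter bank with logarithmically spaced center frequencies (one filter per note) fits the problem better than a plain FFT.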