Turning on setVoiceProcessingEnabled bumps channel count to 5

Hi all,

The use of setVoiceProcessingEnabled increases the channel count of my microphone audio from 1 to 5. This has downstream effects, because when I use AVAudioConverter to convert between PCM buffer types the output buffer contains only silence.

Here is a reproduction showing the channel growth from 1 to 5:

    let avAudioEngine: AVAudioEngine = AVAudioEngine()
    let inputNode = avAudioEngine.inputNode
    print(inputNode.inputFormat(forBus: 0))
    // Prints <AVAudioFormat 0x600002f7ada0:  1 ch,  48000 Hz, Float32>

    do {
        try inputNode.setVoiceProcessingEnabled(true)
    } catch {
        print("Could not enable voice processing \(error)")
        return
    }

    print(inputNode.inputFormat(forBus: 0))
    // Prints <AVAudioFormat 0x600002f7b020:  5 ch,  44100 Hz, Float32, deinterleaved>

If it helps, the reason I'm using setVoiceProcessingEnabled is that I don't want the mic to pick up output from the speakers. Per WWDC:

When enabled, extra signal processing is applied on the incoming audio, and any audio that is coming from the device is taken

Here is my conversion logic from the input PCM format (which in the case above is 5 ch, 44.1 kHz, Float32, deinterleaved) to the target format, single-channel PCM16:

    let outputFormat = AVAudioFormat(
        commonFormat: .pcmFormatInt16,
        sampleRate: inputPCMFormat.sampleRate,
        channels: 1,
        interleaved: false
    )

    guard let converter = AVAudioConverter(
        from: inputPCMFormat,
        to: outputFormat) else {
        fatalError("Demonstration")
    }

    let newLength = AVAudioFrameCount(outputFormat.sampleRate * 2.0)
    guard let outputBuffer = AVAudioPCMBuffer(
        pcmFormat: outputFormat,
        frameCapacity: newLength) else {
        fatalError("Demonstration")
    }
    outputBuffer.frameLength = newLength

    try! converter.convert(to: outputBuffer, from: inputBuffer)

    // Use the PCM16 outputBuffer

The outputBuffer contains only silence. But if I comment out inputNode.setVoiceProcessingEnabled(true) in the first snippet, the outputBuffer then plays exactly how I would expect it to.
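One workaround I've been experimenting with is to pull a single channel out of the voice-processed buffer myself before converting. This is only a sketch: it assumes the input really is deinterleaved Float32, and it assumes channel 0 is the one carrying the processed voice signal (I haven't confirmed that last part).

```swift
import AVFoundation

/// Copies channel 0 of a deinterleaved Float32 buffer into a fresh mono buffer.
/// Assumption: channel 0 holds the signal we care about.
func extractFirstChannel(from buffer: AVAudioPCMBuffer) -> AVAudioPCMBuffer? {
    guard let sourceData = buffer.floatChannelData else { return nil }
    // The commonFormat initializer is failable, hence the guard.
    guard let monoFormat = AVAudioFormat(
        commonFormat: .pcmFormatFloat32,
        sampleRate: buffer.format.sampleRate,
        channels: 1,
        interleaved: false
    ) else { return nil }
    guard let monoBuffer = AVAudioPCMBuffer(
        pcmFormat: monoFormat,
        frameCapacity: buffer.frameLength
    ) else { return nil }
    monoBuffer.frameLength = buffer.frameLength
    // floatChannelData[0] is the first (and, here, the only relevant) channel.
    monoBuffer.floatChannelData![0].update(
        from: sourceData[0],
        count: Int(buffer.frameLength)
    )
    return monoBuffer
}
```

The resulting mono Float32 buffer could then be handed to AVAudioConverter for the Float32 → Int16 step only, sidestepping the channel downmix entirely.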

So I have two questions:

  1. Why does setVoiceProcessingEnabled increase the channel count to 5?
  2. How should I convert the resulting format to a single-channel PCM16 format?

Thank you, Lou

I just found something interesting: while AVAudioConverter doesn't play nicely with the five channels, AVAudioEngine's built-in converters seem to. If I install a tap like this:

    let desiredTapFormat = AVAudioFormat(
        commonFormat: .pcmFormatInt16,
        sampleRate: inputPCMFormat.sampleRate,
        channels: 1,
        interleaved: false
    )

    inputNode.installTap(onBus: 0, bufferSize: 256, format: desiredTapFormat) { buffer, when in ... }

I find that the buffer argument already has a single channel, and it's not silence!

One more thing: I took the AVAudioFrameCount(outputFormat.sampleRate * 2.0) bit from some TensorFlow example code [1], but I don't think that computation is correct and have since removed it.
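For what it's worth, a capacity derived from the sample-rate ratio seems more defensible than a hard-coded two seconds of audio. A sketch of what I mean (the function name and signature are my own, not from any framework):

```swift
import AVFoundation

/// Frame capacity needed so a converted buffer can hold all of `inputFrames`
/// after resampling from `inputRate` to `outputRate`.
func outputCapacity(inputFrames: AVAudioFrameCount,
                    inputRate: Double,
                    outputRate: Double) -> AVAudioFrameCount {
    // Round up so the tail of the converted audio is never truncated.
    AVAudioFrameCount((Double(inputFrames) * outputRate / inputRate).rounded(.up))
}
```

When the two rates match, as in my case above, this just reduces to inputBuffer.frameLength.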

[1] https://sourcegraph.com/github.com/tensorflow/examples/-/blob/lite/examples/speech_commands/ios/SpeechCommands/AudioInputManager/AudioInputManager.swift?L89
