Delay in Microphone Input When Talking While Receiving Audio in PTT Framework (Full Duplex Mode)

Context:

I am currently developing an app using the Push-to-Talk (PTT) framework. I have reviewed both the PTT framework documentation and the CallKit demo project to better understand how to properly manage audio session activation and AVAudioEngine setup.

I am not activating the audio session manually. The audio session configuration is handled in the incomingPushResult or didBeginTransmitting callbacks from the PTChannelManagerDelegate.

I am using a single AVAudioEngine instance for both input and playback. The engine is started in the didActivate callback from the PTChannelManagerDelegate. When I receive a push in full duplex mode, I set the active participant to the user who is speaking.
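For context, this is a sketch of the activation flow described above, assuming an `audioEngine` property on the delegate; the delegate method signature is from the PushToTalk framework:

```swift
import AVFoundation
import PushToTalk

// Sketch: start the shared engine only when the system activates the
// session. The session itself is configured and activated by the system,
// so there is no setActive call here.
extension ChannelManagerDelegate {
    func channelManager(_ channelManager: PTChannelManager,
                        didActivate audioSession: AVAudioSession) {
        do {
            try audioEngine.start()
        } catch {
            print("Failed to start audio engine: \(error)")
        }
    }
}
```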


Issue

When I attempt to talk while the other participant is already speaking, my input tap on the input node takes a few seconds to return valid PCM audio data. Initially, it returns an empty PCM audio block.

Details:

  • The audio session is already active and configured with .playAndRecord.
  • The input tap is already installed when the engine is started.
  • When I talk from a neutral state (no one is speaking), the system plays the standard "microphone activation" tone, which covers this initial delay. However, this does not happen when I am already receiving audio.

Assumptions / Current Setup

  • Because the audio session is active in play and record, I assumed that microphone input would be available immediately, even while receiving audio.
  • However, there seems to be a delay before valid input is delivered to the tap, only occurring when switching from a receive state to simultaneously talking.

Questions

  1. Is this expected behavior when using the PTT framework in full duplex mode with a shared AVAudioEngine?
  2. Should I be restarting or reconfiguring the engine or audio session when beginning to talk while receiving audio?
  3. Is there a recommended pattern for managing microphone readiness in this scenario to avoid the initial empty PCM buffer?
  4. Would using separate engines for input and output improve responsiveness?

I would like to confirm the correct approach to handling simultaneous talk and receive in full duplex mode using the PTT framework and AVAudioEngine. Specifically, I need guidance on ensuring the microphone is ready to capture audio immediately, without the delay seen in my current implementation.


Relevant Code Snippets

Engine Setup

func setup() {
    let input = audioEngine.inputNode
    do {
        try input.setVoiceProcessingEnabled(true)
    } catch {
        print("Could not enable voice processing \(error)")
        return
    }

    input.isVoiceProcessingAGCEnabled = false

    let output = audioEngine.outputNode
    let mainMixer = audioEngine.mainMixerNode

    // Player nodes must be attached to the engine before they can be connected.
    audioEngine.attach(pttPlayerNode)
    audioEngine.attach(beepNode)

    audioEngine.connect(pttPlayerNode, to: mainMixer, format: outputFormat)
    audioEngine.connect(beepNode, to: mainMixer, format: outputFormat)
    audioEngine.connect(mainMixer, to: output, format: outputFormat)

    // Initialize converters
    converter = AVAudioConverter(from: inputFormat, to: outputFormat)!
    f32ToInt16Converter = AVAudioConverter(from: outputFormat, to: inputFormat)!

    audioEngine.prepare()
}

Input Tap Installation

func installTap() {
    guard AudioHandler.shared.checkMicrophonePermission() else {
        print("Microphone not granted for recording")
        return
    }

    guard !isInputTapped else {
        print("[AudioEngine] Input is already tapped!")
        return
    }

    let input = audioEngine.inputNode
    let microphoneFormat = input.inputFormat(forBus: 0)
    let microphoneDownsampler = AVAudioConverter(from: microphoneFormat, to: outputFormat)!
    let desiredFormat = outputFormat
    let inputFramesNeeded = AVAudioFrameCount((Double(OpusCodec.DECODED_PACKET_NUM_SAMPLES) * microphoneFormat.sampleRate) / desiredFormat.sampleRate)
    input.installTap(onBus: 0, bufferSize: inputFramesNeeded, format: microphoneFormat) { [weak self] buffer, _ in
        guard let self = self else { return }
        // Output buffer: 1920 frames at 16kHz
        guard let outputBuffer = AVAudioPCMBuffer(pcmFormat: desiredFormat, frameCapacity: AVAudioFrameCount(OpusCodec.DECODED_PACKET_NUM_SAMPLES)) else { return }
        outputBuffer.frameLength = outputBuffer.frameCapacity

        // Hand the tap buffer to the converter exactly once per callback;
        // if the converter asks for more input, report .noDataNow instead
        // of re-supplying the same buffer.
        var consumed = false
        let inputBlock: AVAudioConverterInputBlock = { _, outStatus in
            if consumed {
                outStatus.pointee = .noDataNow
                return nil
            }
            consumed = true
            outStatus.pointee = .haveData
            return buffer
        }

        var error: NSError?
        let converterResult = microphoneDownsampler.convert(to: outputBuffer, error: &error, withInputFrom: inputBlock)

        if converterResult != .haveData {
            DebugLogger.shared.print("Downsample error \(converterResult)")
        } else {
            self.handleDownsampledBuffer(outputBuffer)
        }
    }
    isInputTapped = true
}
Answered by DTS Engineer in 852071022

When I talk from a neutral state (no one is speaking), the system plays the standard "microphone activation" tone, which covers this initial delay. However, this does not happen when I am already receiving audio.

Can you file a bug about the second (no tone) case and post the bug number back here? That's not what I expected and may be a bug.

Because the audio session is active in play and record, I assumed that microphone input would be available immediately, even while receiving audio.

That assumption is incorrect. It shouldn't be "long", but there will be a delay. What's actually going on here is callservicesd "releasing" audio input to your app, which does cause a short delay. I believe the delay is roughly the same as unmuting a CallKit call.

One thing to understand here is that, just like CallKit*, the PTT audio session is NOT actually a standard PlayAndRecord session. It can do things that the standard PlayAndRecord cannot (for example, it CANNOT be interrupted by other PlayAndRecord sessions) but it's also being manipulated by "external" controls in ways that other sessions are not.

*As background context, the PTT session is implemented and managed by the same "infrastructure" CallKit uses, which is why you see similar functionality.

  1. Is this expected behavior when using the PTT framework in full duplex mode with a shared AVAudioEngine?

Yes. The time you're describing sounds like it's on the "long" side, but the basic behavior is normal.

  2. Should I be restarting or reconfiguring the engine or audio session when beginning to talk while receiving audio?

No. Once you go active, don't mess with your audio session.

  3. Is there a recommended pattern for managing microphone readiness in this scenario to avoid the initial empty PCM buffer?

I'm not sure you want to get rid of it. It's been a while since I've played with this, but I don't think the system will ever send you "real" audio that's actually a zeroed buffer, as there's always SOME amount of audio "noise" the system will pick up. I think it's reasonable to just ignore zeroed buffers, but you might also be able to use those buffers to trigger (or better, "prime") your own audio tone telling the user when they can speak.
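One way to implement the "ignore zeroed buffers" suggestion above is a quick silence check in the tap callback; this is a minimal sketch, not framework-provided API:

```swift
import AVFoundation

// Sketch: returns true when every sample in every channel of a float PCM
// buffer is exactly zero, i.e. the placeholder buffers described above.
// Real microphone audio always contains some noise, so any nonzero sample
// means capture has actually begun.
func isAllZero(_ buffer: AVAudioPCMBuffer) -> Bool {
    guard let channels = buffer.floatChannelData else { return false }
    let frames = Int(buffer.frameLength)
    for channel in 0..<Int(buffer.format.channelCount) {
        for frame in 0..<frames where channels[channel][frame] != 0 {
            return false
        }
    }
    return true
}
```

In the tap, a `guard !isAllZero(buffer) else { return }` before conversion would skip the placeholder buffers until real input arrives.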

  4. Would using separate engines for input and output improve responsiveness?

No, I would not expect that to matter. However, the audio system is sufficiently complex that I also wouldn't be surprised if there were a weird configuration where it did.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware


Thank you for the detailed reply. I've submitted a bug report as requested: FB19421676, "Push-to-Talk Framework: Microphone activation tone does not play when sending while audio session is active in full duplex mode."

Thanks to the context you provided regarding how the PTT framework functions, I was able to identify the cause of the transmission delay I was experiencing. It turns out that isVoiceProcessingInputMuted was set to true when starting a transmission, and only reverted to false once audio output stopped. This was the source of the delay between initiating transmission and receiving valid microphone input.

By manually setting isVoiceProcessingInputMuted to false on the input node at the start of transmission, I was able to eliminate this delay and begin receiving microphone samples immediately.
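For reference, the workaround described above can be sketched like this; the delegate method signature is from the PushToTalk framework, and `audioEngine` is assumed to be a property of the delegate:

```swift
import AVFoundation
import PushToTalk

// Sketch: when a transmission begins, explicitly clear the voice-processing
// input mute so the tap starts receiving valid microphone samples
// immediately, instead of waiting for playback to stop.
extension ChannelManagerDelegate {
    func channelManager(_ channelManager: PTChannelManager,
                        channelUUID: UUID,
                        didBeginTransmittingFrom source: PTChannelTransmitRequestSource) {
        let input = audioEngine.inputNode
        if input.isVoiceProcessingInputMuted {
            input.isVoiceProcessingInputMuted = false
        }
    }
}
```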

I'm still relatively new to Swift and iOS audio development, and I was wondering if there are any sample projects or best practices that demonstrate integrating audio with the Push-to-Talk framework. Having a reference implementation would help me avoid common pitfalls and improve how I manage audio routing and session state.

Thanks again for your help!

Accepted Answer

Thank you for the detailed reply. I've submitted a bug report as requested: FB19421676

Perfect, thank you.

It turns out that isVoiceProcessingInputMuted was set to true when starting a transmission, and only reverted to false once audio output stopped. This was the source of the delay between initiating transmission and receiving valid microphone input.

Good find!

I'm still relatively new to Swift and iOS audio development, and I was wondering if there are any sample projects or best practices that demonstrate integrating audio with the Push-to-Talk framework.

No, there isn't any direct sample for it. Practically speaking, the PushToTalk framework was actually created to support an existing set of developers who'd previously built PTT apps using the "voip" background category and CallKit, so that they could migrate away from the unrestricted PTT entitlement. That's why we didn't create a sample: most of the framework's adopters were integrating the framework into an existing large-scale project, where a sample isn't very useful.

best practices that demonstrate integrating audio with the Push-to-Talk framework.

I do have a few suggestions/recommendations.

First off, keep in mind that the PushToTalk framework is very closely related to CallKit, using the same infrastructure and audio architecture. The critical point for both APIs is that you should NOT use "setActive" to directly activate your audio session. While the call itself will "work" (which is why developers keep doing it...), it will also interfere with the framework’s normal operation, creating other issues.

Next, I'd strongly recommend creating dedicated test projects that "sort out" the details of how the different components of your app will work, instead of trying to build a monolithic app. PTT apps are inherently quite complex, and that complexity often gets in the way of understanding and debugging the specifics of what you're working on. As a simple example, I'd start by making a "Fake PTT" app that looped or faked audio instead of actually transmitting/receiving it. That would let you test the entire app life cycle and audio logic flow without ever bothering with a second test device.
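The "Fake PTT" loopback idea above could start as small as this; everything here (the class, node names) is a hypothetical test-app sketch, not framework API:

```swift
import AVFoundation

// Sketch: loop microphone buffers straight back into a player node
// instead of transmitting them over the network, so the full audio
// path can be exercised on a single device.
final class LoopbackTester {
    let audioEngine = AVAudioEngine()
    let loopbackPlayer = AVAudioPlayerNode()

    func startLoopback() throws {
        audioEngine.attach(loopbackPlayer)
        let input = audioEngine.inputNode
        let micFormat = input.inputFormat(forBus: 0)
        // Connect the player in the microphone's format so tap buffers
        // can be scheduled directly, without a converter.
        audioEngine.connect(loopbackPlayer, to: audioEngine.mainMixerNode, format: micFormat)
        input.installTap(onBus: 0, bufferSize: 1024, format: micFormat) { [weak self] buffer, _ in
            self?.loopbackPlayer.scheduleBuffer(buffer, completionHandler: nil)
        }
        try audioEngine.start()
        loopbackPlayer.play()
    }
}
```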

Critically, these tools are important because they help divide your app into more manageable components that you can understand and analyze independently. I've investigated many problems where the final issue boiled down to "the app simply does not work the way the developer believes it does," and small test apps help counter that by both encouraging isolation between components and making it easier to actually focus on a particular area.

Finally, they're a critical tool for testing and validating your own assumptions. As one example, I've had multiple PTT developers claim that there is some critical problem with the PTT lifecycle. It's trivial to prove that's not the case: I have a small and ugly PTT test app* that implements the "full" PTT lifecycle in the absolute minimum possible code, which clearly shows the lifecycle itself works fine. I've seen many cases where a developer wasted several weeks trying to work around a "system bug" when they could have proved the system itself worked fine by writing a simple test app focused on that particular area.

*I'm happy to provide this code on request; however, it should not be considered "sample code" or even "good".

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thank you for the detailed guidance, it was very helpful. I have submitted a code-level support request asking for the PTT test app you mentioned. I appreciate your time and support!
