PTT Framework has compatibility issue with .voiceChat AVAudioSession mode

As I've mentioned before, our app uses the PTT Framework to record and send audio messages. In one of the modes supported by the app we use the WebRTC.org library for that purpose. Internally, the WebRTC.org library uses the Voice-Processing I/O unit (kAudioUnitSubType_VoiceProcessingIO subtype) to retrieve audio from the mic. According to https://developer.apple.com/documentation/avfaudio/avaudiosession/mode-swift.struct/voicechat, using the Voice-Processing I/O unit implicitly enables the .voiceChat AVAudioSession mode (i.e. it appears to be impossible to use the Voice-Processing I/O unit without .voiceChat mode).

And the problem is the following: when the user starts an outgoing PTT transmission, the PTT Framework plays an audio notification, but with .voiceChat mode enabled that sound plays distorted or doesn't play at all.

Questions:

  1. Is this a known issue?
  2. Is there any way to work around it?
Answered by DTS Engineer in 826597022

Let me start here:

And the problem is the following: when the user starts an outgoing PTT transmission, the PTT Framework plays an audio notification, but with .voiceChat mode enabled that sound plays distorted or doesn't play at all.

I don't think the voiceChat mode itself is the issue. The PTT Framework is directly derived from CallKit, particularly in terms of how it handles audio, and CallKit has no issue with this mode; our CallKit sample specifically uses it.

However, what IS a known issue is problems with integrating audio libraries that weren't specifically written with PTT/CallKit in mind:

As I've mentioned before, our app uses the PTT Framework to record and send audio messages. In one of the modes supported by the app we use the WebRTC.org library for that purpose. Internally, the WebRTC.org library uses the Voice-Processing I/O unit (kAudioUnitSubType_VoiceProcessingIO subtype) to retrieve audio from the mic.

The big issue here is that most audio libraries do their own session activation, and that pattern doesn't work for PTT/CallKit. Depending on exactly what the library does (and when), that can cause exactly the kinds of problems you're describing, when the audio session ends up misconfigured in a way the library isn't expecting. The solution here is basically "don't activate the audio session yourself". The PTT framework should handle all session activation, with the only exception being the interruption handler. See our CallKit sample for how this should work (again, CallKit and PTT handle audio in the same way).
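In practice that pattern looks roughly like the sketch below: the app configures the session but never activates it. The category/mode/options values are illustrative assumptions, not a prescribed configuration:

```swift
import AVFAudio

// Sketch: configure the session, but never activate it yourself — the
// PTT framework (like CallKit) owns activation and deactivation.
func configureSessionForPTT() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playAndRecord,
                            mode: .voiceChat,           // illustrative
                            options: [.allowBluetooth]) // illustrative
    // Deliberately NO session.setActive(true) here. The only place an app
    // should touch activation is its interruption handler.
}
```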

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

@DTS Engineer yes, I know about this restriction, and I've disabled AudioSession activation inside the WebRTC.org library (but I'll double-check to be 100% sure) and still see the issue. Also, as I mentioned before, in our code we have a path using AVAudioRecorder for the same purpose, and adding .voiceChat mode there gives an identical result. Everything is fine without it.

Also, to rule out hard-to-understand side effects from third-party libraries, I did a small experiment with AVAudioEngine:

let engine = AVAudioEngine()
defer { engine.stop() }
let inputNode = engine.inputNode // just to create node accessing mic
try inputNode.setVoiceProcessingEnabled(false) // if remove this, PTT start sonification will be distorted!!!
try engine.start()
try await Task.sleep(for: .milliseconds(150))

And as mentioned in the comment, if I remove try inputNode.setVoiceProcessingEnabled(false), the PTT start sound becomes distorted. I.e. it's something specific to voice processing itself (.voiceChat mode enables the same thing, I believe).

And I'm not talking about audio recording/playback by the app; I'm talking only about the sound played by the PTT Framework to indicate PTT start. That's why comparing with CallKit isn't meaningful, since CallKit doesn't play such a sound.

So, let me start by providing a bit of background context on why CallKit is relevant and helpful here:

And I'm not talking about audio recording/playback by the app; I'm talking only about the sound played by the PTT Framework to indicate PTT start. That's why comparing with CallKit isn't meaningful, since CallKit doesn't play such a sound.

Architecturally, the PTT framework is an extension of CallKit, not an independent API. That is, the PTT session your app manages is actually a modified variant of the same "calls" that callservicesd manages. This is particularly true for audio handling, as the PTT framework basically doesn't implement any of its "own" in-process audio handling, relying entirely on CallKit's implementation.

Now, that leads to here:

And as mentioned in the comment, if I remove try inputNode.setVoiceProcessingEnabled(false), the PTT start sound becomes distorted.

Throwing out some educated guesses, are you:

I ask because those two factors can cause exactly the issue you mentioned here:

Also, the way the PTT start sound is distorted sounds to me very similar to what I hear if I start some audio playback with the .playback category and switch to the .playAndRecord category while audio is still playing...

...for exactly the same reason. The issue with Bluetooth in particular is that Bluetooth has two different specifications for dealing with audio:

  1. Advanced Audio Distribution Profile (A2DP)-> This is playback only and is what speakers use when playing audio.

  2. Hands-Free Profile (HFP)-> This is bidirectional, allowing playback and recording.

The distinction here matters because A2DP has significantly higher fidelity than HFP and simply "sounds" better*.

*Note that this is for entirely practical reasons. HFP has roughly "half" the playback bandwidth of A2DP, since it has to divide its bandwidth between playback and recording. In addition, HFP uses different (and less efficient) audio compression codecs because its latency is MUCH lower (~30ms vs ~100ms+).

In any case, the net result is that the switch from A2DP to HFP caused by something like:

start some audio playback with the .playback category and switch to the .playAndRecord category while audio is still playing

...will unavoidably cause a quite noticeable* (and unpleasant) decline in audio quality. The same behavior occurs when enabling voice processing, for the same reason: voice processing enables input and output, which is effectively the same as playAndRecord.

*The change will occur on wired I/O as well, but the effect is most significant on Bluetooth.

With all that context, let me go back to here:

Is there any way to work around it?

So, the basic answer is that you need to either ensure the transition only occurs when no playback is happening (which isn't really possible) or ensure that the session "stays" in playAndRecord.

Being more specific, I have two answers:

  1. If you're using PTTransmissionMode.halfDuplex, stop and switch to PTTransmissionMode.fullDuplex. In hindsight, we probably shouldn't have bothered implementing halfDuplex, as it basically forces exactly this kind of awkward transition. More to the point, my experience has been that products which truly are half duplex often end up using fullDuplex anyway, because the mechanics of our halfDuplex implementation don't actually match up with their implementation. PTTransmissionMode.fullDuplex ends up just working "better", as they can use the additional flexibility to implement exactly the behavior they want*.

  2. If you're using PTTransmissionMode.fullDuplex, then (I think) leaving voiceChat enabled at all times will avoid the problem. If it doesn't, then you should take a closer look at exactly what and where your app does audio configuration. My guess is that you're doing "something" which is pushing you back to playback-only, setting up the distortion when you start recording again.

*Keep in mind that the system enabling transmission/recording does NOT mean your app actually has to "do" anything with the audio it receives from the system. For example, you can implement floor-"claiming" mechanics by requesting the floor when you receive the transmission request, then playing a sound (so the user knows it's happened) and only actually recording and sending audio once they have the floor.
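That floor-claiming flow can be sketched roughly as follows. The `requestBeginTransmitting(channelUUID:)` call is PushToTalk API; the floor-grant hook and the helper methods are placeholders for your own signaling layer:

```swift
import PushToTalk

// Rough sketch of floor claiming in full-duplex mode: the system enables
// transmission immediately, but the app only starts sending audio after
// its own backend grants the floor.
final class FloorCoordinator {
    let manager: PTChannelManager
    let channelUUID: UUID

    init(manager: PTChannelManager, channelUUID: UUID) {
        self.manager = manager
        self.channelUUID = channelUUID
    }

    // Called when the user presses the talk button.
    func userPressedTalk() {
        // Ask the system to begin transmitting right away...
        manager.requestBeginTransmitting(channelUUID: channelUUID)
        // ...but send nothing yet; wait for the floor grant from your backend.
    }

    // Called when your backend grants the floor (placeholder hook).
    func floorGranted() {
        playFloorGrantedTone()  // placeholder: cue the user audibly
        startSendingAudio()     // placeholder: start feeding mic audio to WebRTC
    }

    private func playFloorGrantedTone() { /* app-specific */ }
    private func startSendingAudio() { /* app-specific */ }
}
```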

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

@DTS Engineer I've implemented minimal app reproducing issue. Please look https://github.com/RSATom/PTTPlayground

I've tested it on an iPhone 14 Pro and was able to reproduce the issue on 100% of tries.

Perfect! Thank you for the test app! I missed it in my own testing this morning, but the PTT engineering lead found the issue after I passed your sample over to him.

So, a few different things here:

  1. Please file a bug on this and post the number back here. Your code is muting the recording warning, and that simply should not be possible under any circumstances.

  2. The direct issue in your code is the "setVoiceProcessingEnabled(false)". I'm not sure what your intention here was, but there's obviously a conflict between setting ".voiceChat" and then disabling voice processing. With "setVoiceProcessingEnabled(true)", the record warning tone always played.

private func startRecording() async {
    ...
    let engine = AVAudioEngine()
    defer { engine.stop() }
    let inputNode = engine.inputNode // just to create node accessing mic
    try inputNode.setVoiceProcessingEnabled(false)
    try engine.start()
    try await Task.sleep(for: .milliseconds(300))
    ...

  3. As a broader comment, I'm concerned about timing issues around how you're using "Task" to manipulate the audio system. My basic recommendation for managing the audio system is that you should configure everything as early as possible, and that all configuration MUST be done before you return from any method that starts the audio activation process (in this case, "didBeginTransmittingFrom" and "incomingPushResult"). Your use of "Task" inside "didBeginTransmittingFrom" means you're not holding to that contract: that is, you've "scheduled" the changes to occur, but you haven't actually made those changes.
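A sketch of that contract: the delegate method below (signature from PTChannelManagerDelegate) performs its configuration inline, before returning, instead of scheduling it in a Task. The category/mode values are illustrative:

```swift
import AVFAudio
import PushToTalk

// Inside your PTChannelManagerDelegate conformance: configure the session
// synchronously, before returning, rather than wrapping the work in a Task.
func channelManager(_ channelManager: PTChannelManager,
                    channelUUID: UUID,
                    didBeginTransmittingFrom source: PTChannelTransmitRequestSource) {
    do {
        // Configuration is complete before we return, so the framework
        // activates a session that is already in its final state.
        try AVAudioSession.sharedInstance()
            .setCategory(.playAndRecord, mode: .voiceChat) // illustrative values
    } catch {
        // Surface the failure; don't leave the session half-configured.
        print("Session configuration failed: \(error)")
    }
}
```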

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

@DTS Engineer I've reported bug FB16716531

The direct issue in your code is the "setVoiceProcessingEnabled(false)". I'm not sure what your intention here was, but there's obviously a conflict between setting ".voiceChat" and then disabling voice processing. With "setVoiceProcessingEnabled(true)", the record warning tone always played.

Actually, the production app doesn't use AVAudioEngine at all. I used it just to create the minimal possible repro case. But with AVAudioRecorder I see the same issue. I've added the option to enable AVAudioRecorder to the demo app, please check. The app also has a very similar issue with the Voice-Processing I/O unit, but I didn't add that to the demo since it would require additional effort and overcomplicate the demo app.

As a broader comment, I'm concerned about timing issues around how you're using "Task" to manipulate the audio system.

I understand your point, but as usual the problem here is multithreading. The PTT Framework uses some worker thread to send notifications to PTChannelManagerDelegate, so some thread synchronization is required there. The obvious options are:

  1. Task
  2. DispatchQueue.sync()
  3. Mutex
  4. DispatchSemaphore

and the problem is that all except 1. will block the calling thread, and I don't think that's good. Also, it's not possible to just use hardcoded AVAudioSession options, since they can differ depending on whether AVAudioRecorder or the Voice-Processing I/O unit (WebRTC) is being used.

Actually, the production app doesn't use AVAudioEngine at all. I used it just to create the minimal possible repro case. But with AVAudioRecorder I see the same issue.

Yes. Expanding on something I said earlier, my longstanding view (originally based on experience with CallKit and now applied to the PTT framework) is that VOIP/PTT apps can't really be built on our high-level audio APIs (like AVAudioRecorder). There have been specific reasons for that (notably, the audio session activation issues), but the more fundamental issue is the interaction between these factors:

  1. The audio session configuration used by voip/PTT apps is "weird". At a surface level, it looks like a non-mixable playAndRecord session, but that is NOT what it actually is. As the most obvious example, it has a higher activation priority than ANY other session on the system, making it impossible to interrupt*.

  2. By its nature, the phone audio session is not widely used or broadly tested**. The phone session is really only used by voip/PTT and a TINY number of very specific corner cases (for example, I believe critical alert playback may use it). It's very well tested within that use case, but that's a VERY narrow testing "band" compared to any other audio configuration.

  3. The audio configuration system is a massively complicated and multilayered collection of APIs which is not internally consistent. You're seeing an example of that right here: what should it actually "mean" for a session to be configured as "voiceChat" AND "setVoiceProcessingEnabled(false)"? There's clearly overlap between those two settings, but neither setting has a specific enough definition that you could actually predict what will happen from our documentation.

  4. Those factors together mean that it's very likely there are some number of audio configurations that "don't work right", with exactly the issue you're seeing being a good example of what that looks like. It's not that the API "fails"; it's that you see weird behavior without any obvious explanation.

  5. By their nature, using our high-level APIs means you both don't know how the audio system is ACTUALLY configured (not in detail) and have very limited ability to modify that configuration. More to the point, whatever works "today" could easily break "tomorrow" when the details of the API's configuration change again.

*Strictly speaking, CallKit apps don't "interrupt" each other in the same way other session types do. Instead, callservicesd acts as the arbitration point between calling apps, switching between active calls by deactivating one app's audio session as it activates the other.

**As a specific note on testing and AVAudioRecorder in particular, keep in mind that the VAST majority of voip apps are doing "real-time" communications, which completely rules out using a file-based API like AVAudioRecorder.

All of those factors together are what create the kind of issue you're seeing here. It's absolutely a bug that this configuration is able to mute the recording warning, but our fixing it doesn't change the fact that there are other issues you've overlooked, or prevent us from creating new issues in the future. That leads to here:

I've added the option to enable AVAudioRecorder to the demo app, please check.

No, there isn't really any point in me testing with AVAudioRecorder. For all the reasons I've outlined above, I don't expect it to work correctly and am not surprised that it's failing. For ANY issue you find using AVAudioRecorder with CallKit/PTT, the only answer I'll ever be able to give is:

  • That looks like it might be bug, please file a bug report on it.

  • I can't promise the "bug" will ever be fixed. Bugs like this happen because the API has "picked" a configuration that doesn't work for you, but that means any fix requires either changing the configuration (which is VERY likely to break someone else) or adding "more" to the API so that you can control the configuration (complicating an API that's intended to be simple).

  • If a fix does occur, it's very likely to be part of a major system release ("18.0") NOT a system update ("18.2"), since changes to APIs like this carry a high risk of disrupting existing apps.

...all of which means that the only reliable solution is to move to a lower-level API and modify its configuration until you find a configuration that does what you need.
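As a sketch of what "moving to a lower-level API" might look like, the Voice-Processing I/O unit can be instantiated directly via AUAudioUnit, where every configuration knob is explicit rather than implied by a high-level wrapper. This is an illustrative sketch, not a recommended production configuration:

```swift
import AVFAudio
import AudioToolbox

// Sketch: create the Voice-Processing I/O unit directly, so the app
// controls (and can inspect) its configuration explicitly.
func makeVoiceProcessingUnit() throws -> AUAudioUnit {
    let desc = AudioComponentDescription(
        componentType: kAudioUnitType_Output,
        componentSubType: kAudioUnitSubType_VoiceProcessingIO,
        componentManufacturer: kAudioUnitManufacturer_Apple,
        componentFlags: 0,
        componentFlagsMask: 0)
    let unit = try AUAudioUnit(componentDescription: desc)
    unit.isInputEnabled = true   // record path
    unit.isOutputEnabled = true  // playback path
    return unit
}
```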

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

I understand your point, but as usual the problem here is multithreading. The PTT Framework uses some worker thread to send notifications to PTChannelManagerDelegate, so some thread synchronization is required there.

So, there are a few different issues here:

  1. AVAudioSession does not care what thread it's called on.

  2. The problem with having multiple threads calling into it simultaneously ISN'T that AVAudioSession will "care" (I don't think it will); it's that your session no longer has a coherent configuration, since you'll end up "mixing" the configuration of both threads.

  3. Most apps have natural "bottlenecks" where session configuration occurs, so the simplest solution is to simply restrict the configuration to those bottlenecks and allow them to provide thread coherence. Case in point: in your code, PTChannelManagerDelegate is where session configuration occurs, and you could (and probably should) simply do the configuration calls there.

the problem is that all except 1. will block the calling thread, and I don't think that's good

One of the problems with our more "modern" programming paradigms is a tendency to rely on basic "rules" like "blocking a thread is bad" without considering the details of a specific situation and whether or not those rules really apply. Looking at the implementation, Task is in fact the WORST option on that list. More specifically:

  • It's by far the least efficient, as it triggers a thread context switch to perform a TINY amount of work. One of the traps GCD and now Task create is that they can make it very easy to write WILDLY inefficient code which you THINK is actually fast. See this forum post for a detailed example of that dynamic.

  • You're pushing work onto the main thread, which means you've now introduced unpredictable latency around when the configuration will occur. That's the starting point for many "why does this weird thing happen randomly" bugs.

Note that the second issue is a much bigger concern than the first. When it comes to threading, I'll pick "slow and predictable" over "fast and unpredictable" every time. Unfortunately, using Task here gives you "slow and unpredictable", the worst option of all.

In terms of better options:

  • As written, you don't actually need any threading protection for your session configuration. All your configuration code is in PTChannelManagerDelegate, so that's a natural bottleneck (#3) which provides thread safety.

A real app could obviously be more complex, in which case the best option is virtually certain to be "use a lock of some flavor". More specifically, one of two things is occurring here:

  1. This audio configuration occurs relatively infrequently, at well-defined "real-world" moments (like the PTT transmission delegate or when the user pushes a button), which means the corresponding state changes are also infrequent.

  2. This configuration code IS occurring frequently enough that lock contention is possible.

In the first case, lock collisions will be extremely rare, and the fastest possible thread-safety primitive is an uncontested lock. In the second case, something is deeply weird and broken in your app and you should fix it, as audio state simply isn't something that can or should change all that often.

Note that this means picking the "right" locking primitive is largely beside the point. Strictly speaking, OSAllocatedUnfairLock is probably the fastest locking primitive available; however, the performance of a lock that's used infrequently and rarely experiences contention doesn't really matter.
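For illustration, a state-guarding lock along those lines might look like the sketch below (OSAllocatedUnfairLock requires iOS 16+; the AudioConfig type is a placeholder for whatever state your app actually tracks):

```swift
import os

// Placeholder for the app's audio-configuration state.
struct AudioConfig {
    var usesWebRTC = false
    var voiceProcessing = true
}

// Sketch: an uncontested unfair lock guarding the configuration state.
final class AudioState {
    private let lock = OSAllocatedUnfairLock(initialState: AudioConfig())

    // Mutate the state under the lock.
    func update(_ change: @Sendable (inout AudioConfig) -> Void) {
        lock.withLock { config in change(&config) }
    }

    // Read a consistent copy of the state.
    func snapshot() -> AudioConfig {
        lock.withLock { $0 }
    }
}
```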

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

@DTS Engineer Thank you for that long explanation. It's really useful.

Expanding on something I said earlier, my longstanding view (originally based on experience with CallKit and now applied to the PTT framework) is that VOIP/PTT apps can't really be built on our high-level audio APIs (like AVAudioRecorder).

That makes sense. But what framework would you recommend for audio recording with the PTT Framework? As I said before, I see the issue with the Voice-Processing I/O unit, and that's pretty low-level, I believe. Of course it could be related to the incorrect audio session management you described above, and I'll check that, but anyway, what would you recommend for that task?
