PTT Framework has compatibility issue with .voiceChat AVAudioSession mode

As I've mentioned before, our app uses the PTT Framework to record and send audio messages. In one of the modes supported by the app we use the WebRTC.org library for that purpose. Internally, the WebRTC.org library uses the Voice-Processing I/O unit (kAudioUnitSubType_VoiceProcessingIO subtype) to retrieve audio from the mic. According to https://developer.apple.com/documentation/avfaudio/avaudiosession/mode-swift.struct/voicechat, using the Voice-Processing I/O unit implicitly enables the .voiceChat AVAudioSession mode (i.e. it appears to be impossible to use the Voice-Processing I/O unit without .voiceChat mode).

And the problem is the following: when the user starts an outgoing PTT transmission, the PTT Framework plays an audio notification, but with .voiceChat mode enabled that sound plays distorted or doesn't play at all.

Questions:

  1. Is this a known issue?
  2. Is there any way to work around it?
Answered by DTS Engineer in 826597022

Let me start here:

And the problem is the following: when the user starts an outgoing PTT transmission, the PTT Framework plays an audio notification, but with .voiceChat mode enabled that sound plays distorted or doesn't play at all.

I don't think the voiceChat mode itself is the issue. The PTT Framework is directly derived from CallKit, particularly in terms of how it handles audio, and CallKit has no issue with this mode; our CallKit sample specifically uses it.

However, what IS a known issue is problems with integrating audio libraries that weren't specifically written with PTT/CallKit in mind:

As I've mentioned before, our app uses the PTT Framework to record and send audio messages. In one of the modes supported by the app we use the WebRTC.org library for that purpose. Internally, the WebRTC.org library uses the Voice-Processing I/O unit (kAudioUnitSubType_VoiceProcessingIO subtype) to retrieve audio from the mic.

The big issue here is that most audio libraries do their own session activation, and that pattern doesn't work for PTT/CallKit. Depending on exactly what the library does (and when), that can cause exactly the kinds of problems you're describing, when the audio session ends up misconfigured in a way the library isn't expecting. The solution here is basically "don't activate the audio session yourself". The PTT framework should handle all session activation, with the only exception being the interruption handler. See our CallKit sample for how this should work (again, CallKit and PTT handle audio in the same way).
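In practice that pattern looks roughly like the sketch below: the app configures the session but never activates it. The category/mode/options values are illustrative assumptions, not a prescribed configuration:

```swift
import AVFAudio

// Sketch: configure the session, but never activate it yourself — the
// PTT framework (like CallKit) owns activation and deactivation.
func configureSessionForPTT() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playAndRecord,
                            mode: .voiceChat,           // illustrative
                            options: [.allowBluetooth]) // illustrative
    // Deliberately NO session.setActive(true) here. The only place an app
    // should touch activation is its interruption handler.
}
```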

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

@DTS Engineer yes, I know about this restriction, and I've disabled AudioSession activation inside the WebRTC.org library (but I'll double-check to be 100% sure) and still see the issue. Also, as I mentioned before, in our code we have a path using AVAudioRecorder for the same purpose, and adding .voiceChat mode there gives an identical result. Everything is fine without it.

Also, to rule out hard-to-understand side effects from third-party libraries, I did a small experiment with AVAudioEngine:

let engine = AVAudioEngine()
defer { engine.stop() }
let inputNode = engine.inputNode // just to create node accessing mic
try inputNode.setVoiceProcessingEnabled(false) // if remove this, PTT start sonification will be distorted!!!
try engine.start()
try await Task.sleep(for: .milliseconds(150))

And as mentioned in the comment, if I remove try inputNode.setVoiceProcessingEnabled(false), the PTT start sound becomes distorted. I.e. it's something specific to voice processing itself (.voiceChat mode enables the same thing, I believe).

And I'm not talking about audio recording/playback by the app; I'm talking only about the sound played by the PTT Framework to indicate PTT start. That's why comparing with CallKit isn't meaningful, since CallKit doesn't play such a sound.

So, let me start by providing a bit of background context on why CallKit is relevant and helpful here:

And I'm not talking about audio recording/playback by the app; I'm talking only about the sound played by the PTT Framework to indicate PTT start. That's why comparing with CallKit isn't meaningful, since CallKit doesn't play such a sound.

Architecturally, the PTT framework is an extension of CallKit, not an independent API. That is, the PTT session your app manages is actually a modified variant of the same "calls" that callservicesd manages. This is particularly true for audio handling, as the PTT framework basically doesn't implement any of its "own" in-process audio handling, relying entirely on CallKit's implementation.

Now, that leads to here:

And as mentioned in the comment, if I remove try inputNode.setVoiceProcessingEnabled(false), the PTT start sound becomes distorted.

Throwing out some educated guesses, are you:

I ask because those two factors can cause exactly the issue you mentioned here:

Also, the way the PTT start sound is distorted sounds to me very similar to what I hear if I start some audio playback with the .playback category and switch to the .playAndRecord category while audio is still playing...

...for exactly the same reason. The issue with Bluetooth in particular is that Bluetooth has two different specifications for dealing with audio:

  1. Advanced Audio Distribution Profile (A2DP)-> This is playback only and is what speakers use when playing audio.

  2. Hands-Free Profile (HFP)-> This is bidirectional, allowing playback and recording.

The distinction here matters because A2DP has significantly higher fidelity than HFP and simply "sounds" better*.

*Note that this is for entirely practical reasons. HFP has roughly "half" the playback bandwidth of A2DP, since it has to divide its bandwidth between playback and recording. In addition, HFP uses different (and less efficient) audio compression codecs because its latency is MUCH lower (~30ms vs ~100ms+).

In any case, the net result is that the switch from A2DP to HFP caused by something like:

start some audio playback with the .playback category and switch to the .playAndRecord category while audio is still playing

...will unavoidably cause a quite noticeable* (and unpleasant) decline in audio quality. The same behavior occurs when enabling voice processing, for the same reason: voice processing enables input and output, which is effectively the same as playAndRecord.

*The change will occur on wired I/O as well, but the effect is most significant on Bluetooth.

With all that context, let me go back to here:

Is there any way to work around it?

So, the basic answer is that you need to either ensure the transition only occurs when no playback is happening (which isn't really possible) or ensure that the session "stays" in playAndRecord.

Being more specific, I have two answers:

  1. If you're using PTTransmissionMode.halfDuplex, stop and switch to PTTransmissionMode.fullDuplex. In hindsight, we probably shouldn't have bothered implementing halfDuplex, as it basically forces exactly this kind of awkward transition. More to the point, my experience has been that products which truly are half duplex often end up using fullDuplex anyway, because the mechanics of our halfDuplex implementation don't actually match up with their implementation. PTTransmissionMode.fullDuplex ends up just working "better", as they can use the additional flexibility to implement exactly the behavior they want*.

  2. If you're using PTTransmissionMode.fullDuplex, then (I think) leaving voiceChat enabled at all times will avoid the problem. If it doesn't, then you should take a closer look at exactly what and where your app does audio configuration. My guess is that you're doing "something" which is pushing you back to playback-only, setting up the distortion when you start recording again.

*Keep in mind that the system enabling transmission/recording does NOT mean your app actually has to "do" anything with the audio it receives from the system. For example, you can implement floor-"claiming" mechanics by requesting the floor when you receive the transmission request, then playing a sound (so the user knows it's happened) and only actually recording and sending audio once they have the floor.
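That floor-claiming flow can be sketched roughly as follows. The `requestBeginTransmitting(channelUUID:)` call is PushToTalk API; the floor-grant hook and the helper methods are placeholders for your own signaling layer:

```swift
import PushToTalk

// Rough sketch of floor claiming in full-duplex mode: the system enables
// transmission immediately, but the app only starts sending audio after
// its own backend grants the floor.
final class FloorCoordinator {
    let manager: PTChannelManager
    let channelUUID: UUID

    init(manager: PTChannelManager, channelUUID: UUID) {
        self.manager = manager
        self.channelUUID = channelUUID
    }

    // Called when the user presses the talk button.
    func userPressedTalk() {
        // Ask the system to begin transmitting right away...
        manager.requestBeginTransmitting(channelUUID: channelUUID)
        // ...but send nothing yet; wait for the floor grant from your backend.
    }

    // Called when your backend grants the floor (placeholder hook).
    func floorGranted() {
        playFloorGrantedTone()  // placeholder: cue the user audibly
        startSendingAudio()     // placeholder: start feeding mic audio to WebRTC
    }

    private func playFloorGrantedTone() { /* app-specific */ }
    private func startSendingAudio() { /* app-specific */ }
}
```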

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

@DTS Engineer I've implemented minimal app reproducing issue. Please look https://github.com/RSATom/PTTPlayground

I've tested it on an iPhone 14 Pro and was able to reproduce the issue on 100% of tries.

Perfect! Thank you for the test app! I missed it in my own testing this morning, but the PTT engineering lead found the issue after I passed your sample over to him.

So, a few different things here:

  1. Please file a bug on this and post the number back here. Your code is muting the recording warning, and that simply should not be possible under any circumstances.

  2. The direct issue in your code is the "setVoiceProcessingEnabled(false)". I'm not sure what your intention here was, but there's obviously a conflict between setting ".voiceChat" and then disabling voice processing. With "setVoiceProcessingEnabled(true)", the record warning tone always played.

private func startRecording() async {
    ...
    let engine = AVAudioEngine()
    defer { engine.stop() }
    let inputNode = engine.inputNode // just to create node accessing mic
    try inputNode.setVoiceProcessingEnabled(false)
    try engine.start()
    try await Task.sleep(for: .milliseconds(300))
    ...

  3. As a broader comment, I'm concerned about timing issues around how you're using "Task" to manipulate the audio system. My basic recommendation for managing the audio system is that you should configure everything as early as possible, and that all configuration MUST be done before you return from any method that starts the audio activation process (in this case, "didBeginTransmittingFrom" and "incomingPushResult"). Your use of "Task" inside "didBeginTransmittingFrom" means you're not holding to that contract: that is, you've "scheduled" the changes to occur, but you haven't actually made those changes.
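A sketch of that contract: the delegate method below (signature from PTChannelManagerDelegate) performs its configuration inline, before returning, instead of scheduling it in a Task. The category/mode values are illustrative:

```swift
import AVFAudio
import PushToTalk

// Inside your PTChannelManagerDelegate conformance: configure the session
// synchronously, before returning, rather than wrapping the work in a Task.
func channelManager(_ channelManager: PTChannelManager,
                    channelUUID: UUID,
                    didBeginTransmittingFrom source: PTChannelTransmitRequestSource) {
    do {
        // Configuration is complete before we return, so the framework
        // activates a session that is already in its final state.
        try AVAudioSession.sharedInstance()
            .setCategory(.playAndRecord, mode: .voiceChat) // illustrative values
    } catch {
        // Surface the failure; don't leave the session half-configured.
        print("Session configuration failed: \(error)")
    }
}
```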

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

@DTS Engineer I've reported bug FB16716531

The direct issue in your code is the "setVoiceProcessingEnabled(false)". I'm not sure what your intention here was, but there's obviously a conflict between setting ".voiceChat" and then disabling voice processing. With "setVoiceProcessingEnabled(true)", the record warning tone always played.

Actually, the production app doesn't use AVAudioEngine at all. I used it just to create the minimal possible repro case. But with AVAudioRecorder I see the same issue. I've added the option to enable AVAudioRecorder to the demo app, please check. The app also has a very similar issue with the Voice-Processing I/O unit, but I didn't add that to the demo since it would require additional effort and overcomplicate the demo app.

As a broader comment, I'm concerned about timing issues around how you're using "Task" to manipulate the audio system.

I understand your point, but as usual the problem here is multithreading. The PTT Framework uses some worker thread to send notifications to PTChannelManagerDelegate, so some thread synchronization is required there. The obvious options are:

  1. Task
  2. DispatchQueue.sync()
  3. Mutex
  4. DispatchSemaphore

and the problem is that all except 1. will block the calling thread, and I don't think that's good. Also, it's not possible to just use hardcoded AVAudioSession options, since they can differ depending on whether AVAudioRecorder or the Voice-Processing I/O unit (WebRTC) is being used.

Actually, the production app doesn't use AVAudioEngine at all. I used it just to create the minimal possible repro case. But with AVAudioRecorder I see the same issue.

Yes. Expanding on something I said earlier, my longstanding view (originally based on experience with CallKit and now applied to the PTT framework) is that VOIP/PTT apps can't really be built on our high-level audio APIs (like AVAudioRecorder). There have been specific reasons for that (notably, the audio session activation issues), but the more fundamental issue is the interaction between these factors:

  1. The audio session configuration used by voip/PTT apps is "weird". At a surface level, it looks like a non-mixable playAndRecord session, but that is NOT what it actually is. As the most obvious example, it has a higher activation priority than ANY other session on the system, making it impossible to interrupt*.

  2. By its nature, the phone audio session is not widely used or broadly tested**. The phone session is really only used by voip/PTT and a TINY number of very specific corner cases (for example, I believe critical alert playback may use it). It's very well tested within that use case, but that's a VERY narrow testing "band" compared to any other audio configuration.

  3. The audio configuration system is a massively complicated and multilayered collection of APIs which is not internally consistent. You're seeing an example of that right here: what should it actually "mean" for a session to be configured as "voiceChat" AND "setVoiceProcessingEnabled(false)"? There's clearly overlap between those two settings, but neither setting has a specific enough definition that you could actually predict what will happen from our documentation.

  4. Those factors together mean that it's very likely there are some number of audio configurations that "don't work right", with exactly the issue you're seeing being a good example of what that looks like. It's not that the API "fails"; it's that you see weird behavior without any obvious explanation.

  5. By their nature, using our high-level APIs means you both don't know how the audio system is ACTUALLY configured (not in detail) and have very limited ability to modify that configuration. More to the point, whatever works "today" could easily break "tomorrow" when the details of the API's configuration change again.

*Strictly speaking, CallKit apps don't "interrupt" each other in the same way other session types do. Instead, callservicesd acts as the arbitration point between calling apps, switching between active calls by deactivating one app's audio session as it activates the other.

**As a specific note on testing and AVAudioRecorder in particular, keep in mind that the VAST majority of voip apps are doing "real-time" communications, which completely rules out using a file-based API like AVAudioRecorder.

All of those factors together are what create the kind of issue you're seeing here. It's absolutely a bug that this configuration is able to mute the recording warning, but our fixing it doesn't change the fact that there are other issues you've overlooked, or prevent us from creating new issues in the future. That leads to here:

I've added the option to enable AVAudioRecorder to the demo app, please check.

No, there isn't really any point in me testing with AVAudioRecorder. For all the reasons I've outlined above, I don't expect it to work correctly and am not surprised that it's failing. For ANY issue you find using AVAudioRecorder with CallKit/PTT, the only answer I'll ever be able to give is:

  • That looks like it might be bug, please file a bug report on it.

  • I can't promise the "bug" will ever be fixed. Bugs like this happen because the API has "picked" a configuration that doesn't work for you, but that means any fix requires either changing the configuration (which is VERY likely to break someone else) or adding "more" to the API so that you can control the configuration (complicating an API that's intended to be simple).

  • If a fix does occur, it's very likely to be part of a major system release ("18.0") NOT a system update ("18.2"), since changes to APIs like this carry a high risk of disrupting existing apps.

...all of which means that the only reliable solution is to move to a lower-level API and modify its configuration until you find a configuration that does what you need.
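As a sketch of what "moving to a lower-level API" might look like, the Voice-Processing I/O unit can be instantiated directly via AUAudioUnit, where every configuration knob is explicit rather than implied by a high-level wrapper. This is an illustrative sketch, not a recommended production configuration:

```swift
import AVFAudio
import AudioToolbox

// Sketch: create the Voice-Processing I/O unit directly, so the app
// controls (and can inspect) its configuration explicitly.
func makeVoiceProcessingUnit() throws -> AUAudioUnit {
    let desc = AudioComponentDescription(
        componentType: kAudioUnitType_Output,
        componentSubType: kAudioUnitSubType_VoiceProcessingIO,
        componentManufacturer: kAudioUnitManufacturer_Apple,
        componentFlags: 0,
        componentFlagsMask: 0)
    let unit = try AUAudioUnit(componentDescription: desc)
    unit.isInputEnabled = true   // record path
    unit.isOutputEnabled = true  // playback path
    return unit
}
```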

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

I understand your point, but as usual the problem here is multithreading. The PTT Framework uses some worker thread to send notifications to PTChannelManagerDelegate, so some thread synchronization is required there.

So, there are a few different issues here:

  1. AVAudioSession does not care what thread it's called on.

  2. The problem with having multiple threads calling into it simultaneously ISN'T that AVAudioSession will "care" (I don't think it will); it's that your session no longer has a coherent configuration, since you'll end up "mixing" the configuration of both threads.

  3. Most apps have natural "bottlenecks" where session configuration occurs, so the simplest solution is to simply restrict the configuration to those bottlenecks and allow them to provide thread coherence. Case in point: in your code, PTChannelManagerDelegate is where session configuration occurs, and you could (and probably should) simply do the configuration calls there.

the problem is that all except 1. will block the calling thread, and I don't think that's good

One of the problems with our more "modern" programming paradigms is a tendency to rely on basic "rules" like "blocking a thread is bad" without considering the details of a specific situation and whether or not those rules really apply. Looking at the implementation, Task is in fact the WORST option on that list. More specifically:

  • It's by far the least efficient, as it triggers a thread context switch to perform a TINY amount of work. One of the traps GCD and now Task create is that they can make it very easy to write WILDLY inefficient code which you THINK is actually fast. See this forum post for a detailed example of that dynamic.

  • You're pushing work onto the main thread, which means you've now introduced unpredictable latency around when the configuration will occur. That's the starting point for many "why does this weird thing happen randomly" bugs.

Note that the second issue is a much bigger concern than the first. When it comes to threading, I'll pick "slow and predictable" over "fast and unpredictable" every time. Unfortunately, using Task here gives you "slow and unpredictable", the worst option of all.

In terms of better options:

  • As written, you don't actually need any threading protection for your session configuration. All your configuration code is in PTChannelManagerDelegate, so that's a natural bottleneck (#3) which provides thread safety.

A real app could obviously be more complex, in which case the best option is virtually certain to be "use a lock of some flavor". More specifically, one of two things is occurring here:

  1. This audio configuration occurs relatively infrequently, at well-defined "real-world" moments (like the PTT transmission delegate or when the user pushes a button), which means the corresponding state changes are also infrequent.

  2. This configuration code IS occurring frequently enough that lock contention is possible.

In the first case, lock collisions will be extremely rare, and the fastest possible thread-safety primitive is an uncontested lock. In the second case, something is deeply weird and broken in your app and you should fix it, as audio state simply isn't something that can or should change all that often.

Note that this means picking the "right" locking primitive is largely beside the point. Strictly speaking, OSAllocatedUnfairLock is probably the fastest locking primitive available; however, the performance of a lock that's used infrequently and rarely experiences contention doesn't really matter.
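For illustration, a state-guarding lock along those lines might look like the sketch below (OSAllocatedUnfairLock requires iOS 16+; the AudioConfig type is a placeholder for whatever state your app actually tracks):

```swift
import os

// Placeholder for the app's audio-configuration state.
struct AudioConfig {
    var usesWebRTC = false
    var voiceProcessing = true
}

// Sketch: an uncontested unfair lock guarding the configuration state.
final class AudioState {
    private let lock = OSAllocatedUnfairLock(initialState: AudioConfig())

    // Mutate the state under the lock.
    func update(_ change: @Sendable (inout AudioConfig) -> Void) {
        lock.withLock { config in change(&config) }
    }

    // Read a consistent copy of the state.
    func snapshot() -> AudioConfig {
        lock.withLock { $0 }
    }
}
```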

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

@DTS Engineer Thank you for that long explanation. It's really useful.

Expanding on something I said earlier, my longstanding view (originally based on experience with CallKit and now applied to the PTT framework) is that VOIP/PTT apps can't really be built on our high-level audio APIs (like AVAudioRecorder).

That makes sense. But what framework would you recommend for audio recording with the PTT Framework? As I said before, I see the issue with the Voice-Processing I/O unit, and that's pretty low-level, I believe. Of course it could be related to the incorrect audio session management you described above, and I'll check that, but anyway, what would you recommend for that task?
