Concurrency Crash - PushToTalk Framework

As part of integrating Apple's PushToTalk framework, we create the PTChannelManager using its async initializer from AppDidFinishLaunching, and we use an actor to ensure the PTChannelManager is only created once.

Since adding this, our analytics dashboards have shown a lot of crashes for users, happening roughly 2 seconds after app launch, around a task dealloc.

Here is a simplified version of our actor and manager, where the manager shows only its init. The init is async and failable because creating the PTChannelManager uses an async throwing call.

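import Combine
import PushToTalk

actor PushToTalkDeviceContainer {
    // Caches the in-flight (or completed) creation task so the manager is built at most once.
    private var internalPushToTalkManagerTask: Task<PushToTalkManager?, Never>?

    func pushToTalkManager() async -> PushToTalkManager? {
        #if !os(visionOS)
        // Reuse the existing task if creation has already started.
        if let internalPushToTalkManagerTask {
            return await internalPushToTalkManagerTask.value
        }

        let internalPushToTalkManagerTask = Task<PushToTalkManager?, Never> {
            return await PushToTalkManagerImp()
        }

        // Stored before the first suspension point, so reentrant callers see the task.
        self.internalPushToTalkManagerTask = internalPushToTalkManagerTask
        return await internalPushToTalkManagerTask.value
        #else
        return nil
        #endif
    }
}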


public class PushToTalkManagerImp: PushToTalkManager {
    public let onPushToTalkDelegationEvent: AnyPublisher<PushToTalkDelegationEvent, Never>
    public let onPushToTalkAudioSessionChange: AnyPublisher<PushToTalkManagerAudioSessionChange, Never>
    public let onChannelRestoration: AnyPublisher<UUID, Never>
    
    private let ptChannelManager: PTChannelManager
    private let restorationDelegate: PushToTalkRestorationDelegate
    private let delegate: PushToTalkDelegate
    
    init?() async {
        self.delegate = PushToTalkDelegate()
        self.restorationDelegate = PushToTalkRestorationDelegate()
        self.onPushToTalkDelegationEvent = delegate.pushToTalkDelegationSubject.eraseToAnyPublisher()
        self.onPushToTalkAudioSessionChange = delegate.audioSessionSubject.eraseToAnyPublisher()
        self.onChannelRestoration = restorationDelegate.restorationDelegateSubject.eraseToAnyPublisher()
        
        do {
            ptChannelManager = try await PTChannelManager.channelManager(delegate: delegate, restorationDelegate: restorationDelegate)
        } catch {
            return nil
        }
    }
}

The crash stack trace is as follows:

0   libsystem_kernel.dylib        	0x00000001e903342c __pthread_kill + 8 (:-1)
1   libsystem_pthread.dylib       	0x00000001fcdd2c0c pthread_kill + 268 (pthread.c:1721)
2   libsystem_c.dylib             	0x00000001a7ed6c34 __abort + 136 (abort.c:159)
3   libsystem_c.dylib             	0x00000001a7ed6bac abort + 192 (abort.c:126)
4   libswift_Concurrency.dylib    	0x00000001ab2bf7c8 swift::swift_Concurrency_fatalErrorv(unsigned int, char const*, char*) + 32 (Error.cpp:25)
5   libswift_Concurrency.dylib    	0x00000001ab2bf7e8 swift::swift_Concurrency_fatalError(unsigned int, char const*, ...) + 32 (Error.cpp:35)
6   libswift_Concurrency.dylib    	0x00000001ab2c39a8 swift_task_dealloc + 128 (TaskAlloc.cpp:59)
7   MyApp                 	0x0000000104908e04 PushToTalkManagerImp.__allocating_init() + 40 (PushToTalkManager.swift:0)
8   MyApp                 	0x0000000104908e04 closure #1 in PushToTalkDeviceContainer.pushToTalkManager() + 60
9   MyApp                 	0x00000001041882e9 specialized thunk for @escaping @callee_guaranteed @Sendable @async () -> (@out A) + 1 (<compiler-generated>:0)
10  MyApp                  	0x0000000103a652bd partial apply for specialized thunk for @escaping @callee_guaranteed @Sendable @async () -> (@out A) + 1 (<compiler-generated>:0)
11  libswift_Concurrency.dylib    	0x00000001ab2c2775 completeTaskWithClosure(swift::AsyncContext*, swift::SwiftError*) + 1 (Task.cpp:463)
Answered by DTS Engineer in 792748022

I have a few different answers here:

-In terms of the specific crash, I'm not entirely sure what EXACTLY is failing. I wasn't able to trace the language log message back to its origin, and those log messages aren't always that helpful. I'd need to see a full crash log to take a closer look at this.

-Have you tested this code on other failure points, notably macOS?

You specifically excluded visionOS, but PTChannelManager has a few other failure points, with macOS being the most likely to occur (the others are configuration issues, which would always fail, and running in the simulator).

As a side note here, I'm not sure the actor architecture is doing very much to help you. The PTT delegate methods are going to be called on a private dispatch queue, and I'd expect your app to be structured such that "PushToTalkManager" is only accessed in a controlled/limited way.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

So the actor is being used to allow lazy initialization of the PushToTalk framework while ensuring that it is only created once for the entire app.

For example, if two places in the app needed to use the PushToTalk framework, we would want to ensure that the PTChannelManager was created only once, but lazily (so not until either area requested it).

Normally we would use some sort of lock to ensure that the lazy variable was only created once. However, since the PTChannelManager init is async, execution can resume on a different thread after the await, so we could not guarantee that the lock is locked and unlocked from the same thread. This is why we introduced the actor to handle this.
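
To illustrate the reasoning above, here is a minimal, self-contained sketch of the "actor caches a Task" pattern. It does not use the PushToTalk framework: FakeManager, CreationCounter, and LazyManagerBox are hypothetical stand-ins, and the sleep only simulates a slow async creation. It is not a reproduction of the crash, only a sketch of why concurrent callers should share a single creation.

```swift
// Hypothetical stand-in for a value that is expensive to create asynchronously.
final class FakeManager: Sendable {
    let id: Int
    init(id: Int) { self.id = id }
}

// Hypothetical helper that counts how many times the creation body actually runs.
actor CreationCounter {
    private(set) var count = 0
    func increment() { count += 1 }
}

actor LazyManagerBox {
    private var task: Task<FakeManager?, Never>?
    private let counter: CreationCounter

    init(counter: CreationCounter) {
        self.counter = counter
    }

    func manager() async -> FakeManager? {
        // There is no suspension point between this check and the store below,
        // so two callers cannot both observe `task == nil`.
        if let task {
            return await task.value
        }
        let counter = self.counter
        let newTask = Task<FakeManager?, Never> {
            await counter.increment()
            // Simulated slow async creation (assumption for the demo).
            try? await Task.sleep(nanoseconds: 10_000_000)
            return FakeManager(id: 1)
        }
        task = newTask
        return await newTask.value
    }
}
```

Two concurrent calls to `manager()` on the same box await the same task and receive the same instance. The property that matters is that the task is stored before the first `await`, which is what protects against actor reentrancy.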

The crash isn't happening on macOS. We are seeing it in our telemetry for users on various iOS versions (16 and 17) and across many iPhone and iPad devices.

I can work on trying to get a redacted full crash log and sharing here - if there is a way to share a private version though I think my team would prefer that.

This is the type of crash we are seeing:

Exception Type:  EXC_CRASH (SIGABRT)
Exception Codes: 0x0000000000000000, 0x0000000000000000
Termination Reason: SIGNAL 6 Abort trap: 6

From what I can tell (by putting fatal errors in my own PushToTalkManager init and comparing those crash logs to my crash logs), the crash is happening somewhere in Apple's PTChannelManager.channelManager async function.

For example, if two places in the app needed to use the PushToTalk framework, we would want to ensure that the PTChannelManager was created only once, but lazily (so not until either area requested it).

The "two places at once" point is what I'm actually pointing at here. In a PTT app, PTChannelManager acts as one supporting component of a larger architecture that's responsible for coordinating the other, much more complicated tasks that actually make a PTT app "work". It needs to be a private component of that larger architecture ("anything" in your app shouldn't be messing with its state), and that larger architecture ALSO needs its own serialization and coordination system. I'm concerned you're trying to protect something for a situation that shouldn't really be happening in the first place.

Making that concrete, what you're protecting from here is double initialization, but that also implies that more "pieces" of your app are directly interacting with the PTChannelManager than I'd be comfortable with. Also, ironically, as I was looking at the code again I realized that you're actually protecting against a situation that the PTChannelManager won't allow anyway. PTChannelManager is actually implemented as a static singleton, so if you DO call "channelManager(delegate..." multiple times, then either:

  1. If the call happens very early, before initialization is complete, then it'll return "null" and fail with "instantiationAlreadyInProgress".

  2. Once initialization is complete, the call will succeed and return the existing PTChannelManager object.

This is also in the documentation:

"Multiple calls to channelManagerWithDelegate:restorationDelegate:completionHandler: result in the system returning the same shared instance, so store the channel manager in an instance variable."

FYI, it does look like there is technically a bug here (r.130766827), since the second call succeeds but ignores the two delegate arguments. In the current state, there isn't any way to change the delegate objects once PTChannelManager has been created. However, given how this class is intended to be used, this isn't a bug that should ever matter to your implementation.
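
As a rough illustration of the two cases listed above, here is a small self-contained model of that behavior. It is only a sketch and does not call the real framework: FakeChannelManager, FakeChannelManagerFactory, and FakeInstantiationError are hypothetical names, and the sleep stands in for the time the real initialization takes.

```swift
// Hypothetical error mirroring the "instantiationAlreadyInProgress" failure named above.
enum FakeInstantiationError: Error {
    case instantiationAlreadyInProgress
}

// Hypothetical stand-in for the shared channel manager object.
final class FakeChannelManager: Sendable {}

// Models the described semantics: at most one shared instance exists.
actor FakeChannelManagerFactory {
    private var shared: FakeChannelManager?
    private var inProgress = false

    func channelManager() async throws -> FakeChannelManager {
        // Case 2: initialization already finished, return the existing instance.
        if let shared {
            return shared
        }
        // Case 1: a call arrives while the first initialization is still running.
        if inProgress {
            throw FakeInstantiationError.instantiationAlreadyInProgress
        }
        inProgress = true
        // Simulated initialization delay (assumption for the demo).
        try? await Task.sleep(nanoseconds: 200_000_000)
        let manager = FakeChannelManager()
        shared = manager
        inProgress = false
        return manager
    }
}
```

In this model, a second call made while the first is still running fails, and any call made after completion returns the same object. That is why the app-side takeaway is to create the manager once and keep it in an instance variable rather than guarding against double creation.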

Side note on PTChannelManager's general architecture. All of the access concerns I've raised here are really about how your app manages its understanding of the PTT system, NOT the PTChannelManager itself. While it's not formally documented as "thread safe", all of the "work" it actually does is done by sending/receiving messages with system daemons (callservicesd). You'll notice that basically "all" of its methods either have a direct completion handler or "request" an action which then "finishes" through a delegate callback. That all comes from its interactions with the daemon over XPC. XPC is inherently asynchronous, and that's mirrored by the API.

This creates what I think of as a "practically thread safe" API. That is, while the API isn't formally documented as "thread safe" and there wasn't an explicit choice to say "let's make sure everything is perfect", the PRACTICAL outcome is that it bypasses all of the standard multithreading failure points.

I can work on trying to get a redacted full crash log and sharing here - if there is a way to share a private version though I think my team would prefer that.

Sure. File a code-level support request and include that I requested it on this thread, then send the files through email once you receive the submission acknowledgement. It'll get to me, but you can also post the follow up ID here to streamline things.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
