Under stress tests, our Network Extension crashed due to QOS?

Two different crash patterns -- one an abort, the other complaining about a lock being corrupt or the owning thread having exited. The first one is:

Thread 1 Crashed::  Dispatch queue: com.apple.root.default-qos.overcommit
0   libsystem_platform.dylib      	       0x18fc10244 _os_unfair_lock_corruption_abort + 88
1   libsystem_platform.dylib      	       0x18fc0b788 _os_unfair_lock_lock_slow + 332
2   libobjc.A.dylib               	       0x18f820c90 objc_sync_enter + 20
3   com.kithrup.TPProvider	       0x100d2eee0 closure #3 in TPProvider.startProxy(options:completionHandler:) + 340
4   com.kithrup.TPProvider	       0x100d2d980 thunk for @escaping @callee_guaranteed () -> () + 28
5   libdispatch.dylib             	       0x18fa31910 _dispatch_client_callout + 20
6   libdispatch.dylib             	       0x18fa34dc8 _dispatch_continuation_pop + 600
7   libdispatch.dylib             	       0x18fa48be4 _dispatch_source_latch_and_call + 420
8   libdispatch.dylib             	       0x18fa477b4 _dispatch_source_invoke + 832
9   libdispatch.dylib             	       0x18fa431f4 _dispatch_root_queue_drain + 392
10  libdispatch.dylib             	       0x18fa43a04 _dispatch_worker_thread2 + 156
11  libsystem_pthread.dylib       	       0x18fbdb0d8 _pthread_wqthread + 228
12  libsystem_pthread.dylib       	       0x18fbd9e30 start_wqthread + 8

while the other one is:

Application Specific Information:
BUG IN CLIENT OF LIBPLATFORM: os_unfair_lock is corrupt, or owner thread exited without unlocking
Abort Cause 198194

Thread 1 Crashed::  Dispatch queue: com.apple.root.default-qos.overcommit
0   libsystem_platform.dylib                   0x18fc10220 _os_unfair_lock_corruption_abort + 52
1   libsystem_platform.dylib                   0x18fc0b788 _os_unfair_lock_lock_slow + 332
2   libobjc.A.dylib                            0x18f820c90 objc_sync_enter + 20
3   com.kithrup.TPProvider             0x104e86ee0 closure #3 in TPProvider.startProxy(options:completionHandler:) + 340
4   com.kithrup.TPProvider             0x104e85980 thunk for @escaping @callee_guaranteed () -> () + 28
5   libdispatch.dylib                          0x18fa31910 _dispatch_client_callout + 20
6   libdispatch.dylib                          0x18fa34dc8 _dispatch_continuation_pop + 600
7   libdispatch.dylib                          0x18fa48be4 _dispatch_source_latch_and_call + 420
8   libdispatch.dylib                          0x18fa477b4 _dispatch_source_invoke + 832
9   libdispatch.dylib                          0x18fa431f4 _dispatch_root_queue_drain + 392
10  libdispatch.dylib                          0x18fa43a04 _dispatch_worker_thread2 + 156
11  libsystem_pthread.dylib                    0x18fbdb0d8 _pthread_wqthread + 228
12  libsystem_pthread.dylib                    0x18fbd9e30 start_wqthread + 8

Our TPProvider, whenever it uses a dispatch queue, uses a custom one, so these are presumably system queues and locks. My best guess would be that some XPC command took too long? But that's just a WAG.

Any ideas about what is actually going on?

How reproducible is this?

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

I can't reproduce it on demand (other than by running our stress tests), but it happens somewhat regularly -- that is, when I ran our stress tests over the weekend on an Intel machine and an Apple silicon machine, it crashed on both of them at least a couple of times.

Can you configure your stress tests to run with the debugger attached? That’d stop you in the debugger, which would give you more options for debugging this.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

I can try, but I've really not had a whole lot of success with trying to debug the network extension with a debugger. Anything I should look for or do, other than telling lldb to attach to the correct pid, and wait for a crash?

Anything I should look for or do, other than telling lldb to attach to the correct pid, and wait for a crash?

That’s the obvious place to start.

If you want to write more code you could:

  • Investigate core dumps.

  • Install a SIGABRT signal handler that stops the process to allow you to attach.

  • Disable SIP and start messing around with DTrace.
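
For reference, the signal handler option could be sketched roughly as below. This is only an illustration, not code from the thread: the function name installStopOnAbortHandler is hypothetical, and it assumes the crash really is delivered as SIGABRT (if the crash report shows a different signal, substitute that one). The handler does nothing but send SIGSTOP to the process, which leaves it suspended so that lldb can attach by pid.

```swift
import Foundation

// Hypothetical helper (not from the thread): call once, early in the
// provider's startup. Assumes the crash is delivered as SIGABRT.
func installStopOnAbortHandler() {
    // The closure captures nothing, so Swift can pass it as a C function pointer.
    _ = signal(SIGABRT) { _ in
        // kill(2) is async-signal-safe. SIGSTOP cannot be caught or ignored,
        // so the process stays suspended until a debugger attaches or it
        // receives SIGCONT.
        _ = kill(getpid(), SIGSTOP)
    }
}
```

Once the process is stopped, attaching with lldb and asking for a backtrace of all threads should show the crashing stack beneath the signal handler frames.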

We might get to that point, but they all seem like overkill right now.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

It crashed under lldb so I was able to backtrace -- it was my Swiftian @synchronized replacement, which (in some places) was using a Swift dictionary. I changed all the places to use an NSLock instead, and have been running it since, so far so good.
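
A minimal sketch of the NSLock approach, for anyone who finds this later. The names here (FlowTable, synchronized, demoConcurrentWrites) are made up for illustration and are not the actual TPProvider code. The idea is that one dedicated lock object guards the shared state, rather than passing a Swift value type such as a dictionary to objc_sync_enter, which, as far as I understand, expects a stable object identity that a bridged value type may not provide.

```swift
import Dispatch
import Foundation

// Hypothetical example type: a table of flows guarded by a dedicated NSLock.
// @unchecked Sendable is a promise that all access goes through the lock.
final class FlowTable: @unchecked Sendable {
    private let lock = NSLock()
    private var flows: [Int: String] = [:]

    // Runs `body` while holding the lock; defer releases it on every exit path.
    private func synchronized<T>(_ body: () throws -> T) rethrows -> T {
        lock.lock()
        defer { lock.unlock() }
        return try body()
    }

    func set(_ name: String, for id: Int) {
        synchronized { flows[id] = name }
    }

    func name(for id: Int) -> String? {
        synchronized { flows[id] }
    }

    var count: Int {
        synchronized { flows.count }
    }
}

// Small stress check: 1000 concurrent writes, each to a distinct key.
func demoConcurrentWrites() -> Int {
    let table = FlowTable()
    DispatchQueue.concurrentPerform(iterations: 1000) { i in
        table.set("flow-\(i)", for: i)
    }
    return table.count
}

print(demoConcurrentWrites())  // prints 1000
```

The lock lives alongside the data it protects and is never replaced or copied, so every caller contends on the same object.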
