I have a weird issue with our Endpoint Security extension.
A couple of the checks we do require calling into Apple frameworks (Security, Disk Arbitration) and these checks can happen early in the boot process. On macOS 13 (and possibly earlier), sometimes these calls get stuck and never return. When this happens, the kernel will kill the extension and this generates a crash log. This adds significant time to the boot, enough that people notice. In every case, the thread where the call into the Apple framework occurred shows that the thread is stuck in mach_msg2_trap(), which I now understand means it's likely waiting on an event or message.
Now here's where things get weird. I discovered that if I shunt the check off onto a Thread subclass and put it in a DispatchGroup (perhaps the wrong primitive), then wait() on that group with my own timeout, the thread will get unstuck within a couple hundred milliseconds of the timeout. The timeout can be a couple of seconds or longer. In every case, the thread unblocks, returns from mach_msg2_trap() and the original call finishes as expected.
Is there a rational explanation for this behavior? Am I crazy to even consider shipping this workaround?
On macOS 13 (and possibly earlier), sometimes these calls get stuck and never return. When this happens, the kernel will kill the extension and this generates a crash log.
First off, if you have crash logs then I'd like to see them. I'm fairly confident I understand what's going on (see below) but it's always worth seeing what the log(s) show.
A couple of the checks we do require calling into Apple frameworks (Security, Disk Arbitration) and these checks can happen early in the boot process.
The words "early in the boot process..." can be a BIG red flag when it comes to an ESE. Particularly if you're using "NSEndpointSecurityEarlyBoot", "early" can be very, VERY early, opening the door to all sorts of unexpected issues.
Case in point: on its own, diskarbitrationd (the daemon underneath the DiskArb framework) doesn't actually need to run all that "early" in the boot process. The boot device mounts without it and there generally isn't an "urgent" need to automount other volumes until later in the process.
One simple experiment I'd recommend here is to set up a completely "clean" install of macOS, boot and log in as normal, and then simply review the list of all running processes sorted by pid value. That list is basically the "natural" launch order of the system. If you repeat that same test multiple times, you'll find some variation between machines (caused by hardware differences) and between individual launches on the same machine, but the broad "pattern" is fairly consistent.
That leads to here:
In every case, the thread where the call into the Apple framework occurred shows that the thread is stuck in mach_msg2_trap(), which I now understand means it's likely waiting on an event or message.
Most of our frameworks involve some kind of supporting daemon which does the actual "work" of the API. So, for DiskArb, step 1 of calling into any DiskArb API is "connect to diskarbitrationd". In the early boot process, that also means "launch diskarbitrationd".
There are actually two different cases that can happen here:
- If you call an API that generates auth requests to you and you fail to process those auth requests, then you'll deadlock yourself and the system will kill you.
- Particularly when "NSEndpointSecurityEarlyBoot" is involved, the EndpointSecurity system may stall/delay other activity while it waits for your extension to "finish" launching.
Note the description of the "NSEndpointSecurityEarlyBoot" plist flag:
NSEndpointSecurityEarlyBoot
Type: Boolean
If set to TRUE, the ES subsystem will hold up all mounts and third party
executions (anything that is not a platform binary) until all early boot
ES extensions make their first subscription.
In other words, if diskarbitrationd decides to wait or block on a mount before it accepts your XPC connection, then you just deadlocked.
That leads to here:
Now here's where things get weird. I discovered that if I shunt the check off onto a Thread subclass and put it in a DispatchGroup (perhaps the wrong primitive), then wait() on that group with my own timeout, the thread will get unstuck within a couple hundred milliseconds of the timeout. The timeout can be a couple of seconds or longer. In every case, the thread unblocks, returns from mach_msg2_trap() and the original call finishes as expected.
Is there a rational explanation for this behavior?
Yes. I haven't seen your code, but my guess is that you started with a single thread that you were calling DiskArb on. That blocked (because DiskArb stalled) until you were killed during launch.
Your workaround above did something like the following:
1. Moved the blocking code off the initialization thread.
2. That code hangs because the ES system is waiting for you.
3. Your wait call blocks waiting on the stall.
4. The timeout expires and your initialization thread "returns".
5. The ES system allows work to continue, allowing the "block" to clear.
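That sequence can be reproduced in miniature. The following is only a sketch under stated assumptions: `systemGate` is a hypothetical stand-in for whatever the ES system is holding back until your initialization returns, and `CheckResult` and `simulateWorkaround()` are names invented for this illustration. Nothing here calls EndpointSecurity or DiskArb.

```swift
import Foundation

// Lock-protected box so the worker thread and the caller can share a result.
// (Hypothetical helper for this sketch.)
final class CheckResult: @unchecked Sendable {
    private let lock = NSLock()
    private var stored: String?
    func set(_ value: String) { lock.lock(); stored = value; lock.unlock() }
    func get() -> String? { lock.lock(); defer { lock.unlock() }; return stored }
}

// Simulates the five steps. Returns whether the timed wait expired and what
// the check eventually produced.
func simulateWorkaround() -> (timedOut: Bool, value: String?) {
    // Assumption: stands in for activity the system holds back until the
    // extension's initialization returns.
    let systemGate = DispatchSemaphore(value: 0)
    let result = CheckResult()
    let group = DispatchGroup()

    // Step 1: move the blocking call off the initialization thread.
    group.enter()
    let worker = Thread {
        // Step 2: the call hangs because the "system" is waiting on us.
        systemGate.wait()
        result.set("check finished")
        group.leave()
    }
    worker.start()

    // Step 3: the initialization thread waits on the stalled work.
    // Step 4: the timeout expires and initialization "returns".
    let outcome = group.wait(timeout: .now() + .milliseconds(200))

    // Step 5: with initialization done, the system lets work continue
    // (simulated here by opening the gate), so the worker unblocks.
    systemGate.signal()
    group.wait()
    return (outcome == .timedOut, result.get())
}

let run = simulateWorkaround()
print("timed out: \(run.timedOut), result: \(run.value ?? "none")")
```

In this model the length of the timeout doesn't matter: the worker only finishes after the wait gives up, which matches the "unblocks shortly after the timeout" behavior you described.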
Am I crazy to even consider shipping this workaround?
Sort of, though your workaround is on the right track.
First off, assuming the scenario I outlined above is correct, then your wait is adding an additional, pointless delay. It's better than crashing, but there's no reason to wait at all.
However, the bigger issue here is that you need to be thinking about this in much broader terms, not just specific frameworks or workarounds.
I've focused on DiskArb because you're using it and it's easy to explain, but the issue here is much broader than that. More specifically:
1. This issue could theoretically happen with nearly any of our frameworks, since none of them actually document their internal implementation in significant detail.
2. Our implementation could change at any time, which means you can't test/investigate your way around #1.
3. The real risk here isn't the system "base" implementation, it's the VAST array of other configurations and the possibility of using this kind of delay to create an attack vector.
IMHO, the solution here is that your ESE needs to be able to fully initialize without really calling into any API that might create any "external" activity. You can still use those APIs, but you can't block your own initialization waiting for them to complete or (ideally) REQUIRE them to be operational*.
* Similar to #3, the real issue here is unpredictable edge cases. Again, picking on DiskArb, the normal case is that it's running and working perfectly. However, I can come up with edge cases that might unexpectedly delay it and it's entirely possible it might not be running at all**. If you design with that possibility, then you can degrade/fail gracefully. If you don't, then at some point "something" will break and now you're stuck trying to figure out what happened.
** I have no idea if it's still common but, for example, many forensic apps told users to disable DiskArb as a way to ensure the system wouldn't unexpectedly automount the volume.
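To make the "initialize without blocking on external activity" idea concrete, here's a minimal sketch, not a real ESE. Every name in it (`VolumeInfo`, `ExtensionState`, `initializeExtension`, `policy(for:)`, the `"disk1s1"` value) is hypothetical, and the `externalLookup` closure stands in for any framework call that might stall or never answer. The point is only the shape: start the external work, return from initialization immediately, and treat "no answer yet" as a valid state.

```swift
import Foundation

// What the extension knows about a volume. `.unknown` is a first-class state,
// not an error: the external service may be slow or not running at all.
enum VolumeInfo: Equatable {
    case unknown
    case resolved(String)
}

// Lock-protected state shared between initialization and background work.
final class ExtensionState: @unchecked Sendable {
    private let lock = NSLock()
    private var info: VolumeInfo = .unknown
    func update(_ newValue: VolumeInfo) { lock.lock(); info = newValue; lock.unlock() }
    func snapshot() -> VolumeInfo { lock.lock(); defer { lock.unlock() }; return info }
}

// Starts the external lookup but never waits for it. In a real extension,
// client creation and the first subscription would follow here without
// depending on the lookup's result.
func initializeExtension(state: ExtensionState,
                         externalLookup: @escaping @Sendable () -> String,
                         onResolved: @escaping @Sendable () -> Void = {}) {
    DispatchQueue.global(qos: .utility).async {
        let answer = externalLookup()      // may stall; nobody is blocked on it
        state.update(.resolved(answer))
        onResolved()
    }
}

// Decisions degrade gracefully when the external answer is missing.
func policy(for info: VolumeInfo) -> String {
    switch info {
    case .unknown: return "conservative default"
    case .resolved(let name): return "policy for \(name)"
    }
}

// Demo: the lookup is stalled (simulated with a semaphore) while
// initialization returns and decisions are still being made.
let stall = DispatchSemaphore(value: 0)
let resolved = DispatchSemaphore(value: 0)
let state = ExtensionState()
initializeExtension(state: state,
                    externalLookup: { stall.wait(); return "disk1s1" },
                    onResolved: { resolved.signal() })
let duringStall = policy(for: state.snapshot())
stall.signal()       // the external service finally answers
resolved.wait()
let afterStall = policy(for: state.snapshot())
print(duringStall, "->", afterStall)
```

What the "conservative default" should be is a policy question for your product; the sketch only shows that the extension keeps working whether or not the answer ever arrives.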
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware