Stuck threads in Endpoint Security extension

I have a weird issue with our Endpoint Security extension.

A couple of the checks we do require calling into Apple frameworks (Security, Disk Arbitration) and these checks can happen early in the boot process. On macOS 13 (and possibly earlier), sometimes these calls get stuck and never return. When this happens, the kernel will kill the extension and this generates a crash log. This adds significant time to the boot, enough that people notice. In every case, the thread where the call into the Apple framework occurred shows that the thread is stuck in mach_msg2_trap(), which I now understand means it's likely waiting on an event or message.

Now here's where things get weird. I discovered that if I shunt the check off onto a Thread subclass and put it in a DispatchGroup (perhaps the wrong primitive), then wait() on that group with my own timeout, the thread will get unstuck within a couple hundred milliseconds of the timeout. The timeout can be a couple of seconds or longer. In every case, the thread unblocks, returns from mach_msg2_trap() and the original call finishes as expected.

Is there a rational explanation for this behavior? Am I crazy to even consider shipping this workaround?

Answered by DTS Engineer in 823960022

On macOS 13 (and possibly earlier), sometimes these calls get stuck and never return. When this happens, the kernel will kill the extension and this generates a crash log.

First off, if you have crash logs then I'd like to see them. I'm fairly confident I understand what's going on (see below) but it's always worth seeing what the log(s) show.

A couple of the checks we do require calling into Apple frameworks (Security, Disk Arbitration) and these checks can happen early in the boot process.

The words "early in the boot process..." can be a BIG red flag when it comes to an ESE. Particularly if you're using "NSEndpointSecurityEarlyBoot", "early" can be very, VERY early, opening the door to all sorts of unexpected issues.

Case in point, on its own, diskarbitrationd (the daemon underneath the DiskArb framework) doesn't actually need to run all that "early" in the boot process. The boot device mounts without it and there generally isn't an "urgent" need to automount other volumes until later in the process.

One simple experiment I'd recommend here is to set up a completely "clean" install of macOS, boot and log in as normal, and then review the list of all running processes sorted by pid value. That list is basically the "natural" launch order of the system. If you repeat that same test multiple times, you'll find some variation between machines (caused by hardware differences) and between individual launches on the same machine, but the broad "pattern" is fairly consistent.

That leads to here:

In every case, the thread where the call into the Apple framework occurred shows that the thread is stuck in mach_msg2_trap(), which I now understand means it's likely waiting on an event or message.

Most of our frameworks rely on some kind of supporting daemon which does the actual "work" of the API. So, for DiskArb, step 1 of calling into any DiskArb API is "connect to diskarbitrationd". In the early boot process, that also means "launch diskarbitrationd".

There are actually two different cases that can happen here:

  1. If you call an API that generates auth requests to you and you fail to process those auth requests, then you'll deadlock yourself and the system will kill you.

  2. Particularly when "NSEndpointSecurityEarlyBoot" is involved, the EndpointSecurity system may stall/delay other activity while it waits for your extension to "finish" launching.

Note the description of the "NSEndpointSecurityEarlyBoot" plist flag:

NSEndpointSecurityEarlyBoot
    Type: Boolean

	If set to TRUE, the ES subsystem will hold up all mounts and third party
	executions (anything that is not a platform binary) until all early boot
	ES extensions make their first subscription.

In other words, if diskarbitrationd decides to wait or block on a mount before it accepts your XPC connection, then you just deadlocked.

That leads to here:

Now here's where things get weird. I discovered that if I shunt the check off onto a Thread subclass and put it in a DispatchGroup (perhaps the wrong primitive), then wait() on that group with my own timeout, the thread will get unstuck within a couple hundred milliseconds of the timeout. The timeout can be a couple of seconds or longer. In every case, the thread unblocks, returns from mach_msg2_trap() and the original call finishes as expected.

Is there a rational explanation for this behavior?

Yes. I haven't seen your code, but my guess is that you started with a single thread that you were calling DiskArb on. That blocked (because DiskArb stalled) until you were killed during launch.

Your workaround above did something like the following:

  • Moved the blocking code off the initialization thread.

  • That code hangs because the ES system is waiting for you.

  • Your wait call blocks waiting on the stall.

  • The timeout expires and your initialization thread "returns".

  • The ES system allows work to continue, allowing the "block" to clear.

Am I crazy to even consider shipping this workaround?

Sort of, though your workaround is on the right track.

First off, assuming the scenario I outlined above is correct, your wait is adding an additional, pointless delay. It's better than crashing, but there's no reason to wait at all.

However, the bigger issue here is that you need to be thinking about this in much broader terms, not just specific frameworks or workarounds.

I've focused on DiskArb because you're using it and it's easy to explain, but the issue here is much broader than that. More specifically:

  1. This issue could theoretically happen with nearly any of our frameworks, since none of them actually document their internal implementation in significant detail.

  2. Our implementation could change at any time, which means you can't test/investigate your way around #1.

  3. The real risk here isn't the system "base" implementation, it's the VAST array of other configurations and the possibility of using this kind of delay to create an attack vector.

IMHO, the solution here is that your ESE needs to be able to fully initialize without really calling into any API that might create any "external" activity. You can still use those APIs, but you can't block your own initialization waiting for them to complete or (ideally) REQUIRE them to be operational*.

*Similar to #3, the real issue here is unpredictable edge cases. Again, picking on DiskArb, the normal case is that it's running and working perfectly. However, I can come up with edge cases that might unexpectedly delay it and it's entirely possible it might not be running at all**. If you design with that possibility in mind, then you can degrade/fail gracefully. If you don't, then at some point "something" will break and now you're stuck trying to figure out what happened.

**I have no idea if it's still common but, for example, many forensic apps told users to disable DiskArb as a way to ensure the system wouldn't unexpectedly auto mount the volume.
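
One way to picture "use the API but never require it" is a small state flag that the event path reads without blocking. The following is only a minimal sketch in portable C: the names `service_state_t`, `service_probe_finished`, and `choose_mount_strategy` are hypothetical, and the assumption is that some background thread (not initialization, and not the event callback) performs the actual connection attempt and records the outcome.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical availability states for an external service such as DiskArb. */
typedef enum {
    SERVICE_UNKNOWN,      /* no connection attempt has finished yet */
    SERVICE_READY,        /* the background connection attempt succeeded */
    SERVICE_UNAVAILABLE   /* the attempt failed, or the service is disabled */
} service_state_t;

/* Starts as UNKNOWN so initialization never has to wait for the service. */
static _Atomic int g_service_state = SERVICE_UNKNOWN;

/* Assumed to be called from a background thread once the (possibly slow)
 * connection attempt completes. */
static void
service_probe_finished(bool connected)
{
    atomic_store(&g_service_state,
                 connected ? SERVICE_READY : SERVICE_UNAVAILABLE);
}

typedef enum { MOUNT_USE_SERVICE, MOUNT_USE_FALLBACK } mount_strategy_t;

/* Non-blocking decision: anything other than READY takes the fallback path. */
static mount_strategy_t
choose_mount_strategy(void)
{
    return atomic_load(&g_service_state) == SERVICE_READY
               ? MOUNT_USE_SERVICE
               : MOUNT_USE_FALLBACK;
}
```

What the fallback path actually does (allow, deny, or decide from message data alone) is a policy choice this sketch deliberately leaves open.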

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hi Kevin. Thanks for your detailed reply.

We are not using NSEndpointSecurityEarlyBoot. None of the calls into Apple frameworks that we've seen lead to killing the extension are during our extension initialization. They're all in response to some event, e.g., ES_EVENT_TYPE_AUTH_MOUNT calls into DiskArbitration, ES_EVENT_TYPE_AUTH_OPEN calls into Security. Important to note: we only care about those OPEN events for a restricted set of paths (our files) and immediately return ALLOW for anything else.

Here is a lightly redacted crash log:

Process:               com.redacted.EndpointSecurity [492]
Path:                  /Library/SystemExtensions/*/com.redacted.EndpointSecurity
Identifier:            com.redacted.EndpointSecurity
Version:               v2.10.0-21-g35018b949c-dirty (58)
Code Type:             ARM-64 (Native)
Parent Process:        launchd [1]
User ID:               0

Date/Time:             2025-02-04 12:18:33.7447 -0500
OS Version:            macOS 13.6.7 (22G720)
Report Version:        12
Anonymous UUID:        6570580F-1EF2-E6B5-E10B-CA9F00455210

Time Awake Since Boot: 58 seconds

System Integrity Protection: enabled

Crashed Thread:        1

Exception Type:        EXC_CRASH (SIGKILL)
Exception Codes:       0x0000000000000000, 0x0000000000000000

Termination Reason:    Namespace ENDPOINTSECURITY, Code 2 EndpointSecurity client terminated because it failed to respond to a message before its deadline

Thread 0:
0   libsystem_pthread.dylib       	       0x18836ad8c start_wqthread + 0

Thread 1 Crashed:
0   libsystem_kernel.dylib        	       0x1883375c8 __sigsuspend_nocancel + 8
1   libdispatch.dylib             	       0x1881d3ba8 _dispatch_sigsuspend + 48
2   libdispatch.dylib             	       0x1881d3b78 _dispatch_sig_thread + 60

Thread 2::  Dispatch queue: BBReaderQueue
0   libsystem_kernel.dylib        	       0x18832fef4 mach_msg2_trap + 8
1   libsystem_kernel.dylib        	       0x188342220 mach_msg2_internal + 80
2   libsystem_kernel.dylib        	       0x188338b58 mach_msg_overwrite + 604
3   libsystem_kernel.dylib        	       0x188330270 mach_msg + 24
4   CarbonCore                    	       0x18b17b5f0 _scclient_ServerCheckinWithResult_rpc + 148
5   CarbonCore                    	       0x18b17b3dc SCClientSession::checkinWithServer(unsigned int*) + 224
6   CarbonCore                    	       0x18b17b0f0 connectToCoreServicesD() + 128
7   CarbonCore                    	       0x18b17af1c getStatus() + 64
8   CarbonCore                    	       0x18b17e308 scCreateSystemServiceVersion + 56
9   CarbonCore                    	       0x18b17e17c FileIDTreeGetCachedPort + 276
10  CarbonCore                    	       0x18b17df74 FSNodeStorageGetAndLockCurrentUniverse + 68
11  CarbonCore                    	       0x18b17dd64 FileIDTreeGetAndLockVolumeEntryForDeviceID + 88
12  CarbonCore                    	       0x18b17dbcc FSMount::FSMount(unsigned int, FSMountNumberType, int*, unsigned int const*) + 92
13  CarbonCore                    	       0x18b17db28 FSMountPrepare + 76
14  CoreServicesInternal          	       0x18b4845dc MountInfoPrepare + 68
15  CoreServicesInternal          	       0x18b483f9c parseAttributeBuffer(__CFAllocator const*, unsigned char const*, unsigned char, attrlist const*, void const*, void**, _FileAttributes*, unsigned int*) + 2848
16  CoreServicesInternal          	       0x18b482f58 corePropertyProviderPrepareValues(__CFURL const*, __FileCache*, __CFString const* const*, void const**, long, void const*, __CFError**) + 1248
17  CoreServicesInternal          	       0x18b482a04 prepareValuesForBitmap(__CFURL const*, __FileCache*, _FilePropertyBitmap*, __CFError**) + 452
18  CoreServicesInternal          	       0x18b47f86c _FSURLCopyResourcePropertyForKeyInternal(__CFURL const*, __CFString const*, void*, void*, __CFError**, unsigned char) + 232
19  CoreFoundation                	       0x188422f5c CFURLCopyResourcePropertyForKey + 108
20  Security                      	       0x18ac34fcc SecTrustEvaluateIfNecessary + 132
21  Security                      	       0x18ac3699c SecTrustEvaluateInternal + 48
22  Security                      	       0x18acc4fd4 decodeTimeStampTokenWithPolicy + 1320
23  Security                      	       0x18acbf0a8 SecCmsSignerInfoVerifyUnAuthAttrsWithPolicy + 96
24  Security                      	       0x18acc6294 SecCmsSignedDataVerifySignerInfo_internal + 696
25  Security                      	       0x18acc6b60 CMSDecoderCopySignerStatus + 164
26  Security                      	       0x18acde4d0 Security::CodeSigning::SecStaticCode::validateDirectory() + 956
27  Security                      	       0x18ace2240 Security::CodeSigning::SecStaticCode::validateNonResourceComponents() + 24
28  Security                      	       0x18ace6c38 Security::CodeSigning::SecStaticCode::staticValidateCore(unsigned int, Security::CodeSigning::SecRequirement const*) + 72
29  Security                      	       0x18ace5c7c Security::CodeSigning::SecStaticCode::staticValidate(unsigned int, Security::CodeSigning::SecRequirement const*) + 308
30  Security                      	       0x18acdae98 SecStaticCodeCheckValidityWithErrors + 228
31  com.redacted.EndpointSecurity	       0x1004396ec 0x100430000 + 38636
32  com.redacted.EndpointSecurity	       0x10043489c 0x100430000 + 18588
33  com.redacted.EndpointSecurity	       0x1004361cc 0x100430000 + 25036
34  com.redacted.EndpointSecurity	       0x100434284 0x100430000 + 17028
35  libEndpointSecurity.dylib     	       0x19af7d7d0 BBReader<ESMessageReaderConfig>::handleItems() + 356
36  libEndpointSecurity.dylib     	       0x19af7d558 BBReader<ESMessageReaderConfig>::woke(void*) + 28
37  libdispatch.dylib             	       0x1881c0400 _dispatch_client_callout + 20
38  libdispatch.dylib             	       0x1881c3884 _dispatch_continuation_pop + 504
39  libdispatch.dylib             	       0x1881d6e7c _dispatch_source_invoke + 1588
40  libdispatch.dylib             	       0x1881c7960 _dispatch_lane_serial_drain + 372
41  libdispatch.dylib             	       0x1881c862c _dispatch_lane_invoke + 436
42  libdispatch.dylib             	       0x1881c98e8 _dispatch_workloop_invoke + 1764
43  libdispatch.dylib             	       0x1881d3244 _dispatch_workloop_worker_thread + 648
44  libsystem_pthread.dylib       	       0x18836c074 _pthread_wqthread + 288
45  libsystem_pthread.dylib       	       0x18836ad94 start_wqthread + 8

Thread 3:
0   libsystem_pthread.dylib       	       0x18836ad8c start_wqthread + 0

Accepted Answer

We are not using NSEndpointSecurityEarlyBoot. None of the calls into Apple frameworks that we've seen lead to killing the extension are during our extension initialization. They're all in response to some event, e.g., ES_EVENT_TYPE_AUTH_MOUNT calls into DiskArbitration, ES_EVENT_TYPE_AUTH_OPEN calls into Security. Important to note: we only care about those OPEN events for a restricted set of paths (our files) and immediately return ALLOW for anything else.

The stack you posted is a standard example of what I described here:

...If you call an API that generates auth requests to you and you fail to process those auth requests, then you'll deadlock yourself and the system will kill you.

That is, your code is running directly in the event delivery callback:

34  com.redacted.EndpointSecurity	       0x100434284 0x100430000 + 17028
35  libEndpointSecurity.dylib     	       0x19af7d7d0 BBReader<ESMessageReaderConfig>::handleItems() + 356
36  libEndpointSecurity.dylib     	       0x19af7d558 BBReader<ESMessageReaderConfig>::woke(void*) + 28
37  libdispatch.dylib             	       0x1881c0400 _dispatch_client_callout + 20

It called into a system API:

30  Security                      	       0x18acdae98 SecStaticCodeCheckValidityWithErrors + 228
31  com.redacted.EndpointSecurity	       0x1004396ec 0x100430000 + 38636

And that system API then called out to a daemon and our code is now waiting to hear back from that daemon:

3   libsystem_kernel.dylib        	       0x188330270 mach_msg + 24
4   CarbonCore                    	       0x18b17b5f0 _scclient_ServerCheckinWithResult_rpc + 148
...
14  CoreServicesInternal          	       0x18b4845dc MountInfoPrepare + 68
15  CoreServicesInternal          	       0x18b483f9c parseAttributeBuffer(__CFAllocator const*, unsigned char const*, unsigned char, attrlist const*, void const*, void**, _FileAttributes*, unsigned int*) + 2848

If that daemon, or any daemon THAT daemon calls into (yep, it's daemons all the way down...), makes ANY call that generates an auth callback into you... then you'll deadlock against yourself, leading to your app's termination.

Anticipating follow up questions:

1) Why does it only/often happen at startup?

Just like most apps, our daemons often read configuration files or do other one-off initialization at launch. That means they'll often generate different auth/notify activity at "boot" than they would at any other time.

2) What APIs can I safely call inside my direct ESE callback?

None? Or, more specifically, you can ONLY call functions that you are absolutely certain will NEVER block under ANY circumstances. Strictly speaking, you could call a basic primitive like taking a lock; however, even that is potentially dangerous. For example, if the holder of the lock on another thread called into SecStaticCodeCheckValidityWithErrors under the same conditions, then you'd generate a fairly similar crash somewhat more randomly (depending on race conditions). In practice that means that:

  • You can ONLY do static analysis (meaning analysis based on the data available in the message itself) on the receiving handler.

  • Any analysis that calls into any system API needs to happen "off" the delivery queue.

  • Any data shared with the static analysis engine needs to be managed such that it ensures the static analysis engine will never block.
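
The first bullet can be sketched as a pure function over data that is already in the message. This is a minimal, hedged example: `open_verdict_t`, `k_watched_prefixes`, and `classify_open_path` are hypothetical names, and the prefixes are placeholders. In a real handler the path and length would presumably come from `msg->event.open.file->path`, with "allow" sent straight from the callback and everything else handed to a worker.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical verdicts for the fast path in the delivery callback. */
typedef enum {
    OPEN_ALLOW_NOW,   /* not one of our files: respond immediately */
    OPEN_NEEDS_WORK   /* one of our files: hand off to a worker queue */
} open_verdict_t;

/* Placeholder list of watched directory prefixes. It is immutable after
 * startup, so reading it never needs a lock and can never block. */
static const char *const k_watched_prefixes[] = {
    "/Library/Application Support/Example/",
    "/usr/local/example/",
};

/* Static analysis only: compares bytes already present in the message and
 * calls no system API that could message a daemon. The path is treated as
 * a length-delimited buffer rather than a NUL-terminated string. */
static open_verdict_t
classify_open_path(const char *path, size_t length)
{
    size_t count = sizeof(k_watched_prefixes) / sizeof(k_watched_prefixes[0]);

    for (size_t i = 0; i < count; i++) {
        size_t prefix_len = strlen(k_watched_prefixes[i]);
        if (length >= prefix_len &&
            memcmp(path, k_watched_prefixes[i], prefix_len) == 0) {
            return OPEN_NEEDS_WORK;
        }
    }
    return OPEN_ALLOW_NOW;
}
```

Only the `OPEN_NEEDS_WORK` case would then leave the delivery queue, which keeps the common "not our file" path free of any call that can stall.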

Note that while our ESE sample project does implement this pattern, I don't think I'd recommend blindly copying its architecture. Looking at our code, "handle_exec" is doing static analysis, while "handle_open" does this:

static void
init_dispatch_queue(void)
{
	// Choose an appropriate Quality of Service class appropriate for your app.
	// https://developer.apple.com/documentation/dispatch/dispatchqos
	dispatch_queue_attr_t queue_attrs = dispatch_queue_attr_make_with_qos_class(
			DISPATCH_QUEUE_CONCURRENT, QOS_CLASS_USER_INITIATED, 0);

	g_event_queue = dispatch_queue_create("event_queue", queue_attrs);
}

...
static void
handle_open(es_client_t *client, const es_message_t *msg)
{
...
	es_retain_message(msg);

	dispatch_async(g_event_queue, ^{
		handle_open_worker(client, msg);
		es_release_message(msg);
	});
}

g_event_queue is a concurrent queue, so at a purely technical level this is fairly safe, particularly given the limited work that's actually happening in handle_open_worker. However, in real-world conditions where more work is going on, this can easily cause a GCD thread explosion with corresponding performance loss and/or misprioritization of work.

This quickly leads into a much broader architecture topic, but I'd recommend reviewing this post and this post.

3) This makes me think you're using dispatch_main:

Thread 0:
0   libsystem_pthread.dylib       	       0x18836ad8c start_wqthread + 0

...and I think that's a mistake. You can read more about the issue on this thread, but my recommendation would be to block your main thread using NSRunLoop or CFRunLoop (doesn't actually matter which). That ensures you have a "standard" main thread which:

"My "default" answer here would probably be #1, as that provides access to the broadest possible API set with the lowest possibility of failure."

More fundamentally, I can think of all sorts of odd edge cases which dispatch_main can create and NONE (yes, NONE) which NS/CFRunLoop creates.

4) Doing your own code sign verification like this:

30  Security                      	       0x18acdae98 SecStaticCodeCheckValidityWithErrors + 228
31  com.redacted.EndpointSecurity	       0x1004396ec 0x100430000 + 38636

...is very likely to be a mistake. The problem here is that you can replace open files, which means you can launch "app1" then replace it with "app2", tricking your engine into thinking "app1" is actually "app2". That issue is why the codesign data is included in the es_process_t struct. That codesign data is the data the kernel is actually working off of, not whatever happens to be on disk.
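
As a rough illustration of checking identity from message data instead of from disk, the sketch below compares length-delimited tokens against expected values. To stay compilable without the EndpointSecurity SDK it uses stand-in types: `string_token_t` mirrors the length/data shape of `es_string_token_t`, and `process_identity_t` copies only the `team_id`, `signing_id`, and `is_platform_binary` fields of `es_process_t`. The helper names, and the policy of rejecting platform binaries, are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in with the same shape as es_string_token_t. */
typedef struct {
    size_t length;
    const char *data;
} string_token_t;

/* Stand-in holding a subset of the codesign fields found in es_process_t. */
typedef struct {
    string_token_t team_id;
    string_token_t signing_id;
    bool is_platform_binary;
} process_identity_t;

/* Length-checked comparison that does not rely on NUL termination. */
static bool
token_equals(string_token_t token, const char *expected)
{
    size_t expected_len = strlen(expected);
    return token.data != NULL &&
           token.length == expected_len &&
           memcmp(token.data, expected, expected_len) == 0;
}

/* Hypothetical policy: "our" process is a third-party binary whose team ID
 * and signing ID both match the expected values. */
static bool
is_expected_process(const process_identity_t *proc,
                    const char *team_id, const char *signing_id)
{
    if (proc->is_platform_binary) {
        return false;
    }
    return token_equals(proc->team_id, team_id) &&
           token_equals(proc->signing_id, signing_id);
}
```

In a real client the tokens would presumably be read from `msg->process`, and a production check might also want to look at `codesigning_flags`. Either way, this is pure comparison over message data, so it cannot stall on a daemon.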

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
