We are not using NSEndpointSecurityEarlyBoot. None of the calls into Apple frameworks that we've seen lead to killing the extension are during our extension initialization. They're all in response to some event, e.g., ES_EVENT_TYPE_AUTH_MOUNT calls into DiskArbitration, ES_EVENT_TYPE_AUTH_OPEN calls into Security. Important to note: we only care about those OPEN events for a restricted set of paths (our files) and immediately return ALLOW for anything else.
The stack you posted is a standard example of what I described here:
...If you call an API that generates auth requests to you and you fail to process those auth requests, then you'll deadlock yourself and the system will kill you.
That is, your code is running directly in the event delivery callback:
34 com.redacted.EndpointSecurity 0x100434284 0x100430000 + 17028
35 libEndpointSecurity.dylib 0x19af7d7d0 BBReader<ESMessageReaderConfig>::handleItems() + 356
36 libEndpointSecurity.dylib 0x19af7d558 BBReader<ESMessageReaderConfig>::woke(void*) + 28
37 libdispatch.dylib 0x1881c0400 _dispatch_client_callout + 20
It called into a system API:
30 Security 0x18acdae98 SecStaticCodeCheckValidityWithErrors + 228
31 com.redacted.EndpointSecurity 0x1004396ec 0x100430000 + 38636
And that system API then called out to a daemon and our code is now waiting to hear back from that daemon:
3 libsystem_kernel.dylib 0x188330270 mach_msg + 24
4 CarbonCore 0x18b17b5f0 _scclient_ServerCheckinWithResult_rpc + 148
...
14 CoreServicesInternal 0x18b4845dc MountInfoPrepare + 68
15 CoreServicesInternal 0x18b483f9c parseAttributeBuffer(__CFAllocator const*, unsigned char const*, unsigned char, attrlist const*, void const*, void**, _FileAttributes*, unsigned int*) + 2848
If that daemon, or any daemon THAT daemon calls into (yep, it's daemons all the way down...), makes ANY call that generates an auth callback into you... then you'll deadlock against yourself, leading to your app's termination.
Anticipating follow up questions:
(1) Why does it only/often happen at startup?
Just like most apps, our daemons often read configuration files or do other one-off initialization at launch. That means they'll often generate different auth/notify activity at "boot" than they would at any other time.
(2) What APIs can I safely call inside my direct ESE callback?
None? Or, more specifically, you can ONLY call functions that you are absolutely certain will NEVER block under ANY circumstances. Strictly speaking, you could call a basic primitive like taking a lock; however, even that is potentially dangerous. For example, if the holder of the lock on another thread called into SecStaticCodeCheckValidityWithErrors under the same conditions, then you'd generate a fairly similar crash somewhat more randomly (depending on race conditions). In practice that means that:
- You can ONLY do static analysis (meaning analysis based on the data available in the message itself) on the receiving handler.
- Any analysis that calls into any system API needs to happen "off" the delivery queue.
- Any data shared with the static analysis engine needs to be managed such that it ensures the static analysis engine will never block.
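To make "static analysis" concrete, here is a minimal sketch of the kind of check that can run directly in the handler: a path-prefix test against a list fixed at compile time. Everything named here is an assumption for illustration. `path_token_t` is a hypothetical stand-in for `es_string_token_t` (so the sketch builds without the EndpointSecurity headers), and `kProtectedPrefixes` and `path_is_protected` are made-up names with made-up paths. The point is only that the check touches nothing but in-memory data: no file I/O, no locks, no IPC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for es_string_token_t (a length plus a pointer to
 * the bytes), so this sketch compiles without the EndpointSecurity headers. */
typedef struct {
    size_t length;
    const char *data;
} path_token_t;

/* Hypothetical list of protected directories. It is fixed at compile time,
 * so reading it needs no lock and can never block. */
static const char *const kProtectedPrefixes[] = {
    "/Library/Application Support/Example/",
    "/usr/local/example/",
};

/* Returns true when `path` lies under one of the protected prefixes.
 * Uses only strlen/memcmp on in-memory data: no file I/O, no locks, no IPC. */
static bool path_is_protected(path_token_t path)
{
    size_t count = sizeof(kProtectedPrefixes) / sizeof(kProtectedPrefixes[0]);
    for (size_t i = 0; i < count; i++) {
        size_t prefix_len = strlen(kProtectedPrefixes[i]);
        if (path.length >= prefix_len &&
            memcmp(path.data, kProtectedPrefixes[i], prefix_len) == 0) {
            return true;
        }
    }
    return false;
}
```

In a real handler the token would presumably come from the message itself (for an OPEN event, the path of the file being opened). Anything this check cannot answer on its own is what has to move off the delivery queue.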
Note that while our ESE sample project does implement this pattern, I don't think I'd recommend blindly copying its architecture. Looking at our code, "handle_exec" is doing static analysis, while "handle_open" does this:
static void
init_dispatch_queue(void)
{
    // Choose an appropriate Quality of Service class appropriate for your app.
    // https://developer.apple.com/documentation/dispatch/dispatchqos
    dispatch_queue_attr_t queue_attrs = dispatch_queue_attr_make_with_qos_class(
        DISPATCH_QUEUE_CONCURRENT, QOS_CLASS_USER_INITIATED, 0);
    g_event_queue = dispatch_queue_create("event_queue", queue_attrs);
}
...
static void
handle_open(es_client_t *client, const es_message_t *msg)
{
    ...
    es_retain_message(msg);
    dispatch_async(g_event_queue, ^{
        handle_open_worker(client, msg);
        es_release_message(msg);
    });
}
g_event_queue is a concurrent queue, so at a purely technical level this is fairly safe, particularly given the limited work that's actually happening in handle_open_worker. However, in real-world conditions where more work is going on, this can easily cause a GCD thread explosion with corresponding performance loss and/or misprioritization of work.
This quickly leads into a much broader architecture topic, but I'd recommend reviewing this post and this post.
(3) This makes me think you're using dispatch_main:
Thread 0:
0 libsystem_pthread.dylib 0x18836ad8c start_wqthread + 0
...and I think that's a mistake. You can read more about the issue on this thread, but my recommendation would be to block your main thread using NSRunLoop or CFRunLoop (doesn't actually matter which). That ensures you have a "standard" main thread which:
"My "default" answer here would probably be #1, as that provides access to the broadest possible API set with the lowest possibility of failure."
More fundamentally, I can think of all sorts of odd edge cases which dispatch_main can create and NONE (yes, NONE) which NS/CFRunLoop creates.
(4) Doing your own code sign verification like this:
30 Security 0x18acdae98 SecStaticCodeCheckValidityWithErrors + 228
31 com.redacted.EndpointSecurity 0x1004396ec 0x100430000 + 38636
...is very likely to be a mistake. The problem here is that you can replace open files, which means you can launch "app1" then replace it with "app2", tricking your engine into thinking "app1" is actually "app2". That issue is why the codesign data is included in the es_process_t struct. That codesign data is the data the kernel is actually working off of, not whatever happens to be on disk.
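As a rough illustration of working from the message's codesign data instead of re-validating the file on disk, the sketch below checks a process identity using only values copied out of `es_process_t`. The names are assumptions: `process_signing_info_t` is a hypothetical stand-in that mirrors the `codesigning_flags` and `team_id` fields so the code builds without the EndpointSecurity headers, `process_has_team_id` is a made-up helper, and `SKETCH_CS_VALID` is a local copy of the kernel's `CS_VALID` flag value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of CS_VALID from the kernel's code-signing flags. */
#define SKETCH_CS_VALID 0x00000001u

/* Hypothetical stand-in for the es_process_t fields used below:
 * codesigning_flags, plus team_id (an es_string_token_t) split into
 * its length and data parts. */
typedef struct {
    uint32_t codesigning_flags;
    size_t team_id_length;
    const char *team_id_data;
} process_signing_info_t;

/* Returns true when the kernel-reported signature is valid and the team ID
 * matches `expected_team_id`. Pure in-memory comparison; it never calls into
 * Security.framework and so cannot generate new auth events. */
static bool process_has_team_id(const process_signing_info_t *info,
                                const char *expected_team_id)
{
    if ((info->codesigning_flags & SKETCH_CS_VALID) == 0) {
        return false; /* unsigned, or the signature has been invalidated */
    }
    size_t expected_len = strlen(expected_team_id);
    if (expected_len == 0 || info->team_id_length != expected_len) {
        return false; /* an empty team ID never matches */
    }
    return memcmp(info->team_id_data, expected_team_id, expected_len) == 0;
}
```

With the real headers, the three fields would be filled from `msg->process` (or, for an EXEC event, from `msg->event.exec.target`). Whether team ID alone is a sufficient identity check depends on your threat model; `signing_id` and `cdhash` are also available in the same struct.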
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware