EndPointSecurity system extension crashing due to deadline

Hi,

Greetings of the day!

I would like help avoiding an Endpoint Security system extension crash with the following termination reason:

Termination Reason:    Namespace ENDPOINTSECURITY, Code 2 EndpointSecurity client terminated because it failed to respond to a message before its deadline

We have subscribed to a couple of events. For AUTH events we receive a deadline of 14 seconds on Sonoma. To avoid the issue above, we implemented a queue so that we deliver a verdict within the deadline and the OS does not kill our extension. However, we sometimes still see a crash with the following message:

Termination Reason:    Namespace ENDPOINTSECURITY, Code 2 EndpointSecurity client terminated because it failed to respond to a message before its deadline

**Dispatch Thread Soft Limit Reached: 64** (too many dispatch threads blocked in synchronous operations)

There is no GCD API to check whether a queue has reached the soft limit, so we need help finding out whether the queue has reached the soft limit of 64.

If we could check that, we would avoid adding new tasks until the queue is free to accept them.

For NOTIFY_CLOSE, we receive a large deadline value (in seconds), and we do all NOTIFY_CLOSE processing via dispatch_async, yet we still see the crash.

Here is the code for AUTH_OPEN:

dispatch_queue_t gNotifyCloseQueue = dispatch_queue_create(
    "com.example.notify_close_queue",
    dispatch_queue_attr_make_with_qos_class(DISPATCH_QUEUE_CONCURRENT_WITH_AUTORELEASE_POOL,
                                            QOS_CLASS_UTILITY, 0));
dispatch_queue_t gAuthOpenQueue = dispatch_queue_create(
    "com.example.auth_open_queue",
    dispatch_queue_attr_make_with_qos_class(DISPATCH_QUEUE_CONCURRENT_WITH_AUTORELEASE_POOL,
                                            QOS_CLASS_USER_INTERACTIVE, 0));

BOOL AuthOpenEventHandler(es_message_t *pesMsg)
{
    //Some Processing we are doing here like Calculate the deadline in seconds etc. and we are receiving 14 seconds in Sonoma

    // deadline - 14 seconds
    if ( deadlineInSeconds < 10 )
    {
        dispatch_time_t triggerTime = dispatch_time(pesMsg->deadline, (int64_t)(-1 * NSEC_PER_SEC));
        
        __block es_message_t *pesTempMsg;
        pesTempMsg = es_copy_message(pesMsg);
        
        dispatch_after(triggerTime, gAuthOpenQueue, ^{
            
            if (pesTempMsg != NULL)
            {
                esRespondRes = es_respond_flags_result(pesClt,pesMsg,pesMsg->event.open.fflag,false);
                if(ES_RESPOND_RESULT_SUCCESS != esRespondRes)
                {
                    es_free_message(pesTempMsg);
                    return;
                }
                if (pesTempMsg != NULL) {
                    es_free_message(pesTempMsg);
                }
            }
            return;
        });
    }
    // Some Processing we are doing here to provide verdict and we are making sure that within 11 seconds we are setting the verdict
    // we are setting iRetFlag here based on verdict
    
    if (NULL != pesMsg)
    {
        esRespondRes = es_respond_flags_result(pesClt,pesMsg,iRetFlag,false);
        if(ES_RESPOND_RESULT_SUCCESS != esRespondRes)
        {
            es_free_message(pesMsg);
            return FALSE;
        }
    }
    return TRUE;
}

Here is the code for NOTIFY_CLOSE:

BOOL NotifyEventHandler(es_message_t *pesMessage)
{
    if (pesMessage->event_type == ES_EVENT_TYPE_NOTIFY_CLOSE && YES == pesMessage->event.close.modified)
    {
        __block es_message_t *pesTempMsg;
        pesTempMsg = es_copy_message(pesMessage);
        
        dispatch_async(gNotifyCloseQueue, ^{
            
            // Performing Some processing on es_message_t
            
            if (pesTempMsg != NULL)
            {
                es_free_message(pesTempMsg);
            }
        });
        
        if (pesMessage != NULL)
        {
            es_free_message(pesMessage);
        }
    }
    else
    {
        es_free_message(pesMessage);
    }
    return TRUE;
}

It would be helpful if someone could help us identify what we are doing wrong in the code above and how to solve those problems (a code snippet would be helpful) so that we avoid all possible crashes.

...

Thanks & Regards,

Mohamed Vasim

Answered by DTS Engineer in 796051022

> Couple of events we have subscribed and for AUTH related events we are receiving deadline of 14 seconds

A few quick notes on this:

  • You should be aware that there are cases where the deadline could be significantly smaller than ~15s, generally because the request is coming from a system component where delaying events could cause system stability issues. Most of those processes are included in the default mute set (you can retrieve the currently muted processes through the different "es_muted..." APIs), but you can still see them, particularly if you alter the muted process list.

  • You must be able to process events FAR faster than the deadline you receive, with the general guideline being <100ms. The main reason the deadline exists is NOT to describe the available processing time, but to ensure your extension has time to process events even when the system is under VERY heavy load and/or subject to SEVERE scheduling delays.

  • If you're doing I/O (file or network) as part of responding to auth events, you basically just need to stop doing that entirely. The issue here isn't simply performance (which is a huge problem), it's that any attacker is likely to have enough access to the system that they can skew I/O performance to the point that the system kills you instead.
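
To make the first two points concrete, here is a minimal sketch (not an official pattern) of deriving the time actually remaining from a message's deadline instead of assuming 14 seconds. The deadline field of es_message_t is a Mach absolute time, so on macOS you would pass in mach_absolute_time() and the numer/denom values from mach_timebase_info(). The helper names, the timebase_t struct, and the "use at most half of what remains" rule are assumptions made for illustration; only the 100 ms target comes from the guideline above.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the numer/denom pair reported by mach_timebase_info() (hypothetical type). */
typedef struct {
    uint32_t numer;
    uint32_t denom;
} timebase_t;

/* Target processing budget per auth event, from the <100ms guideline. */
#define TARGET_BUDGET_NS 100000000ULL

/* Convert Mach ticks to nanoseconds, splitting the multiply to limit overflow. */
static uint64_t ticks_to_ns(uint64_t ticks, timebase_t tb)
{
    assert(tb.denom != 0);
    uint64_t whole = (ticks / tb.denom) * tb.numer;
    uint64_t rem = (ticks % tb.denom) * tb.numer / tb.denom;
    return whole + rem;
}

/* Nanoseconds left before the deadline; 0 if the deadline has already passed. */
static uint64_t remaining_ns(uint64_t deadline_ticks, uint64_t now_ticks, timebase_t tb)
{
    if (now_ticks >= deadline_ticks) {
        return 0;
    }
    return ticks_to_ns(deadline_ticks - now_ticks, tb);
}

/* Assumed policy: never plan to use more than half of what is left,
 * and never more than the 100 ms target. */
static uint64_t effective_budget_ns(uint64_t remaining)
{
    uint64_t half = remaining / 2;
    return half < TARGET_BUDGET_NS ? half : TARGET_BUDGET_NS;
}
```

With a helper like this, the handler can size its fallback timer from the real deadline of each message, which matters for the short-deadline cases described in the first bullet.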

In terms of the GCD limits here:

> There is no GCD API to check whether queue is reached to soft limit so we need help here to know or check whether queue is reached to soft limit 64.

Keep in mind that most of the GCD limits (total thread counts as well as the block limit) are set WELL above what's likely to provide optimum behavior. You're almost certainly better off using a much smaller limit (my guess would be 4-8 threads?).

In any case, while GCD doesn't let you control the queue width, NSOperationQueue.maxConcurrentOperationCount does. You can use NSOperationQueue as a fairly direct equivalent to a dispatch queue; it also provides additional functionality such as cancellation and a base class you can use to represent work (in addition to blocks).
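
If you want to cap the amount of work in flight yourself rather than rely on the queue, one possible shape is a small admission gate. This is a portable C11 sketch under my own assumptions, not a GCD or Endpoint Security API: work_gate_t, gate_try_enter and gate_leave are hypothetical names, and the width you pick (for example something in the 4-8 range suggested above) is up to you. When the gate is full, the caller would respond with a default verdict immediately instead of queuing more blocked work.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical admission gate: tracks how many work items are in flight. */
typedef struct {
    atomic_int in_flight;
    int max_in_flight;
} work_gate_t;

static void gate_init(work_gate_t *g, int max_in_flight)
{
    atomic_init(&g->in_flight, 0);
    g->max_in_flight = max_in_flight;
}

/* Try to claim a slot. Returns false when the cap is reached, in which
 * case the caller should not enqueue more work. */
static bool gate_try_enter(work_gate_t *g)
{
    int cur = atomic_load(&g->in_flight);
    while (cur < g->max_in_flight) {
        /* On failure the CAS reloads cur, so the loop re-checks the cap. */
        if (atomic_compare_exchange_weak(&g->in_flight, &cur, cur + 1)) {
            return true;
        }
    }
    return false;
}

/* Release a slot once the work item has finished (or responded). */
static void gate_leave(work_gate_t *g)
{
    atomic_fetch_sub(&g->in_flight, 1);
}
```

The handler would call gate_try_enter() before dispatching and the work block would call gate_leave() as its last step. NSOperationQueue with maxConcurrentOperationCount, as described above, gives a similar bound without hand-written bookkeeping.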

> And for NOTIFY_CLOSE, we are getting big value in seconds as deadline however we are adding all the processing of NOTIFY_CLOSE with dispatch_async however still receiving the crash.

There are actually two issues to be aware of:

  1. The engineering team recommends creating multiple ES clients (using es_new_client()) and then handling auth and notify events on separate clients. The issue here is that auth and notify events have totally different processing and delivery goals. Notify events are designed to be queued up and delivered in batches, while auth events need to be delivered and responded to with the lowest possible latency. Processing both on the same client means that notify events can't really be queued (minimizing any benefit) and the "noise" of notify events will delay auth events. Note that this is almost certainly what actually caused your crash: not the NOTIFY_CLOSE event itself, but the auth event the system wasn't able to deliver before the deadline was reached.

  2. In most cases, you're better off processing notify events "serially", instead of in parallel (like auth events). "Fast" processing generally isn't critical to your overall functionality and you'll typically get better overall throughput by processing them sequentially on a single thread instead of throwing more threads at the problem.
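
As a rough illustration of the second point, the hand-off can look like this: the notify handler only enqueues and returns, and a single consumer drains items in arrival order. On macOS a serial dispatch queue (DISPATCH_QUEUE_SERIAL) gives you this ordering directly; the portable sketch below only shows the shape. All names and the capacity are hypothetical, and the stored pointers stand in for copied or retained messages.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NOTIFY_QUEUE_CAPACITY 1024 /* assumed bound; tune for your workload */

/* Hypothetical bounded FIFO shared by the notify handler and one consumer. */
typedef struct {
    void *items[NOTIFY_QUEUE_CAPACITY]; /* e.g. copied message pointers */
    size_t head;                        /* index of the oldest item */
    size_t count;                       /* number of queued items */
    pthread_mutex_t lock;
} notify_fifo_t;

static void fifo_init(notify_fifo_t *q)
{
    q->head = 0;
    q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
}

/* Called from the notify handler: constant time, never waits on processing.
 * Returns false when the queue is full so the caller can drop or count it. */
static bool fifo_push(notify_fifo_t *q, void *item)
{
    bool ok = false;
    pthread_mutex_lock(&q->lock);
    if (q->count < NOTIFY_QUEUE_CAPACITY) {
        q->items[(q->head + q->count) % NOTIFY_QUEUE_CAPACITY] = item;
        q->count++;
        ok = true;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Called only from the single consumer thread; returns NULL when empty. */
static void *fifo_pop(notify_fifo_t *q)
{
    void *item = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->count > 0) {
        item = q->items[q->head];
        q->head = (q->head + 1) % NOTIFY_QUEUE_CAPACITY;
        q->count--;
    }
    pthread_mutex_unlock(&q->lock);
    return item;
}
```

Because only one thread ever pops, notify processing never occupies more than one worker, which leaves the thread pool free for auth responses.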

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

