Loss of connectivity with DNS proxy and content filter extensions running at the same time

The environment:

  • macOS Ventura 13.3.1 on a Mac mini (2018 model)
  • A content filter extension (content-filter-provider), configured with filterSockets = YES, filterPackets = NO
  • A DNS proxy extension (dns-proxy)
  • An Ethernet network and two Wi-Fi access points
  • OpenVPN

The exact conditions to reproduce the issue are obscure. So far, we managed to narrow it down to:

  • Connect to the Ethernet network
  • Connect to Wi-Fi access point A, and wait a couple of minutes
  • Connect to Wi-Fi access point B (in our tests, this is a mobile hotspot; not sure if relevant)
  • Establish an OpenVPN tunnel
  • Put the machine to sleep, and wait 15 minutes
  • Disable Wi-Fi access point B, and wait 3 hours
  • Wake up the machine

The result is a complete loss of network connectivity: the machine can't even renew its DHCP lease.

The content filter logs new flows without blocking or pausing them. If a flow is missing some address information, it peeks at 1 incoming byte or 1 outgoing byte, whichever comes first ([NEFilterNewFlowVerdict filterDataVerdictWithFilterInbound:YES peekInboundBytes:1 filterOutbound:YES peekOutboundBytes:1]), in the hope of getting more complete flow metadata once data is being exchanged. Most importantly, if a flow looks like a DNS request over UDP, the content filter sniffs all incoming data, asking for at most 10000 bytes at a time ([NEFilterNewFlowVerdict filterDataVerdictWithFilterInbound:YES peekInboundBytes:10000 filterOutbound:YES peekOutboundBytes:1]). The DNS answer sniffer has caused a conflict with the DNS proxy in the past, but we fixed it.
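
Condensed into code, that verdict logic is roughly the following (flowLooksLikeDNSOverUDP, flowHasCompleteMetadata and logFlow: are stand-ins for our real helpers, not actual API):

```objc
// Sketch of our handleNewFlow: override; the helper functions are
// placeholders for our real (private) checks. We only observe flows,
// we never block or pause them.
- (NEFilterNewFlowVerdict *)handleNewFlow:(NEFilterFlow *)flow
{
    [self logFlow:flow];

    if (flowLooksLikeDNSOverUDP(flow)) {
        // Sniff the DNS answers: keep receiving inbound data,
        // up to 10000 bytes per callback.
        return [NEFilterNewFlowVerdict filterDataVerdictWithFilterInbound:YES
                                                         peekInboundBytes:10000
                                                           filterOutbound:YES
                                                        peekOutboundBytes:1];
    }

    if (!flowHasCompleteMetadata(flow)) {
        // Peek 1 byte in each direction, hoping the flow metadata
        // fills in once data starts moving.
        return [NEFilterNewFlowVerdict filterDataVerdictWithFilterInbound:YES
                                                         peekInboundBytes:1
                                                           filterOutbound:YES
                                                        peekOutboundBytes:1];
    }

    return [NEFilterNewFlowVerdict allowVerdict];
}
```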

I don't know the DNS proxy in as much detail, but I know that it has both blacklisting and proxying features (it proxies requests to a Cloudflare API, I believe), so it's a little more complex than a simple pass-through like the content filter.

The last event logged by the content filter is, in fact, a DNS query whose answer is being sniffed. Then it stops logging anything, coinciding with the system-wide loss of connectivity (verified from the logs of other applications and services).

Spindump shows that the content filter is idle, stuck in NetworkExtension internals, which in turn are stuck in an event wait (lck_mtx_sleep). Also of note, mDNSResponder and nesessionmanager are both stuck in a kernel-mode call to soflow_db_apply, as a result of closing a socket. In all cases, by "stuck" I mean that the thread was found executing that code in 1000 samples out of 1000, with no CPU time used.

An anonymized subset of the log is attached.

Are we doing anything obviously wrong? Any gotchas that we may have tripped over in our extension implementations? Or is it possibly a system bug that we should open a radar for? Thank you.

Answered by tstudent in 754894022

Almost forgot: how do we know that the content filter is relevant? Because if we disable, kill, or uninstall the system extension process, connectivity goes back to (almost) normal. We haven't tried killing the DNS proxy to see if it does anything yet, because reproducing the issue is so time-consuming and error-prone.

Almost forgot: how do we know that the content filter is relevant?

It is very hard to pinpoint exactly what is happening here without narrowing down the scope. In situations like this I recommend reducing the scenario as much as possible. For example, on a fresh machine, or a VM, ONLY install the NEFilterDataProvider, and run through the reproduction steps above. If the issue reproduces, grab some logs and post them here. If the issue does not reproduce, then install the NEDNSProxyProvider and run through the scenario again. The VPN software is not an Apple API, so I cannot recommend anything there; it would be best to start with the APIs that you are using.

Accepted Answer

Apologies for the confusion, @meaton, but it wasn't a question; I was anticipating one. We know for a fact that the content filter extension is involved, because as soon as it's gone (uninstalling, disabling, and killing it all work), network connectivity is regained (with some residual issues, though, e.g. Chrome works fine but Safari doesn't).

Yes, we are trying to isolate the issue by running further tests with some of the variables changed. In fact, I believe we're currently performing a test with the NEFilterDataProvider alone.

I'm not sure I understand what you mean by "start with the APIs that you are using". Do you want me to give you an overview of the APIs used by the network extensions, or are you referring to the tests you suggested, where we test one API at a time?

Finally, I believe that I have successfully symbolicated the private kernel functions in the stack backtraces of the stuck threads. My best guess for the network extension thread is:

1000  -[NEFilterSocketFlow createDataReply:controlSocket:direction:verdict:context:] + 372 (NetworkExtension + 395450) [0x7ff8183728ba]
  1000  <patched truncated backtrace>
    1000  write + 10 (libsystem_kernel.dylib + 17242) [0x7ff80781a35a]
     *1000  hndl_unix_scall64 + 22 (kernel + 1265046) [0xffffff8000410d96]
       *1000  unix_syscall64 + 480 (kernel + 7880736) [0xffffff8000a60020]
         *1000  write + 289 (kernel + 6863105) [0xffffff8000967901]
           *1000  writev_internal ??? (kernel + 6862327) [0xffffff80009675f7]
             *1000  fp_writev ??? (kernel + 6861353) [0xffffff8000967229]
               *1000  soo_write ??? (kernel + 6883640) [0xffffff800096c938]
                 *1000  sosend + 2519 (kernel + 7064071) [0xffffff8000998a07]
                   *1000  ctl_send ??? (kernel + 6536106) [0xffffff8000917baa]
                     *1000  cfil_ctl_send ??? (kernel + 4997925) [0xffffff80007a0325]
                       *1000  cfil_action_data_pass ??? (kernel + 5007513) [0xffffff80007a2899]
                         *1000  cfil_acquire_sockbuf ??? (kernel + 5015481) [0xffffff80007a47b9]
                           *1000  _sleep ??? (kernel + 6729239) [0xffffff8000946e17]
                             *1000  lck_mtx_sleep + 101 (kernel + 1753861) [0xffffff8000488305]

It looks like the thread takes the socket referenced by the NEFilterSocketFlow and attempts to acquire the lock on its send or receive buffer, but the lock was acquired by another thread and never released (for hours, even days).

For the mDNSResponder thread stuck in close, I guessed:

1000  _dispatch_source_cancel_callout + 153 (libdispatch.dylib + 98497) [0x7ff8076ca0c1]
  1000  _dispatch_continuation_pop + 463 (libdispatch.dylib + 23397) [0x7ff8076b7b65]
    1000  _dispatch_client_callout + 8 (libdispatch.dylib + 12339) [0x7ff8076b5033]
      1000  <patched truncated backtrace>
        1000  close + 10 (libsystem_kernel.dylib + 8134) [0x7ff807817fc6]
         *1000  hndl_unix_scall64 + 22 (kernel + 1265046) [0xffffff8000410d96]
           *1000  unix_syscall64 + 480 (kernel + 7880736) [0xffffff8000a60020]
             *1000  sys_close ??? (kernel + 6451958) [0xffffff80009032f6]
               *1000  soo_close ??? (kernel + 6448665) [0xffffff8000902619]
                 *1000  soclose + 169 (kernel + 7055673) [0xffffff8000996939]
                   *1000  soclose_locked + 226 (kernel + 7052818) [0xffffff8000995e12]
                     *1000  soflow_db_apply + 146 (kernel + 7307298) [0xffffff80009d4022]
                       *1000  cfil_sock_udp_is_closed ??? (kernel + 5047815) [0xffffff80007ac607]
                         *1000  cfil_acquire_sockbuf ??? (kernel + 5015481) [0xffffff80007a47b9]
                           *1000  _sleep ??? (kernel + 6729239) [0xffffff8000946e17]
                             *1000  lck_mtx_sleep + 101 (kernel + 1753861) [0xffffff8000488305]

I'm less confident about the nesessionmanager thread, but it could be:

1000  _dispatch_call_block_and_release + 12 (libdispatch.dylib + 7569) [0x7ff8076b3d91]
  1000  ??? (nesessionmanager + 220610) [0x10ccb3dc2]
    1000  ??? (nesessionmanager + 73682) [0x10cc8ffd2]
      1000  _objc_rootDealloc + 62 (libobjc.A.dylib + 40383) [0x7ff8074c3dbf]
        1000  objc_destructInstance + 99 (libobjc.A.dylib + 40497) [0x7ff8074c3e31]
          1000  object_cxxDestructFromClass(objc_object*, objc_class*) + 83 (libobjc.A.dylib + 68187) [0x7ff8074caa5b]
            1000  <patched truncated backtrace>
              1000  close + 10 (libsystem_kernel.dylib + 8134) [0x7ff807817fc6]
               *1000  hndl_unix_scall64 + 22 (kernel + 1265046) [0xffffff8000410d96]
                 *1000  unix_syscall64 + 480 (kernel + 7880736) [0xffffff8000a60020]
                   *1000  sys_close ??? (kernel + 6451958) [0xffffff80009032f6]
                     *1000  soo_close ??? (kernel + 6448665) [0xffffff8000902619]
                       *1000  soclose + 169 (kernel + 7055673) [0xffffff8000996939]
                         *1000  soclose_locked + 1381 (kernel + 7053973) [0xffffff8000996295]
                           *1000  ??? (kernel + 6534839) [0xffffff80009176b7]
                             *1000  ??? (kernel + 5586252) [0xffffff800082fd4c]
                               *1000  sbflush + 88 (kernel + 7105400) [0xffffff80009a2b78]
                                 *1000  soflow_db_apply + 146 (kernel + 7307298) [0xffffff80009d4022]
                                   *1000  cfil_sock_udp_buf_update_per_flow ??? (kernel + 5050092) [0xffffff80007aceec]
                                     *1000  cfil_acquire_sockbuf ??? (kernel + 5015481) [0xffffff80007a47b9]
                                       *1000  _sleep ??? (kernel + 6729239) [0xffffff8000946e17]
                                         *1000  lck_mtx_sleep + 101 (kernel + 1753861) [0xffffff8000488305]

In all cases, there is a call to cfil_acquire_sockbuf, stuck in a permanent wait. Short of a kernel debugger, which I've never used, I don't know how I could get a better idea of what is holding the lock on the socket buffer, and why.
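
For reference, cfil_acquire_sockbuf lives in bsd/net/content_filter.c in the open-source xnu tree; paraphrasing it very loosely (this is an illustration, not the actual kernel code, and the details may differ between releases), the wait we're stuck in looks like:

```c
/* Loose paraphrase of xnu's cfil_acquire_sockbuf()
 * (bsd/net/content_filter.c); illustration only. */
while ((sb->sb_flags & SB_LOCK) ||
       (sb->sb_cfil_thread != NULL && sb->sb_cfil_thread != current_thread())) {
    sb->sb_wantlock++;
    /* This is what shows up as _sleep/lck_mtx_sleep in the spindumps */
    msleep(&sb->sb_flags, mutex_held, PSOCK, "cfil_acquire_sockbuf", NULL);
    sb->sb_wantlock--;
}
sb->sb_cfil_thread = current_thread();  /* ownership recorded here */
sb->sb_flags |= SB_LOCK;
```

If the matching cfil_release_sockbuf() never runs on the owning thread, every other thread touching that socket buffer sleeps here forever, which matches what we're seeing.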

Any ideas are welcome.

Sorry, I'm new to the forums and I accidentally accepted my comment above as the answer. It isn't, but I can't remove the checkmark.

In fact, I believe we're currently performing a test with the NEFilterDataProvider alone

Excellent. Does this issue reproduce if you run a NEFilterDataProvider alone, with minimal code to get the provider up and running? Say, for example, with the Simple Firewall sample project?

Regarding:

I'm not sure I understand what you mean "start with the APIs that you are using"

What I meant was that since the OpenVPN software is a third-party piece of software, it's best to describe just the bare-bones APIs in the providers that you created, such as the NEFilterDataProvider and the NEDNSProxyProvider. Stripping those providers down to the lowest possible functionality and then trying to reproduce the issue with each provider, one at a time, is often the best way to narrow down issues in a complex environment.

@meaton I can already tell that Simple Firewall isn't going to trigger the issue: in its handleNewFlow method, it only returns allow, drop, and pause verdicts. I strongly suspect that this issue is related to filter data verdicts, as we already fixed (worked around, really) a similar issue in the past that disappeared when we stopped using filter data verdicts.

We'll work to minimize the test case and the reproduction steps, and open a feedback report, as we've received confirmation via a WWDC Q&A that Apple is already aware of issues with DNS proxy and content filter extensions.
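
When we file it, we'll attach the NetworkExtension subsystem logs from around the failure; something along these lines (assuming default logging configuration):

```shell
# Collect recent NetworkExtension activity (provider lifecycle,
# flow handling); exact messages vary between macOS releases.
log show --last 30m \
    --predicate 'subsystem == "com.apple.networkextension"' \
    --info --debug > networkextension.log
```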

@meaton I have more information. As a last resort, we tried adding a mechanism to kill the system extension when the issue occurs, as that appeared to fix the issue in previous tests. It doesn't fix the issue, but it does change its root cause. Elsewhere in our application, we periodically submit an NSURLSessionTask for an HTTP POST request. When connectivity is lost, the request fails with the following error:

Error Domain=NSURLErrorDomain Code=-1001 "The request timed out." UserInfo={_kCFStreamErrorCodeKey=-2103, _NSURLErrorFailingURLSessionTaskErrorKey=LocalUploadTask <03BFA491-37B2-419C-B69A-C3217558760B>.<1>, _NSURLErrorRelatedURLSessionTaskErrorKey=(
    "LocalUploadTask <03BFA491-37B2-419C-B69A-C3217558760B>.<1>"
), NSLocalizedDescription=The request timed out., NSErrorFailingURLStringKey=<URL>, NSErrorFailingURLKey=<URL>, _kCFStreamErrorDomainKey=4}

Our watchdog mechanism notices that the network extension isn't being notified of new network flows anymore (we have a separate mechanism that periodically starts a new UDP flow, to ensure a constant stream of calls to -[NEFilterDataProvider handleNewFlow:]; if we stop receiving calls to that method, it must be some kind of system issue/bug), and kills the network extension process with a KILL signal (don't judge us, we really need this to work). This does not restore connectivity; in fact, the aforementioned NSURLSessionTask starts failing with a different error:

Error Domain=NSURLErrorDomain Code=-1009 "The Internet connection appears to be offline." UserInfo={_kCFStreamErrorCodeKey=50, NSUnderlyingError=0x60000355c180 {Error Domain=kCFErrorDomainCFNetwork Code=-1009 "(null)" UserInfo={_NSURLErrorNWPathKey=unsatisfied (Path was denied by NECP policy), interface: en1[802.11], ipv4, dns, _kCFStreamErrorCodeKey=50, _kCFStreamErrorDomainKey=1}}, _NSURLErrorFailingURLSessionTaskErrorKey=LocalUploadTask <F7A60E10-A65C-4D1F-9954-8A0325B2CB88>.<1>, _NSURLErrorRelatedURLSessionTaskErrorKey=(
    "LocalUploadTask <F7A60E10-A65C-4D1F-9954-8A0325B2CB88>.<1>"
), NSLocalizedDescription=The Internet connection appears to be offline., NSErrorFailingURLStringKey=<URL>, NSErrorFailingURLKey=<URL>, _kCFStreamErrorDomainKey=1}

Note that the failure is in DNS specifically (interface: en1[802.11], ipv4, dns).
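
The watchdog's probe flow mentioned above is trivial: a throwaway UDP datagram. A minimal sketch of what it does (the destination host and port here are placeholders, not what we actually use):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fire a throwaway UDP datagram so the content filter sees a brand-new
 * flow. Any destination the filter can observe will do.
 * Returns 0 on success, -1 on failure. */
static int fireProbeFlow(const char *host, uint16_t port)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, host, &addr.sin_addr) != 1) {
        close(fd);
        return -1;
    }

    const char payload[] = "probe";
    ssize_t sent = sendto(fd, payload, sizeof(payload), 0,
                          (const struct sockaddr *)&addr, sizeof(addr));
    close(fd); /* closing ensures the next probe is a distinct flow */
    return sent < 0 ? -1 : 0;
}
```

Each such datagram should produce one handleNewFlow: call on the filter side, so a stall is easy to detect with a timer.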

Needless to say, we don't explicitly invoke any NECP policies. This is how we create and submit the POST request:

void sendHttpRequest(NSURL *url,
                     const char *jwt,
                     NSString *userAgent,
                     NSData *data,
                     void (^completionHandler)(NSData *_Nullable data, NSURLResponse *_Nullable response, NSError *_Nullable error))
{
    NSURLSessionConfiguration *const configuration = [NSURLSessionConfiguration defaultSessionConfiguration];

    configuration.TLSMinimumSupportedProtocolVersion = tls_protocol_version_TLSv12;
    configuration.timeoutIntervalForResource = 30;
    configuration.requestCachePolicy = NSURLRequestReloadIgnoringLocalAndRemoteCacheData;
    configuration.HTTPCookieStorage = nil;
    configuration.URLCredentialStorage = nil;
    configuration.URLCache = nil;

    NSURLSession *const session = [NSURLSession sessionWithConfiguration:configuration];

    NSMutableURLRequest *const request = [NSMutableURLRequest requestWithURL:url];

    [request setValue:@"application/json" forHTTPHeaderField:@"Content-Type"];
    [request setValue:[NSString stringWithFormat:@"Bearer %s", jwt] forHTTPHeaderField:@"Authorization"];
    [request setValue:userAgent forHTTPHeaderField:@"User-Agent"];

    request.HTTPMethod = @"POST";

    NSURLSessionTask *const task = [session uploadTaskWithRequest:request fromData:data completionHandler:completionHandler];

    [task resume];

    // A fresh session is created per request, so tear it down once the
    // task finishes; otherwise each call leaks a session.
    [session finishTasksAndInvalidate];
}

Spindump shows why killing the process didn't work:


Process:          ... [2231] (suspended) (zombie)
UUID:             ...
Shared Cache:     None
Architecture:     x86_64
Parent:           launchd [1]
UID:              0
Footprint:        4096  B
Time Since Fork:  143146s
Num samples:      1001 (1-1001)
Note:             Suspended for 1001 samples
Note:             Terminated (zombie) for 1001 samples

  Thread 0x10187    1001 samples (1-1001)    priority 31 (base 31)
  1001  <truncated backtrace>
   *1001  hndl_unix_scall64 + 22 (kernel + 1265046) [0xffffff8000410d96]
     *1001  unix_syscall64 + 480 (kernel + 7880736) [0xffffff8000a60020]
       *1001  write + 289 (kernel + 6863105) [0xffffff8000967901]
         *1001  ??? (kernel + 6862327) [0xffffff80009675f7]
           *1001  ??? (kernel + 6861353) [0xffffff8000967229]
             *1001  ??? (kernel + 6883640) [0xffffff800096c938]
               *1001  sosend + 2519 (kernel + 7064071) [0xffffff8000998a07]
                 *1001  ??? (kernel + 6536106) [0xffffff8000917baa]
                   *1001  ??? (kernel + 4997925) [0xffffff80007a0325]
                     *1001  ??? (kernel + 5007513) [0xffffff80007a2899]
                       *1001  ??? (kernel + 5015481) [0xffffff80007a47b9]
                         *1001  ??? (kernel + 6729239) [0xffffff8000946e17]
                           *1001  lck_mtx_sleep + 101 (kernel + 1753861) [0xffffff8000488305]
                             *1001  thread_block_reason + 192 (kernel + 1828864) [0xffffff800049a800]
                               *1001  ??? (kernel + 1837097) [0xffffff800049c829]
                                 *1001  machine_switch_context + 215 (kernel + 2961047) [0xffffff80005aee97] (suspended)

  Binary Images:
           0x109935000 -                ???  ???                 ...
   *0xffffff80002dc000 - 0xffffff8000cdbfff  kernel (8796.101.5) <CF2A42DA-3F7C-30C6-9433-6F2076FF1F94>  /System/Library/Kernels/kernel

The process was killed, but the stuck thread is blocked in an uninterruptible kernel-mode wait, so the process remains a zombie, still exhibiting symptoms of the kernel-mode deadlock.

We encountered the same problem. Have you found a solution?
