The environment:
- macOS Ventura 13.3.1 on a Mac mini (2018 model)
- A content filter extension (content-filter-provider), configured with filterSockets = YES, filterPackets = NO (a configuration sketch follows this list)
- A DNS proxy extension (dns-proxy)
- An Ethernet network and two Wi-Fi access points
- OpenVPN
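For context, the content filter is enabled roughly as in the sketch below. This is a simplified illustration, not the exact installer code; the function name, description string, and error handling are placeholders.

```objc
#import <NetworkExtension/NetworkExtension.h>

// Simplified sketch of enabling a content filter with
// filterSockets = YES and filterPackets = NO.
static void InstallContentFilterConfiguration(void) {
    NEFilterManager *manager = [NEFilterManager sharedManager];
    [manager loadFromPreferencesWithCompletionHandler:^(NSError *loadError) {
        if (loadError != nil) {
            NSLog(@"Failed to load filter preferences: %@", loadError);
            return;
        }

        NEFilterProviderConfiguration *configuration = [[NEFilterProviderConfiguration alloc] init];
        configuration.filterSockets = YES; // flow-level filtering via NEFilterDataProvider
        configuration.filterPackets = NO;  // no NEFilterPacketProvider

        manager.providerConfiguration = configuration;
        manager.localizedDescription = @"content-filter-provider";
        manager.enabled = YES;

        [manager saveToPreferencesWithCompletionHandler:^(NSError *saveError) {
            if (saveError != nil) {
                NSLog(@"Failed to save filter preferences: %@", saveError);
            }
        }];
    }];
}
```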
The exact conditions to reproduce the issue are obscure. So far, we managed to narrow it down to:
- Connect to the Ethernet network
- Connect to Wi-Fi access point A, and wait a couple minutes
- Connect to Wi-Fi access point B (in our tests, this is a mobile hotspot; not sure if relevant)
- Establish an OpenVPN tunnel
- Put the machine to sleep, and wait 15 minutes
- Disable Wi-Fi access point B, and wait 3 hours
- Wake up the machine
The result is a complete loss of network connectivity: the machine can't even renew the DHCP lease
The content filter logs new flows without blocking or pausing them. If a flow is missing some address information, it peeks 1 incoming byte or 1 outgoing byte, whichever comes first ([NEFilterNewFlowVerdict filterDataVerdictWithFilterInbound:YES peekInboundBytes:1 filterOutbound:YES peekOutboundBytes:1]), in the hope of getting more complete flow metadata once data is being exchanged. Most importantly, if a flow looks like a DNS request over UDP, the content filter sniffs all incoming data, asking for at most 10000 bytes at a time ([NEFilterNewFlowVerdict filterDataVerdictWithFilterInbound:YES peekInboundBytes:10000 filterOutbound:YES peekOutboundBytes:1]). The DNS answer sniffer has already conflicted with the DNS proxy in the past, but we fixed that issue.
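In code, the verdict logic is roughly the sketch below. It is simplified: the port-53 test and the empty-hostname test merely stand in for the real heuristics, and in the real provider the peeked bytes are then delivered to handleInboundDataFromFlow:readBytesStartOffset:readBytes: and its outbound counterpart for logging.

```objc
#import <NetworkExtension/NetworkExtension.h>
#import <netinet/in.h>

@interface FilterDataProvider : NEFilterDataProvider
@end

@implementation FilterDataProvider

- (NEFilterNewFlowVerdict *)handleNewFlow:(NEFilterFlow *)flow {
    // Log-only filter: flows are never blocked or paused.
    // With filterSockets = YES and filterPackets = NO, every flow is a socket flow.
    NEFilterSocketFlow *socketFlow = (NEFilterSocketFlow *)flow;
    NWHostEndpoint *remoteEndpoint = (NWHostEndpoint *)socketFlow.remoteEndpoint;

    // A UDP flow towards port 53 looks like a DNS request: sniff the whole
    // answer, asking for at most 10000 inbound bytes at a time.
    if (socketFlow.socketProtocol == IPPROTO_UDP &&
        [remoteEndpoint.port isEqualToString:@"53"]) {
        return [NEFilterNewFlowVerdict filterDataVerdictWithFilterInbound:YES
                                                         peekInboundBytes:10000
                                                           filterOutbound:YES
                                                        peekOutboundBytes:1];
    }

    // Incomplete address information: peek a single byte in each direction so
    // the flow is seen again once data is exchanged and metadata is richer.
    if (remoteEndpoint.hostname.length == 0) {
        return [NEFilterNewFlowVerdict filterDataVerdictWithFilterInbound:YES
                                                         peekInboundBytes:1
                                                           filterOutbound:YES
                                                        peekOutboundBytes:1];
    }

    return [NEFilterNewFlowVerdict allowVerdict];
}

@end
```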
I don't know the DNS proxy in as much detail, but I know that it has both blacklisting and proxying features (it proxies requests to a Cloudflare API, I believe), so it's a little more complex than a simple pass-through like the content filter.
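For what it's worth, my mental model of it is the standard NEDNSProxyProvider shape, roughly as sketched below. This is not the actual code (which I don't have); the blacklist check and the upstream forwarding are placeholders.

```objc
#import <NetworkExtension/NetworkExtension.h>

@interface DNSProxyProvider : NEDNSProxyProvider
@end

@implementation DNSProxyProvider

// Hypothetical stand-in for the blacklisting feature.
static BOOL QueryIsBlacklisted(NSData *query) {
    return NO;
}

- (BOOL)handleNewFlow:(NEAppProxyFlow *)flow {
    if (![flow isKindOfClass:[NEAppProxyUDPFlow class]]) {
        return NO; // don't take ownership of non-UDP flows
    }
    NEAppProxyUDPFlow *udpFlow = (NEAppProxyUDPFlow *)flow;

    [udpFlow openWithLocalEndpoint:nil completionHandler:^(NSError *openError) {
        if (openError != nil) {
            return;
        }
        [udpFlow readDatagramsWithCompletionHandler:^(NSArray<NSData *> *datagrams,
                                                      NSArray<NWEndpoint *> *endpoints,
                                                      NSError *readError) {
            if (readError != nil || datagrams.count == 0) {
                [udpFlow closeReadWithError:readError];
                [udpFlow closeWriteWithError:readError];
                return;
            }
            for (NSData *query in datagrams) {
                if (QueryIsBlacklisted(query)) {
                    continue; // blacklisted: drop the query
                }
                // Placeholder: the real extension presumably forwards the query
                // to its upstream (Cloudflare) resolver and writes the answer
                // back with -writeDatagrams:sentByEndpoints:completionHandler:.
            }
        }];
    }];
    return YES; // the proxy takes ownership of the flow
}

@end
```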
The last event logged by the content filter is, in fact, a DNS query whose answer is being sniffed. Then it stops logging anything, coinciding with the system-wide loss of connectivity (verified from the logs of other applications and services)
Spindump shows that the content filter is idle, stuck in NetworkExtension internals, which in turn are stuck in an event wait (lck_mtx_sleep). Also of note, mDNSResponder and nesessionmanager are both stuck in a kernel-mode call to soflow_db_apply as a result of closing a socket. In all cases, by "stuck" I mean that the thread was found executing that code in 1000 samples out of 1000, with no CPU time used
An anonymized subset of the log is attached.
Are we doing anything obviously wrong? Any gotchas that we may have tripped over in our extension implementations? Or is it possibly a system bug that we should open a radar for? Thank you
Apologies for the confusion, @meaton, but it wasn't a question; I was anticipating one. We know for a fact that the content filter extension is involved, because as soon as it's gone (uninstalling, disabling, or killing it all work), network connectivity is regained (with some residual issues, though: e.g. Chrome works fine but Safari doesn't)
Yes, we are trying to isolate the issue by making further tests with some of the variables changed. In fact, I believe we're currently performing a test with the NEFilterDataProvider alone
I'm not sure I understand what you mean by "start with the APIs that you are using". Do you want me to give you an overview of the APIs used by the network extensions, or are you referring to the tests you suggested we perform, where we test one API at a time?
Finally, I believe that I have successfully decoded the private kernel APIs in the stack backtraces of the stuck threads. My best guess for the network extension thread is:
1000 -[NEFilterSocketFlow createDataReply:controlSocket:direction:verdict:context:] + 372 (NetworkExtension + 395450) [0x7ff8183728ba]
1000 <patched truncated backtrace>
1000 write + 10 (libsystem_kernel.dylib + 17242) [0x7ff80781a35a]
*1000 hndl_unix_scall64 + 22 (kernel + 1265046) [0xffffff8000410d96]
*1000 unix_syscall64 + 480 (kernel + 7880736) [0xffffff8000a60020]
*1000 write + 289 (kernel + 6863105) [0xffffff8000967901]
*1000 writev_internal ??? (kernel + 6862327) [0xffffff80009675f7]
*1000 fp_writev ??? (kernel + 6861353) [0xffffff8000967229]
*1000 soo_write ??? (kernel + 6883640) [0xffffff800096c938]
*1000 sosend + 2519 (kernel + 7064071) [0xffffff8000998a07]
*1000 ctl_send ??? (kernel + 6536106) [0xffffff8000917baa]
*1000 cfil_ctl_send ??? (kernel + 4997925) [0xffffff80007a0325]
*1000 cfil_action_data_pass ??? (kernel + 5007513) [0xffffff80007a2899]
*1000 cfil_acquire_sockbuf ??? (kernel + 5015481) [0xffffff80007a47b9]
*1000 _sleep ??? (kernel + 6729239) [0xffffff8000946e17]
*1000 lck_mtx_sleep + 101 (kernel + 1753861) [0xffffff8000488305]
It looks like the thread is taking the socket referenced by the NEFilterSocketFlow and attempting to acquire the lock on its send or receive buffer, but the lock has been acquired by another thread and never released (for hours or days)
For the mDNSResponder thread stuck in close, I guessed:
1000 _dispatch_source_cancel_callout + 153 (libdispatch.dylib + 98497) [0x7ff8076ca0c1]
1000 _dispatch_continuation_pop + 463 (libdispatch.dylib + 23397) [0x7ff8076b7b65]
1000 _dispatch_client_callout + 8 (libdispatch.dylib + 12339) [0x7ff8076b5033]
1000 <patched truncated backtrace>
1000 close + 10 (libsystem_kernel.dylib + 8134) [0x7ff807817fc6]
*1000 hndl_unix_scall64 + 22 (kernel + 1265046) [0xffffff8000410d96]
*1000 unix_syscall64 + 480 (kernel + 7880736) [0xffffff8000a60020]
*1000 sys_close ??? (kernel + 6451958) [0xffffff80009032f6]
*1000 soo_close ??? (kernel + 6448665) [0xffffff8000902619]
*1000 soclose + 169 (kernel + 7055673) [0xffffff8000996939]
*1000 soclose_locked + 226 (kernel + 7052818) [0xffffff8000995e12]
*1000 soflow_db_apply + 146 (kernel + 7307298) [0xffffff80009d4022]
*1000 cfil_sock_udp_is_closed ??? (kernel + 5047815) [0xffffff80007ac607]
*1000 cfil_acquire_sockbuf ??? (kernel + 5015481) [0xffffff80007a47b9]
*1000 _sleep ??? (kernel + 6729239) [0xffffff8000946e17]
*1000 lck_mtx_sleep + 101 (kernel + 1753861) [0xffffff8000488305]
I'm less confident about the nesessionmanager thread, but it could be:
1000 _dispatch_call_block_and_release + 12 (libdispatch.dylib + 7569) [0x7ff8076b3d91]
1000 ??? (nesessionmanager + 220610) [0x10ccb3dc2]
1000 ??? (nesessionmanager + 73682) [0x10cc8ffd2]
1000 _objc_rootDealloc + 62 (libobjc.A.dylib + 40383) [0x7ff8074c3dbf]
1000 objc_destructInstance + 99 (libobjc.A.dylib + 40497) [0x7ff8074c3e31]
1000 object_cxxDestructFromClass(objc_object*, objc_class*) + 83 (libobjc.A.dylib + 68187) [0x7ff8074caa5b]
1000 <patched truncated backtrace>
1000 close + 10 (libsystem_kernel.dylib + 8134) [0x7ff807817fc6]
*1000 hndl_unix_scall64 + 22 (kernel + 1265046) [0xffffff8000410d96]
*1000 unix_syscall64 + 480 (kernel + 7880736) [0xffffff8000a60020]
*1000 sys_close ??? (kernel + 6451958) [0xffffff80009032f6]
*1000 soo_close ??? (kernel + 6448665) [0xffffff8000902619]
*1000 soclose + 169 (kernel + 7055673) [0xffffff8000996939]
*1000 soclose_locked + 1381 (kernel + 7053973) [0xffffff8000996295]
*1000 ??? (kernel + 6534839) [0xffffff80009176b7]
*1000 ??? (kernel + 5586252) [0xffffff800082fd4c]
*1000 sbflush + 88 (kernel + 7105400) [0xffffff80009a2b78]
*1000 soflow_db_apply + 146 (kernel + 7307298) [0xffffff80009d4022]
*1000 cfil_sock_udp_buf_update_per_flow ??? (kernel + 5050092) [0xffffff80007aceec]
*1000 cfil_acquire_sockbuf ??? (kernel + 5015481) [0xffffff80007a47b9]
*1000 _sleep ??? (kernel + 6729239) [0xffffff8000946e17]
*1000 lck_mtx_sleep + 101 (kernel + 1753861) [0xffffff8000488305]
In all cases, there seems to be a call to cfil_acquire_sockbuf, stuck in a permanent wait. Short of a kernel debugger, which I've never used, I don't know how I could get a better idea of what is holding the lock on the socket buffer and why
Any ideas are welcome