DispatchIO crashes, tips on debugging?

BUG IN CLIENT OF LIBDISPATCH: Unexpected EV_VANISHED (do not destroy random mach ports or file descriptors)

Which, ok, clear: somehow a file descriptor is being closed before DispatchIO.close() is called, yes?

Only I can't figure out where it is being closed. I am currently using change_fdguard_np() to prevent closes anywhere else, and every single place where I call Darwin.close() is preceded by another call to change_fdguard_npand thenDispatchIO.close()`. eg

            self.unguardSocket()
            self.readDispatcher?.close()
            Darwin.close(self.socket)
            self.socket = -1
            self.completion(self)
Answered by DTS Engineer in 797727022

Create a socketpair Create a DispatchIO using socket[0] Create a FileHandle using socket[1] Do work When done, DispatchIO.close() and FileHandle.close() Also close the sockets using Darwin.close().

Did you immediately issue the "Darwin.close"? If so, then I believe that's a programmatic error. This isn't very clear in the swift documentation, but "close" doesn't mean that the system is done with your fd. That doesn't occur until your cleanup handler is called:

"If the existing channel is associated with a file descriptor, the system maintains control over the file descriptor until the new channel is also closed, an error occurs on the file descriptor, or all references to channels tied to that file descriptor are released. When the file descriptor is released, the cleanup_handler block is enqueued on the specified queue and the system relinquishes control over the file descriptor."

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Can you attach crash logs? I'd like to get a better picture of what's actually going on before I offer any suggestions.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

The crashed thread is

Thread 5 Crashed:
0   libdispatch.dylib             	    0x7ff80af41e5f _dispatch_source_merge_evt.cold.1 + 24
1   libdispatch.dylib             	    0x7ff80af25bb1 _dispatch_source_merge_evt + 149
2   libdispatch.dylib             	    0x7ff80af2f9b6 _dispatch_event_loop_merge + 112
3   libdispatch.dylib             	    0x7ff80af21db3 _dispatch_workloop_worker_thread + 438
4   libsystem_pthread.dylib       	    0x7ff80b0c5fd0 _pthread_wqthread + 326
5   libsystem_pthread.dylib       	    0x7ff80b0c4f57 start_wqthread + 15

I've attached a trimmed & edited copy of the full crash log (but not the entire .ips file).

-------------------------------------
Translated Report (Full Report Below)
-------------------------------------

Process:               com.kithrup.SimpleTPP.Redirector [34586]
Path:                  /Library/SystemExtensions/*/com.kithrup.SimpleTPP.Redirector
Identifier:            com.kithrup.SimpleTPP.Redirector
Version:               1.0.0 (1.0.0)
Code Type:             X86-64 (Native)
Parent Process:        launchd [1]
User ID:               0

Date/Time:             2024-07-26 12:23:29.9393 +0100
OS Version:            macOS 12.7.4 (21H1123)
Report Version:        12
Bridge OS Version:     3.0 (14Y910)
Anonymous UUID:        0732B8EB-713E-4F76-3630-0B849B792050

Sleep/Wake UUID:       7739B226-8950-42F0-A242-CCFA408927EA

Time Awake Since Boot: 7400000 seconds
Time Since Wake:       1887504 seconds

System Integrity Protection: disabled

Crashed Thread:        5

Exception Type:        EXC_BAD_INSTRUCTION (SIGILL)
Exception Codes:       0x0000000000000001, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Termination Reason:    Namespace SIGNAL, Code 4 Illegal instruction: 4
Terminating Process:   exc handler [34586]

Application Specific Information:
BUG IN CLIENT OF LIBDISPATCH: Unexpected EV_VANISHED (do not destroy random mach ports or file descriptors)
Abort Cause 52


Thread 0:
0   libsystem_kernel.dylib        	    0x7ff80b092192 __sigsuspend_nocancel + 10
1   libdispatch.dylib             	    0x7ff80af2281b _dispatch_sigsuspend + 36
2   libdispatch.dylib             	    0x7ff80af227f7 _dispatch_sig_thread + 49

Thread 1::  Dispatch queue: com.apple.NSXPCConnection.user.63JU25B8Q7.com.kithrup.SimpleTPP.Redirector.34575
0   libsystem_platform.dylib      	    0x7ff80b0dc639 _platform_bzero$VARIANT$Haswell + 57
1   libsystem_malloc.dylib        	    0x7ff80aee376a nanov2_allocate + 373
2   libsystem_malloc.dylib        	    0x7ff80aee568f nanov2_calloc + 126
3   libsystem_malloc.dylib        	    0x7ff80aeffb75 _malloc_zone_calloc + 60
4   libobjc.A.dylib               	    0x7ff80af61631 _objc_rootAllocWithZone + 49
5   CoreFoundation                	    0x7ff80b1724e6 +[NSInvocation _invocationWithMethodSignature:frame:] + 98
6   CoreFoundation                	    0x7ff80b1fb390 __block_forwarding___ + 120
7   CoreFoundation                	    0x7ff80b1fb301 __forwarding_prep_b___ + 113
8   com.kithrup.SimpleTPP.Redirector	       0x105c19c69 $s10ObjectiveC8ObjCBoolVIyBy_SbIegy_TR + 12 [inlined]
9   com.kithrup.SimpleTPP.Redirector	       0x105c19c69 $s40com.kithrup.SimpleTPP_Redirector0F0C10getEnabledyyySbXEF + 15 (Redirector.swift:776) [inlined]
10  com.kithrup.SimpleTPP.Redirector	       0x105c19c69 @objc Redirector.getEnabled(_:) + 73 (:775)
11  Foundation                    	    0x7ff80c04d018 __NSXPCCONNECTION_IS_CALLING_OUT_TO_EXPORTED_OBJECT_S1__ + 10
12  Foundation                    	    0x7ff80bff76e8 -[NSXPCConnection _decodeAndInvokeMessageWithEvent:flags:] + 2347
13  Foundation                    	    0x7ff80bfae2ce message_handler + 206
14  libxpc.dylib                  	    0x7ff80ae07ae0 _xpc_connection_call_event_handler + 56
15  libxpc.dylib                  	    0x7ff80ae068c6 _xpc_connection_mach_event + 1413
16  libdispatch.dylib             	    0x7ff80af113b1 _dispatch_client_callout4 + 9
17  libdispatch.dylib             	    0x7ff80af2a041 _dispatch_mach_msg_invoke + 445
18  libdispatch.dylib             	    0x7ff80af171cd _dispatch_lane_serial_drain + 342
19  libdispatch.dylib             	    0x7ff80af2ab77 _dispatch_mach_invoke + 484
20  libdispatch.dylib             	    0x7ff80af171cd _dispatch_lane_serial_drain + 342
21  libdispatch.dylib             	    0x7ff80af17e30 _dispatch_lane_invoke + 417
22  libdispatch.dylib             	    0x7ff80af21eee _dispatch_workloop_worker_thread + 753
23  libsystem_pthread.dylib       	    0x7ff80b0c5fd0 _pthread_wqthread + 326
24  libsystem_pthread.dylib       	    0x7ff80b0c4f57 start_wqthread + 15

Thread 2::  Dispatch queue: flow director queue
0   libsystem_kernel.dylib        	    0x7ff80b08dc8e __sysctl + 10
1   libsystem_c.dylib             	    0x7ff80af99e2d sysctl + 129
2   libsystem_info.dylib          	    0x7ff80b0ebafe getifaddrs + 78
3   libsystem_info.dylib          	    0x7ff80b0edb57 if_indextoname + 24
4   NetworkExtension              	    0x7ff8199c8857 __NEFlowCopyDescription + 175
5   CoreFoundation                	    0x7ff80b15191c CFCopyDescription + 144
6   CoreFoundation                	    0x7ff80b18ba1d -[__NSCFType description] + 9
7   Foundation                    	    0x7ff80bfae848 _NS_os_log_callback + 273
8   libsystem_trace.dylib         	    0x7ff80ae3f085 _os_log_fmt_flatten_NSCF + 42
9   libsystem_trace.dylib         	    0x7ff80ae3e95d _os_log_fmt_flatten_object + 215
10  libsystem_trace.dylib         	    0x7ff80ae3c172 _os_log_impl_flatten_and_send + 2243
11  libsystem_trace.dylib         	    0x7ff80ae3b886 _os_log + 128
12  libsystem_trace.dylib         	    0x7ff80ae3b7fa _os_log_impl + 21
13  NetworkExtension              	    0x7ff8199cb063 __flow_director_create_io_handler_block_invoke + 5773
14  NetworkExtension              	    0x7ff8199c7a29 flow_director_ctl_source_do_read + 233
15  libdispatch.dylib             	    0x7ff80af11317 _dispatch_client_callout + 8
16  libdispatch.dylib             	    0x7ff80af13d7c _dispatch_continuation_pop + 453
17  libdispatch.dylib             	    0x7ff80af25208 _dispatch_source_invoke + 2179
18  libdispatch.dylib             	    0x7ff80af171cd _dispatch_lane_serial_drain + 342
19  libdispatch.dylib             	    0x7ff80af17dfd _dispatch_lane_invoke + 366
20  libdispatch.dylib             	    0x7ff80af21eee _dispatch_workloop_worker_thread + 753
21  libsystem_pthread.dylib       	    0x7ff80b0c5fd0 _pthread_wqthread + 326
22  libsystem_pthread.dylib       	    0x7ff80b0c4f57 start_wqthread + 15

Thread 3::  Dispatch queue: com.apple.libdispatch-io.opq
0   libsystem_kernel.dylib        	    0x7ff80b090e96 __sendmsg + 10
1   NetworkExtension              	    0x7ff8199c8c91 flow_director_ctl_write + 128
2   NetworkExtension              	    0x7ff8199c8bf6 flow_director_msg_send + 71
3   NetworkExtension              	    0x7ff8199c8b64 flow_director_send_close + 170
4   NetworkExtension              	    0x7ff8199c93ed flow_write_close + 616
5   NetworkExtension              	    0x7ff8199c5ca1 flow_close + 163
6   NetworkExtension              	    0x7ff8199667ff -[NEAppProxyFlow closeWriteWithError:] + 129
7   com.kithrup.SimpleTPP.Redirector	       0x105c066eb FlowData.closeWithError(_:) + 139 (FlowData.swift:209)
8   com.kithrup.SimpleTPP.Redirector	       0x105c0786b TCPFlowData.closeWithError(_:) + 235 (FlowData.swift:282)
9   com.kithrup.SimpleTPP.Redirector	       0x105c09700 TCPFlowData.handleAppData(_:_:_:) + 1328 (FlowData.swift:421)
10  libswiftDispatch.dylib        	    0x7ff81dccc7c9 partial apply for closure #1 in OS_dispatch_io.read(offset:length:queue:ioHandler:) + 57
11  libswiftDispatch.dylib        	    0x7ff81dccc81f thunk for @escaping @callee_guaranteed (@unowned Bool, @guaranteed OS_dispatch_data?, @unowned Int32) -> () + 63
12  libdispatch.dylib             	    0x7ff80af3bce1 ___dispatch_operation_deliver_data_block_invoke + 109
13  libdispatch.dylib             	    0x7ff80af100cc _dispatch_call_block_and_release + 12
14  libdispatch.dylib             	    0x7ff80af11317 _dispatch_client_callout + 8
15  libdispatch.dylib             	    0x7ff80af17317 _dispatch_lane_serial_drain + 672
16  libdispatch.dylib             	    0x7ff80af17e30 _dispatch_lane_invoke + 417
17  libdispatch.dylib             	    0x7ff80af171cd _dispatch_lane_serial_drain + 342
18  libdispatch.dylib             	    0x7ff80af17dfd _dispatch_lane_invoke + 366
19  libdispatch.dylib             	    0x7ff80af21eee _dispatch_workloop_worker_thread + 753
20  libsystem_pthread.dylib       	    0x7ff80b0c5fd0 _pthread_wqthread + 326
21  libsystem_pthread.dylib       	    0x7ff80b0c4f57 start_wqthread + 15

Thread 4::  Dispatch queue: NEFlow queue
0   libswiftCore.dylib            	    0x7ff818671818 String.init<a>(describing:) + 40
1   com.kithrup.SimpleTPP.Redirector	       0x105c05ff8 FlowData.typeName.getter + 40 (FlowData.swift:177)
2   com.kithrup.SimpleTPP.Redirector	       0x105c08fa8 TCPFlowData.posixWrite(_:) + 168 (FlowData.swift:362)
3   com.kithrup.SimpleTPP.Redirector	       0x105c09d91 closure #1 in TCPFlowData.readAppData() + 897 (FlowData.swift:440)
4   com.kithrup.SimpleTPP.Redirector	       0x105c0a294 thunk for @escaping @callee_guaranteed (@guaranteed Data?, @guaranteed Error?) -> () + 132
5   NetworkExtension              	    0x7ff819966ecf __51-[NEAppProxyTCPFlow readDataWithCompletionHandler:]_block_invoke + 107
6   NetworkExtension              	    0x7ff8199c976a __flow_call_stream_read_handler_block_invoke + 24
7   libdispatch.dylib             	    0x7ff80af100cc _dispatch_call_block_and_release + 12
8   libdispatch.dylib             	    0x7ff80af11317 _dispatch_client_callout + 8
9   libdispatch.dylib             	    0x7ff80af17317 _dispatch_lane_serial_drain + 672
10  libdispatch.dylib             	    0x7ff80af17dfd _dispatch_lane_invoke + 366
11  libdispatch.dylib             	    0x7ff80af21eee _dispatch_workloop_worker_thread + 753
12  libsystem_pthread.dylib       	    0x7ff80b0c5fd0 _pthread_wqthread + 326
13  libsystem_pthread.dylib       	    0x7ff80b0c4f57 start_wqthread + 15

Thread 5 Crashed:
0   libdispatch.dylib             	    0x7ff80af41e5f _dispatch_source_merge_evt.cold.1 + 24
1   libdispatch.dylib             	    0x7ff80af25bb1 _dispatch_source_merge_evt + 149
2   libdispatch.dylib             	    0x7ff80af2f9b6 _dispatch_event_loop_merge + 112
3   libdispatch.dylib             	    0x7ff80af21db3 _dispatch_workloop_worker_thread + 438
4   libsystem_pthread.dylib       	    0x7ff80b0c5fd0 _pthread_wqthread + 326
5   libsystem_pthread.dylib       	    0x7ff80b0c4f57 start_wqthread + 15

Thread 6::  Dispatch queue: com.kithrup.SimpleTPP.Redirector.tcpFlowData.1C58D602-517E-4128-A610-46786DE4BB56.read
0   libsystem_malloc.dylib        	    0x7ff80aee364f nanov2_allocate + 90
1   libsystem_malloc.dylib        	    0x7ff80aee35a4 nanov2_malloc + 56
2   libsystem_malloc.dylib        	    0x7ff80aeffabb _malloc_zone_malloc + 125
3   libswiftCore.dylib            	    0x7ff81887f5e8 swift_slowAlloc + 40
4   libswiftCore.dylib            	    0x7ff81887f757 swift_allocObject + 39
5   libswiftCore.dylib            	    0x7ff8186d5463 Mirror.init(internalReflecting:subjectType:customAncestor:) + 323
6   libswiftCore.dylib            	    0x7ff8186077d4 Mirror.init(reflecting:) + 356
7   libswiftCore.dylib            	    0x7ff8186723a8 String.init</a><a>(describing:) + 3000
8   com.kithrup.SimpleTPP.Redirector	       0x105c05ff8 FlowData.typeName.getter + 40 (FlowData.swift:177)
9   com.kithrup.SimpleTPP.Redirector	       0x105c09891 TCPFlowData.readAppData() + 81 (FlowData.swift:426)
10  com.kithrup.SimpleTPP.Redirector	       0x105c08ee9 thunk for @escaping @callee_guaranteed () -> () + 25
11  libdispatch.dylib             	    0x7ff80af100cc _dispatch_call_block_and_release + 12
12  libdispatch.dylib             	    0x7ff80af11317 _dispatch_client_callout + 8
13  libdispatch.dylib             	    0x7ff80af17317 _dispatch_lane_serial_drain + 672
14  libdispatch.dylib             	    0x7ff80af17dfd _dispatch_lane_invoke + 366
15  libdispatch.dylib             	    0x7ff80af21eee _dispatch_workloop_worker_thread + 753
16  libsystem_pthread.dylib       	    0x7ff80b0c5fd0 _pthread_wqthread + 326
17  libsystem_pthread.dylib       	    0x7ff80b0c4f57 start_wqthread + 15

Thread 7:
0   libsystem_pthread.dylib       	    0x7ff80b0c4f48 start_wqthread + 0

Thread 8:
0   libsystem_pthread.dylib       	    0x7ff80b0c4f48 start_wqthread + 0


</a>

So, yay, I can reproduce it in my test code. And I used fdguard (with CLOSE not DUP), and never unguard the file descriptors, and it still gets the EV_VANISH crash.

The basic flow of the code is:

  • Create a socketpair
  • Create a DispatchIO using socket[0]
  • Create a FileHandle using socket[1]
  • Do work
  • When done, DispatchIO.close() and FileHandle.close()
  • Also close the sockets using Darwin.close().
  • CRASH!

The problematical one seems to be the file descriptor for the DispatchIO.

Accepted Answer

Create a socketpair Create a DispatchIO using socket[0] Create a FileHandle using socket[1] Do work When done, DispatchIO.close() and FileHandle.close() Also close the sockets using Darwin.close().

Did you immediately issue the "Darwin.close"? If so, then I believe that's a programmatic error. This isn't very clear in the swift documentation, but "close" doesn't mean that the system is done with your fd. That doesn't occur until your cleanup handler is called:

"If the existing channel is associated with a file descriptor, the system maintains control over the file descriptor until the new channel is also closed, an error occurs on the file descriptor, or all references to channels tied to that file descriptor are released. When the file descriptor is released, the cleanup_handler block is enqueued on the specified queue and the system relinquishes control over the file descriptor."

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Ok, I am going to say that is horribly unclear then -- I read it completely differently, and "the file descriptor should not be closed by the application until the cleanup handler is called" would have made all the difference... 😄

And now I can't get it to call the cleanup handler at all, sigh.

Ok, using that information, I was able to reproduce it in a standalone program, which helped me track it down. (And then I had to deal with the cleanupHandler not being called, but I found that out and got it done.)

Now I'm sometimes getting a crash:

fileHandle?.readabilityHandler = { fh in
    let data = try fh.availableData
    print("Got \(data.count) bytes")
}
*** Terminating app due to uncaught exception 'NSFileHandleOperationException', reason: '*** -[NSConcreteFileHandle availableData]: Bad file descriptor'

This only happens sometimes, in my tests.

Now I'm sometimes getting a crash:

Sigh... I suspected you might run into issue here next, but I didn't have time to write up the full set of issues here.

SO, the first question I have here is "What are you actually trying to do?". Frankly, choosing to use both FileHandle and DispatchIO for "parallel" operation a bit odd and, using both correctly is MUCH trickier than it might appear. The issue here is that while the two APIs provide similar functionality, they were designed and built at totally different time (>15 years apart) with totally different expectations (for example, run loops vs GCD queue). While it's possible to use both of them, actually using both me them correctly in the same context can be much trickier than it might seem and would require a fairly deep knowledge of the exact implementation of both.

Finally, and this is probably the biggest concern, why are you "bothering" with both? Using both APIs means that your creating two completely different implementation to solve the same underlying issue, opening up yourself to a lot more bugs without a lot of "upside".

In terms of the specific issue here:

*** Terminating app due to uncaught exception 'NSFileHandleOperationException', reason: '*** -[NSConcreteFileHandle availableData]: Bad file descriptor'

NSFileHandle has requirements as DispatchIO ("you can't close the fd until the API is completely done with it"), except it's implementation makes that trickier to implement. In particular:

-The API is inherently "racy". You can't coordinate your own close with FileHandle internals, so it's entirely possible to call "close" while your readabilityHandler is (or is about to be) being called.

-You can tell FileHandle not to dealloc the descriptor, but that doesn't actually make things "safer". If YOU close the fd before FileHandle is completely "done", then you can create exactly this crash. Practically speaking, I think using this option would only really make sense if there was some other architecture in place that meant you weren't toing to close the fd until LONG after FileHandle was "gone".

-Note that "close after the FileHandle is gone" can be much harder than it sounds, particularly if GCD is added into the mix. In particular, Foundation (and AppKit/UIKit) routinely autorelease objects and that means they regularly "leak" object references into GCD queues, meaning you no longer control the objects lifetime.

In concrete terms, with code like this:

queue.async {
	let handle = FileHandle(fileDescriptor: fd)

}

...the assumption is that "handle" will be destroyed at the end of the block is NOT necessarily true. FileHandle autoreleases, which means it won't actually be destroyed until some "later" point when the queue drains it's own autorelease pool.

If you're going to use FileHandle, then my recommendation would be that you pass ownership of the fd over to it and let it close it, at which point all you need to do is be sure that the FileHandle object itself isn't leaked. However, my actual recommendation is that you pick which API you like better (presumably, DispatchIO) and ONLY use that API.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Most this particular session has been in order to come up with a way to have tests for our transparent proxy provider that can be run without using the main extension -- so I rolled the networking code out, wrote a mimic of NEAppProxyFlow, and then wrote tests for it. It uses socket I/O to communicate with each side ("app" and "TPP"). Internally, it uses DispachIO to handle data sent from the "app", while it exposes the "TPP" side as a FileHandle (and it does that so that I could, if I needed to, pass it over XPC).

This isn't the only way I could have implemented it, admittedly, but it grew out of my first approach, which was to use kqueue.

(Oh, and I got the FileHandle issue solved -- the class behaves super badly if it gets closed more than once, so I isolated that code into a function, turned the object into an optional, and set it to nil after closing it.)

Since I don't think I'll end up using XPC to pass the FileHandle around, I can probably turn it all into DispatchIO. I hadn't primarily because I kept getting crashes, and/or leaked file descriptors, but now that I can get the DispatchIO part working without crashes, it's time to revisit it.

Most this particular session has been in order to come up with a way to have tests for our transparent proxy provider that can be run without using the main extension -- so I rolled the networking code out, wrote a mimic of NEAppProxyFlow, and then wrote tests for it.

OK, that makes me feel a lot better about all of this. Most of the issues you'd have here (beyond, "it's crashing") are caused by complicated differences and interactions between dispatch queues and run loops. Those tend to cause issues with extended/unexpected object lifetimes and/or leaks, which can be a huge problem in long running apps, but much less of an issue in shorter running tests that you're not going to sell to anyone.

It uses socket I/O to communicate with each side ("app" and "TPP"). Internally, it uses DispachIO to handle data sent from the "app", while it exposes the "TPP" side as a FileHandle (and it does that so that I could, if I needed to, pass it over XPC).

FYI, you can also pass dispatch objects across XPC but it would be more complicated than using FileHandle/NSXPCConnection integration. This is another case of the kind of shortcut that's reasonable in a test tool but probably a bad idea in shipping code.

(Oh, and I got the FileHandle issue solved -- the class behaves super badly if it gets closed more than once, so I isolated that code into a function, turned the object into an optional, and set it to nil after closing it.)

Yep, I can see that. Again, keep in mind that FileHandle is ANCIENT- it came from NextStep, so it's older than macOS X. A lof of things have changed* since then, but it's also the kind of class where changing any of it's implementation details can be risky.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Yeah, based on all of this, I think I'm going to take a stab at refactoring all of it to just use DispatchIO. If I end up choosing to pass it over XPC, I could just use the FileHandle and turn it into a DispatchIO on the other side.

DispatchIO crashes, tips on debugging?
 
 
Q