DriverKit. Plug/unplug test leads to macOS panic

Dear Apple engineers,
We have developed a DriverKit (DEXT) driver for an HBA RAID controller.
The RAID controller is connected to hosts through Thunderbolt (PCIe port of the Thunderbolt controller).
We run plug/unplug tests to verify the driver. The test always fails within about 100 cycles with a macOS crash (panic).

The panic log contains:
“LLC Bus error (Unavailable) from cpu0: FAR=0xa40100008 LLC_ERR_STS/ADR/INF=0x80/0x300480a40100008/0x1400000005 addr=0xa40100008 cmd=0x18(ACC_CIFL2C_CMD_RD_LD: request for load miss in E or S state)”

At first we assumed that the issue is with hardware. But we ran this test on different hosts (Mac mini M3 and M4) with different units of our device.
The error points to the same physical address FAR=0xa40100008 even if the hosts are different.

The two full panic logs are attached (one for the M4 host, one for the M3 host).

Could you share your understanding of the crash and give any hints on how we can fix it?

Please let us know if you need any additional data. Thank you

M3 panic: https://drive.google.com/file/d/1GJXd3tTW6ajdrHpFsJxO_tWWYKYIgcMc/view?usp=share_link

M4 panic: https://drive.google.com/file/d/1SU-3aBSdhLsyhhxsLknzw9wGvBQ9TbJC/view?usp=share_link

Could you share your understanding of the crash and give any hints on how we can fix it?

So, let me actually start by commenting on this:

At first, we assumed that the issue is with hardware.

The first thing to understand here is that DEXTs are FULLY capable of panicking the kernel and probably always will be, particularly PCI DEXTs. The main benefit DEXTs provide is that they DRAMATICALLY improve overall system security and reduce risk by constraining the "range" of what it's POSSIBLE for a component to do. Your DEXT only has access to a very limited set of kernel data, so that's the ONLY kernel data your DEXT interacts with. It's possible for a network DEXT to disrupt the network stack, but it's very difficult to see how it would disrupt the file system.

However, your DEXT is still being given access to many of the same resources it would have access to as a KEXT, and many of those resources are inherently dangerous. In the case of the PCI family, that issue is quite direct: I don't know of any way to build a "safe" API that allows for the direct manipulation of physical memory bus addresses and DMA.

Shifting to the panic logs:

Please let us know if you need any additional data. Thank you

For reference, this forum thread outlines how to symbolicate our modern kernel panic format. The process is a bit laborious, but it will ultimately give you a stack trace for every thread in the system at the point you panic. In any case, if you symbolicate either panic, you'll find that both panics are from your driver:

0	kernel.release.t6041	0xfffffe0008af1e58	panic_trap_to_debugger + 944	(debug.c:1403)
1	kernel.release.t6041	0xfffffe00093f59c8	panic + 60	(debug.c:1159)
2	kernel.release.t6041	0xfffffe0008c8f334	generic_platform_error_handler + 2220	(generic_platform_error_handler.c:803)
3	kernel.release.t6041	0xfffffe0008c65de4	sleh_synchronous + 412	(sleh.c:1442)
4	kernel.release.t6041	0xfffffe0008aa3d48	fleh_synchronous + 72	
5	[2, 0]
User Frames
0	PCIDriverKit	0x18018f798	IOPCIDevice::MemoryRead32(unsigned char, unsigned long long, unsigned int*, unsigned int) + 96	(IOPCIDevice.cpp:295)
1	[305, 70596]
2	[305, 46508]
3	[305, 16428]
4	[305, 27460]
5	[305, 24932]
6	DriverKit	0x1800731e8	IOTimerDispatchSource::TimerOccurred_Invoke(IORPC, OSMetaClassBase*, void (*)(OSMetaClassBase*, OSAction*, unsigned long long), OSMetaClass const*) + 152	(IOTimerDispatchSource.iig.cpp:845)
7	[305, 83236]
8	[305, 81620]
9	DriverKit	0x1800490cc	OSMetaClassBase::Invoke(IORPC) + 772	(uioserver.cpp:1614)
10	DriverKit	0x18006a700	IODispatchSource::CheckForWork(bool, int (*)(OSMetaClassBase*, IORPC)) + 316	(IODispatchSource.iig.cpp:631)
11	DriverKit	0x18004dffc	invocation function for block in IOTimerDispatchSource::Create_Impl(IODispatchQueue*, IOTimerDispatchSource**) + 192	(uioserver.cpp:4159)
12	libdispatch.dylib	0x180ad4c48	_dispatch_continuation_pop + 600	(queue.c:349)
13	libdispatch.dylib	0x180ae7c84	_dispatch_source_invoke + 2712	(source.c:966)
14	libdispatch.dylib	0x180ad8f44	_dispatch_lane_serial_drain + 336	(queue.c:3991)
15	libdispatch.dylib	0x180ad9bf0	_dispatch_lane_invoke + 440	(queue.c:4082)
16	libdispatch.dylib	0x180adaf0c	_dispatch_workloop_invoke + 1624	(queue.c:4761)
17	libdispatch.dylib	0x180ae42b8	_dispatch_root_queue_drain_deferred_wlh + 292	(queue.c:7265)
18	libdispatch.dylib	0x180ae3ba8	_dispatch_workloop_worker_thread + 692	(queue.c:6859)
19	libsystem_pthread.dylib	0x180c6e66c	_pthread_wqthread + 408	(pthread.c:2696)
20	libsystem_pthread.dylib	0x180c756fc	start_wqthread + 8	
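
The bracketed entries in the trace above, such as [305, 70596], are frames that were not symbolicated. As a rough sketch, assuming each pair is an image index followed by a byte offset from that image's load address (the layout the symbolication instructions rely on), the address for each frame could be computed as follows. The names `RawFrame`, `parseRawFrame`, and `frameAddress` are hypothetical helpers, not part of any Apple tool, and the load address has to come from the panic report's image list.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <optional>
#include <string>

// One unsymbolicated frame as printed in the trace: [imageIndex, offset].
struct RawFrame {
    unsigned imageIndex;   // assumed: index into the report's binary image list
    std::uint64_t offset;  // assumed: byte offset from that image's load address
};

// Parses a line such as "1\t[305, 70596]".
// Returns std::nullopt for lines that are already symbolicated.
std::optional<RawFrame> parseRawFrame(const std::string& line) {
    unsigned frameNumber = 0;
    unsigned image = 0;
    unsigned long long offset = 0;
    // Whitespace in the format matches tabs or spaces in the input.
    if (std::sscanf(line.c_str(), "%u [%u, %llu]", &frameNumber, &image, &offset) == 3) {
        return RawFrame{image, static_cast<std::uint64_t>(offset)};
    }
    return std::nullopt;
}

// Address to hand to a symbolication tool, given the image's load address.
std::uint64_t frameAddress(std::uint64_t loadAddress, const RawFrame& frame) {
    return loadAddress + frame.offset;
}
```

With the DEXT's load address and its dSYM available, the resulting address is the kind of value that `atos -o <binary> -l <load address> <address>` expects.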

I don't have the symbol data necessary to symbolicate your DEXT's frames, but you can do it using the instructions I referenced earlier. That leads to here:

The error points to the same physical address FAR=0xa40100008 even if the hosts are different.

My guess is that you've got some kind of memory corruption issue in your DEXT, which is then leading to the same physical address being passed into MemoryRead32. The invalid address then panics the kernel.
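
One way to narrow down where a bad address originates is to validate every register access before it reaches MemoryRead32. The sketch below is only an illustration under stated assumptions: `MockBar` and `GuardedDevice` are hypothetical types standing in for a mapped BAR and the driver's device state, the `stopping` flag assumes the driver sets it when teardown begins, and a real DEXT would call IOPCIDevice::MemoryRead32 where the mock indexes a vector. It is not a confirmed fix for this panic.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one mapped BAR. A real DEXT would call
// IOPCIDevice::MemoryRead32 instead of indexing a vector.
struct MockBar {
    std::vector<std::uint32_t> regs;
    std::uint64_t sizeBytes() const { return regs.size() * sizeof(std::uint32_t); }
};

struct GuardedDevice {
    MockBar bar;
    std::atomic<bool> stopping{false};  // assumed to be set when teardown begins

    // Returns false instead of touching the device when the request looks wrong.
    bool read32(std::uint64_t offset, std::uint32_t* out) const {
        if (out == nullptr) return false;
        if (stopping.load(std::memory_order_acquire)) return false;  // device going away
        if (offset % sizeof(std::uint32_t) != 0) return false;       // misaligned offset
        const std::uint64_t size = bar.sizeBytes();
        if (offset > size || size - offset < sizeof(std::uint32_t)) return false;  // out of range
        *out = bar.regs[offset / sizeof(std::uint32_t)];
        return true;
    }
};
```

Logging the rejected offsets from a timer handler like the one in the trace would show whether the driver is asking for an address outside the BAR before the kernel ever sees it.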

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
