Dear Apple engineers and DriverKit developers,
We have developed a DriverKit (DEXT) driver for an HBA RAID controller.
The RAID controller is connected to hosts through Thunderbolt (PCIe port of the Thunderbolt controller).
We use an IO script to verify the developed driver. The test fails after a few (10-12) hours of running with an error:
“BUG IN CLIENT OF LIBDISPATCH: dispatch_sync called on queue already owned by current thread”.
We inspected the stack trace of the crash report. This error happens in the interrupt handler.
Thread 5 Crashed:
0 libdispatch.dylib 0x19671aa8c __DISPATCH_WAIT_FOR_QUEUE__ + 484
1 libdispatch.dylib 0x19671a5d0 _dispatch_sync_f_slow + 152
2 DriverKit 0x195d3fc1c IODispatchQueue::DispatchSync_f(void*, void (*)(void*)) + 296
3 DriverKit 0x195d40860 IOInterruptDispatchSourceThread(void*) + 380
4 libsystem_pthread.dylib 0x1968a3738 _pthread_start + 140
5 libsystem_pthread.dylib 0x1968ac6c8 thread_start + 8
On our side we created 5 DispatchQueue(s) for the interrupt processing and configured 5 interrupt handlers using ConfigureInterrupts(kIOInterruptTypePCIMessagedX, 1, 5). It gives 5 interrupts, as requested and it is not clear what is the reason for the issue.
Our code samples are below
uint32_t configureInterrupts(uint32_t requested, uint32_t required)
{
const kern_return_t ret = ioPCIdevice->ConfigureInterrupts(kIOInterruptTypePCIMessagedX, required, requested);
if (ret != kIOReturnSuccess)
return 0;
uint64_t interruptType = 0;
uint32_t interruptsCount = 0;
uint32_t interruptIndex = 0;
for ( ;; ++interruptIndex) {
if (IOInterruptDispatchSource::GetInterruptType(ioPCIdevice, interruptIndex, &interruptType) != kIOReturnSuccess)
break;
if ((interruptType & kIOInterruptTypePCIMessagedX) == 0)
continue;
++interruptsCount;
}
return interruptsCount;
}
.....
// Create DQs
for(int i = 0; i < maxInterrupts; ++i) {
ret = IODispatchQueue::Create(INTERRUPT_DQ_NAME, 0, 0, &ivars->interruptQueue[i]);
if (kIOReturnSuccess != ret || nullptr == ivars->interruptQueue[i]) {
GH_PRINT_ERR("Interrupts queue %d creation failed with error %d", i, ret);
return false;
}
}
.....
// Link DQ with interrupt
for(int index = 0; index < maxInterrupts; ++ index) {
kern_return_t ret = CreateActionHandleInterruptRequest(size, &ivars->interruptActions[index]);
if (kIOReturnSuccess != ret) {
GH_PRINT_ERR("Create action for interrupt handler %u failed.", index);
return ;
}
ret = IOInterruptDispatchSource::Create(ivars->PCI_io.dev, index, ivars->interruptQueue[index], &ivars->interruptSources[index]);
if (kIOReturnSuccess != ret || nullptr == ivars->interruptSources[index]) {
GH_PRINT_ERR("Creating interrupt source %u failed for interrupt index %u.", index, index);
return ;
}
ret = ivars->interruptSources[index]->SetHandler(ivars->interruptActions[index]);
if (kIOReturnSuccess != ret) {
GH_PRINT_ERR("Setting the handler for interrupt source %u failed.", index);
return ;
}
ivars->contexts[index] = ivars->interruptActions[index]->GetReference();
}
.....
// definition for interrupt handler
void HandleInterruptRequest ( OSAction *action, uint64_t count, uint64_t time) TYPE (IOInterruptDispatchSource::InterruptOccurred);
Do you have any clue how we can fix this error? Or directions and ways for investigation?
Please let us know if you need more details.
There are no DispatchSync calls in our code. Here is the SCMD_DONE:
Actually, there is, it's just implicit. Correlating your code with the crash log:
Thread 8:
...
2 SCSIControllerDriverKit 0x195e8b364 IOUserSCSIParallelInterfaceController::ParallelTaskCompletion(OSAction*, SCSIUserParallelResponse, int (*)(OSMetaClassBase*, IORPC)) + 156
3 com.mycompany.driverkit.MyDevice 0x1042a6a20 invocation function for block in OsBridge::SCMD_DONE(void*, megasas_cmd_fusion*, SCSIUserParallelResponse*, megasas_cmd_fusion*)
You're here:
ghc->ivars->sentQueue->DispatchAsync(^{
-> ghc->ParallelTaskCompletion(act, lResp);
I don't know enough about the exact details of how you've configured your queues to be certain about what's going on, but I suspect what's going on is that ParallelTaskCompletion is implicitly targeting the crashing thread:
Thread 5 Crashed:: :
0 libdispatch.dylib 0x19671aa8c __DISPATCH_WAIT_FOR_QUEUE__ + 484
1 libdispatch.dylib 0x19671a5d0 _dispatch_sync_f_slow + 152
2 DriverKit 0x195d3fc1c IODispatchQueue::DispatchSync_f(void*, void (*)(void*)) + 296
3 DriverKit 0x195d40860 IOInterruptDispatchSourceThread(void*) + 380
...but I think the code on THAT thread is actually targeting "Thread 8:", creating exactly the crash you're seeing.
In any case, a few different points here:
-
Looking over our own code, I think you could just call "ParallelTaskCompletion" directly (without the DispatchAsync). That's how all of our controllers work; however, our own controllers aren't using multiple queues for interrupt handling, so I'm not sure how that dynamic will play out.
-
What is the relationship between "sentQueue" and the rest of your queue architecture? If you're going to do this, then I think you should be using a dedicated queue* that's used to complete "all" tasks and, in particular, NOT one of your interrupt queues:
*Note that the issue here isn't that it "won't work", it's that inconsistent queue usage opens the door to exactly the kind of bug you're currently struggling with.
On our side, we created 5 DispatchQueue(s) for the interrupt processing.
Expanding on that last point, keep in mind that you're going to end up implicitly serializing ParallelTaskCompletion calls into the kernel (in other words, you can't actually complete two tasks "at the same time"). In terms of maximizing performance, this is why we created the "bundled" command— it lets all of your interrupt handlers fire at the "same"(-ish) time, then complete all of those commands simultaneously.
Lastly, I'm not sure that using separate dispatch queues for this is actually getting you any benefit. I think IOInterruptDispatchSource might be serializing interrupts callbacks inside the kernel, at which point using separate queues is only complicating your implementation.
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware