DriverKit. Data Integrity issue.

Dear Apple engineers and DriverKit developers,
We have developed a DriverKit (DEXT) driver for an HBA RAID controller.
The RAID controller is connected to hosts through Thunderbolt (PCIe port of the Thunderbolt controller).
We use an IO script to verify the developed driver. The test fails after a few (10-12) hours of running with an error:
“BUG IN CLIENT OF LIBDISPATCH: dispatch_sync called on queue already owned by current thread”.

We inspected the stack trace of the crash report. This error happens in the interrupt handler.

Thread 5 Crashed: 

0   libdispatch.dylib                   0x19671aa8c __DISPATCH_WAIT_FOR_QUEUE__ + 484 
1   libdispatch.dylib                   0x19671a5d0 _dispatch_sync_f_slow + 152 
2   DriverKit                           0x195d3fc1c IODispatchQueue::DispatchSync_f(void*, void (*)(void*)) + 296 
3   DriverKit                           0x195d40860 IOInterruptDispatchSourceThread(void*) + 380 
4   libsystem_pthread.dylib             0x1968a3738 _pthread_start + 140 
5   libsystem_pthread.dylib             0x1968ac6c8 thread_start + 8

On our side we created 5 DispatchQueue(s) for the interrupt processing and configured 5 interrupt handlers using ConfigureInterrupts(kIOInterruptTypePCIMessagedX, 1, 5). It gives 5 interrupts, as requested and it is not clear what is the reason for the issue.

Our code samples are below

uint32_t configureInterrupts(uint32_t requested, uint32_t required) 

{ 

    const kern_return_t ret = ioPCIdevice->ConfigureInterrupts(kIOInterruptTypePCIMessagedX, required, requested); 

    if (ret != kIOReturnSuccess) 
        return 0; 

    uint64_t interruptType = 0; 
    uint32_t interruptsCount = 0; 
    uint32_t interruptIndex = 0; 

 
    for ( ;; ++interruptIndex) { 
        if (IOInterruptDispatchSource::GetInterruptType(ioPCIdevice, interruptIndex, &interruptType) != kIOReturnSuccess) 
            break; 

        if ((interruptType & kIOInterruptTypePCIMessagedX) == 0) 
            continue; 

        ++interruptsCount; 
    } 
    return interruptsCount; 
} 
.....


// Create DQs
    for(int i = 0; i < maxInterrupts; ++i) { 
        ret = IODispatchQueue::Create(INTERRUPT_DQ_NAME, 0, 0, &ivars->interruptQueue[i]); 
        if (kIOReturnSuccess != ret || nullptr == ivars->interruptQueue[i]) { 
            GH_PRINT_ERR("Interrupts queue %d creation failed with error %d", i, ret); 
            return false; 
        } 
    } 
.....


// Link DQ with interrupt 
 for(int index = 0; index < maxInterrupts; ++ index) { 
    kern_return_t ret = CreateActionHandleInterruptRequest(size, &ivars->interruptActions[index]); 
    if (kIOReturnSuccess != ret) { 
        GH_PRINT_ERR("Create action for interrupt handler %u failed.", index); 
        return ; 
    } 

    ret = IOInterruptDispatchSource::Create(ivars->PCI_io.dev, index, ivars->interruptQueue[index], &ivars->interruptSources[index]); 
    if (kIOReturnSuccess != ret || nullptr == ivars->interruptSources[index]) { 
        GH_PRINT_ERR("Creating interrupt source %u failed for interrupt index %u.", index, index); 
        return ; 
    } 

    ret = ivars->interruptSources[index]->SetHandler(ivars->interruptActions[index]); 
    if (kIOReturnSuccess != ret) { 
        GH_PRINT_ERR("Setting the handler for interrupt source %u failed.", index); 
        return ; 
    } 
    ivars->contexts[index] = ivars->interruptActions[index]->GetReference();     
} 
.....
 

// definition for interrupt handler
void    HandleInterruptRequest ( OSAction *action, uint64_t count, uint64_t time) TYPE (IOInterruptDispatchSource::InterruptOccurred); 

 

Do you have any clue how we can fix this error? Or directions and ways for investigation?
Please let us know if you need more details.

Answered by DTS Engineer in 861679022

There are no DispatchSync calls in our code. Here is the SCMD_DONE:

Actually, there is, it's just implicit. Correlating your code with the crash log:

Thread 8:
...
2   SCSIControllerDriverKit         	       0x195e8b364 IOUserSCSIParallelInterfaceController::ParallelTaskCompletion(OSAction*, SCSIUserParallelResponse, int (*)(OSMetaClassBase*, IORPC)) + 156 
3   com.mycompany.driverkit.MyDevice	       0x1042a6a20 invocation function for block in OsBridge::SCMD_DONE(void*, megasas_cmd_fusion*, SCSIUserParallelResponse*, megasas_cmd_fusion*)

You're here:

ghc->ivars->sentQueue->DispatchAsync(^{

->    ghc->ParallelTaskCompletion(act, lResp);

I don't know enough about the exact details of how you've configured your queues to be certain about what's going on, but I suspect what's going on is that ParallelTaskCompletion is implicitly targeting the crashing thread:

Thread 5 Crashed:: :
0   libdispatch.dylib             	       0x19671aa8c __DISPATCH_WAIT_FOR_QUEUE__ + 484 
1   libdispatch.dylib             	       0x19671a5d0 _dispatch_sync_f_slow + 152 
2   DriverKit                     	       0x195d3fc1c IODispatchQueue::DispatchSync_f(void*, void (*)(void*)) + 296 
3   DriverKit                     	       0x195d40860 IOInterruptDispatchSourceThread(void*) + 380 

...but I think the code on THAT thread is actually targeting "Thread 8:", creating exactly the crash you're seeing.

In any case, a few different points here:

  • Looking over our own code, I think you could just call "ParallelTaskCompletion" directly (without the DispatchAsync). That's how all of our controllers work; however, our own controllers aren't using multiple queues for interrupt handling, so I'm not sure how that dynamic will play out.

  • What is the relationship between "sentQueue" and the rest of your queue architecture? If you're going to do this, then I think you should be using a dedicated queue* that's used to complete "all" tasks and, in particular, NOT one of your interrupt queues:

*Note that the issue here isn't that it "won't work", it's that inconsistent queue usage opens the door to exactly the kind of bug you're currently struggling with.

On our side, we created 5 DispatchQueue(s) for the interrupt processing.

Expanding on that last point, keep in mind that you're going to end up implicitly serializing ParallelTaskCompletion calls into the kernel (in other words, you can't actually complete two tasks "at the same time"). In terms of maximizing performance, this is why we created the "bundled" command— it lets all of your interrupt handlers fire at the "same"(-ish) time, then complete all of those commands simultaneously.

Lastly, I'm not sure that using separate dispatch queues for this is actually getting you any benefit. I think IOInterruptDispatchSource might be serializing interrupts callbacks inside the kernel, at which point using separate queues is only complicating your implementation.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Please post the full crash log and I'll se what I can determine.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hello, Please find crash log attached

Regards

So, let me start here:

“BUG IN CLIENT OF LIBDISPATCH: dispatch_sync called on queue already owned by current thread”.

This particular error happens any time you make a call into a dispatch queue which would inherently deadlock that queue. The most straightforward case would be calling DispatchSync inside code that's already running on that same queue, though the design of DriverKit can make it less obvious that this is what's going on. That leads me to thread 8:

Thread 8:
0   DriverKit                       	       0x195d4d29c iokit_user_client_trap + 8
1   DriverKit                       	       0x195d3cf00 OSMetaClassBase::Invoke(IORPC) + 312 
2   SCSIControllerDriverKit         	       0x195e8b364 IOUserSCSIParallelInterfaceController::ParallelTaskCompletion(OSAction*, SCSIUserParallelResponse, int (*)(OSMetaClassBase*, IORPC)) + 156 
3   com.mycompany.driverkit.MyDevice	       0x1042a6a20 invocation function for block in OsBridge::SCMD_DONE(void*, megasas_cmd_fusion*, SCSIUserParallelResponse*, megasas_cmd_fusion*)
4   DriverKit                       	       0x195d3f918 IODispatchQueueBlockFunction(void*) + 32 
5   DriverKit                       	       0x195d3f9dc invocation function for block in IODispatchQueue::DispatchAsync_f(void*, void (*)(void*)) + 164 

What is "SCMD_DONE" actually doing? What queue was it called on and what is calling out "to"?

More broadly, do your other crash logs show the same pattern or do they point to other locations?

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hello, The crash is stable and always reproduced with the same error pattern. Another example of the crash is attached

There are no DispatchSync calls in our code. Here is the SCMD_DONE:

void OsBridge::SCMD_DONE(void *gh, base_cmd * cmd, SCSIUserParallelResponse * resp) {

    Base * ghc = (Base * )gh;

    SCSIUserParallelResponse lResp = *resp;

    lResp.version = kScsiUserParallelTaskResponseCurrentVersion1;

    lResp.fControllerTaskIdentifier = cmd->req.fControllerTaskIdentifier;

    lResp.fTargetID = cmd->req.fTargetID;

    __block OSAction * act = cmd->action;

    ghc->clean_cmd(cmd);

    ghc->ivars->sentQueue->DispatchAsync(^{

        ghc->ParallelTaskCompletion(act, lResp);

        OSSafeReleaseNULL(act);

    });

}
  • SCMD_DONE - can be called from the Interrupt processing thread (Interrupt DispatchQueue)
  • sentQueue - DispatchQueue is used only for ParallelTaskCompletion calls.

We understand that the reason for the issue is a deadlock. We do not understand the reason and how we can fix it. Could you give any hints?

There are no DispatchSync calls in our code. Here is the SCMD_DONE:

Actually, there is, it's just implicit. Correlating your code with the crash log:

Thread 8:
...
2   SCSIControllerDriverKit         	       0x195e8b364 IOUserSCSIParallelInterfaceController::ParallelTaskCompletion(OSAction*, SCSIUserParallelResponse, int (*)(OSMetaClassBase*, IORPC)) + 156 
3   com.mycompany.driverkit.MyDevice	       0x1042a6a20 invocation function for block in OsBridge::SCMD_DONE(void*, megasas_cmd_fusion*, SCSIUserParallelResponse*, megasas_cmd_fusion*)

You're here:

ghc->ivars->sentQueue->DispatchAsync(^{

->    ghc->ParallelTaskCompletion(act, lResp);

I don't know enough about the exact details of how you've configured your queues to be certain about what's going on, but I suspect what's going on is that ParallelTaskCompletion is implicitly targeting the crashing thread:

Thread 5 Crashed:: :
0   libdispatch.dylib             	       0x19671aa8c __DISPATCH_WAIT_FOR_QUEUE__ + 484 
1   libdispatch.dylib             	       0x19671a5d0 _dispatch_sync_f_slow + 152 
2   DriverKit                     	       0x195d3fc1c IODispatchQueue::DispatchSync_f(void*, void (*)(void*)) + 296 
3   DriverKit                     	       0x195d40860 IOInterruptDispatchSourceThread(void*) + 380 

...but I think the code on THAT thread is actually targeting "Thread 8:", creating exactly the crash you're seeing.

In any case, a few different points here:

  • Looking over our own code, I think you could just call "ParallelTaskCompletion" directly (without the DispatchAsync). That's how all of our controllers work; however, our own controllers aren't using multiple queues for interrupt handling, so I'm not sure how that dynamic will play out.

  • What is the relationship between "sentQueue" and the rest of your queue architecture? If you're going to do this, then I think you should be using a dedicated queue* that's used to complete "all" tasks and, in particular, NOT one of your interrupt queues:

*Note that the issue here isn't that it "won't work", it's that inconsistent queue usage opens the door to exactly the kind of bug you're currently struggling with.

On our side, we created 5 DispatchQueue(s) for the interrupt processing.

Expanding on that last point, keep in mind that you're going to end up implicitly serializing ParallelTaskCompletion calls into the kernel (in other words, you can't actually complete two tasks "at the same time"). In terms of maximizing performance, this is why we created the "bundled" command— it lets all of your interrupt handlers fire at the "same"(-ish) time, then complete all of those commands simultaneously.

Lastly, I'm not sure that using separate dispatch queues for this is actually getting you any benefit. I think IOInterruptDispatchSource might be serializing interrupts callbacks inside the kernel, at which point using separate queues is only complicating your implementation.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

DriverKit. Data Integrity issue.
 
 
Q