System Panic with IOUserSCSIParallelInterfaceController during Dispatch Queue Configuration

Hello everyone,

We are in the process of migrating a high-performance storage KEXT to DriverKit. During our initial validation phase, we noticed a performance gap between the DEXT and the KEXT, which prompted us to optimize our I/O handling.

Background and Motivation:

Our test hardware is a RAID 0 array of two HDDs. According to AJA System Test, our legacy KEXT achieves a write speed of about 645 MB/s on this hardware, whereas the new DEXT reaches about 565 MB/s. We suspect the primary reason for this performance gap might be that the DEXT, by default, uses a serial work-loop to submit I/O commands, which fails to fully leverage the parallelism of the hardware array.

Therefore, to eliminate this bottleneck and improve performance, we configured a dedicated parallel dispatch queue (MyParallelIOQueue) for the UserProcessParallelTask method.

However, during our implementation attempt, we encountered a critical issue that caused a system-wide crash.

The Operation Causing the Panic:

We configured MyParallelIOQueue using the following combination of methods:

  1. In the .iig file: We appended the QUEUENAME(MyParallelIOQueue) macro after the override keyword of the UserProcessParallelTask method declaration.
  2. In the .cpp file: We manually created a queue with the same name by calling the IODispatchQueue::Create() function within our UserInitializeController method.

The Result:

This results in a macOS kernel panic during the DEXT loading process, forcing the user to perform a hard reboot.

After the reboot, checking with the systemextensionsctl list command reveals the DEXT's status as [activated waiting for user], which suggests it encountered an unrecoverable error during initialization.

Key Code Snippets to Reproduce the Panic:

  1. In .iig file - this was our exact implementation:

    class DRV_MAIN_CLASS_NAME: public IOUserSCSIParallelInterfaceController
    {
    public:
        virtual kern_return_t UserProcessParallelTask(...) override
            QUEUENAME(MyParallelIOQueue);
    };
    
  2. In .h file:

    struct DRV_MAIN_CLASS_NAME_IVars {
        // ...
        IODispatchQueue*    MyParallelIOQueue;
    };
    
  3. In UserInitializeController implementation:

    kern_return_t
    IMPL(DRV_MAIN_CLASS_NAME, UserInitializeController)
    {
        // ...
        // We also included code to manually create the queue.
        kern_return_t ret = IODispatchQueue::Create("MyParallelIOQueue",
                                                    kIODispatchQueueReentrant,
                                                    0,
                                                    &ivars->MyParallelIOQueue);
        if (ret != kIOReturnSuccess) {
            // ... error handling ...
        }
        // ...
        return kIOReturnSuccess;
    }
    

Our Question:

What is the officially recommended and most stable method for configuring UserProcessParallelTask_Impl() to use a parallel I/O queue?

Clarifying this is crucial for all developers pursuing high-performance storage solutions with DriverKit. Any explanation or guidance would be greatly appreciated.

Best Regards,

Charles

Answered by DTS Engineer in 865478022

Therefore, to eliminate this bottleneck and improve performance, we configured a dedicated parallel dispatch queue (MyParallelIOQueue) for the UserProcessParallelTask method.

Yeah... that won't work. UserProcessParallelTask is an OSAction target, which is already targeting a queue. I'd be curious to see how the panic() played out*, but I'm not surprised that you panicked.

*I suspect you deadlocked command submission long enough that the SCSI stack gave up and panicked, but that's purely a guess.

That leads to here:

What is the officially recommended and most stable method for configuring UserProcessParallelTask_Impl() to use a parallel I/O queue?

The answer here is to shift to UserProcessBundledParallelTasks. The architecture is a bit more complex, but it allows you to asynchronously receive and complete tasks in parallel while also reusing command and response buffers to minimize wiring cost. Take a look at the header files from SCSIControllerDriverKit for details on how this architecture works.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware


Hi Kevin,

Thank you for your previous guidance. We have shifted our driver architecture to the UserProcessBundledParallelTasks model with Shared Command/Response Buffers to optimize I/O performance.

However, immediately after switching to Bundled mode we encounter a persistent kernel panic / DEXT crash (Corpse) that prevents the target discovery and initialization sequence from completing, while the legacy mode (UserProcessParallelTask) remains rock-solid and stable.

Implementation Details:

  1. Memory Mapping: We successfully implemented UserMapBundledParallelTaskCommandAndResponseBuffers; the DEXT correctly obtains the virtual addresses for both buffers.
  2. Dispatching: Inside UserProcessBundledParallelTasks, we iterate through the parallelRequestSlotIndices, reading from the shared command buffer and dispatching tasks to the hardware.
  3. Completion: Upon hardware completion, we populate the shared response buffer in the asynchronous path (ISR/Poll) and invoke BundledParallelTaskCompletion with the corresponding OSAction.

The Critical Race Condition Observed:

Due to the very low latency of our hardware for specific commands (e.g., TEST UNIT READY), logs reveal a severe timing conflict:

  • The interrupt handler (Completion path) is triggered and successfully invokes BundledParallelTaskCompletion, which returns normally.
  • Crucially, at this exact microsecond, the original UserProcessBundledParallelTasks call (Submission path) has not yet finished its loop or returned to the system.
  • Immediately after both paths eventually return, the DEXT process crashes (Corpse), subsequently triggering a Kernel Panic.

Troubleshooting Steps Taken (Issue Persists):

  • Memory Protection: Clamped Sense Data to exactly 18 bytes (SCSI fixed format) to prevent any potential buffer overflows in the shared response buffer.
  • Array Sizing: Ensured that the index array passed to BundledParallelTaskCompletion is a fixed-size uint16_t[32] to align with the .iig declaration and ensure RPC serialization safety.
  • Reference Counting: Tested both calling and omitting release() on the OSAction in the completion path. The crash occurs regardless of manual release.
  • Memory Barriers: Implemented __atomic_thread_fence(__ATOMIC_SEQ_CST) to ensure shared buffer writes are visible before signaling completion.

Questions:

  1. In Bundled mode, does DriverKit support a scenario where the Completion RPC returns before the Submission RPC has finished its execution? Does this cause an internal state-machine conflict within IOUserSCSIParallelInterfaceController?
  2. What is the exact ownership model for the OSAction * action in Bundled mode? Is the action "consumed" by BundledParallelTaskCompletion, or is the driver still responsible for retain/release management as in legacy mode?

Any advice on how to resolve this would be very helpful. Thanks!

Best Regards,

Charles

Due to the very low latency of our hardware for specific commands (e.g., TEST UNIT READY), logs reveal a severe timing conflict:

  • The interrupt handler (Completion path) is triggered and successfully invokes BundledParallelTaskCompletion, which returns normally.

  • Crucially, at this exact microsecond, the original UserProcessBundledParallelTasks call (Submission path) has not yet finished its loop or returned to the system.

  • Immediately after both paths eventually return, the DEXT process crashes (Corpse), subsequently triggering a Kernel Panic.

Something doesn't sound right here. There are actually a few different things that concern me:

(1)
There's a reasonably long "cycle" to "get" to your interrupt handler from the kernel. I'm skeptical that there's any way for an interrupt to fire before UserProcessBundledParallelTasks finishes running unless UserProcessBundledParallelTasks is doing "something" that substantially delays its own progress.

(2)
From the kernel side, how bundled commands actually "work" is that ProcessParallelTask claims a slot for each command, preps the command, then passes that command over to a secondary dispatch "engine" which actually ends up calling into UserProcessBundledParallelTasks(). This lets multiple commands be enqueued in the time gap that delays commands reaching your driver, which is where most of the performance benefit comes from.

However, it also means that the kernel driver is always "done" with any given command WELL before it ever reaches your DEXT. In theory, I think you could immediately complete any given command, directly inside UserProcessBundledParallelTasks, since all of the "bookkeeping" (which UserCompleteBundledParallelTask will manipulate) was actually done BEFORE UserProcessBundledParallelTasks() was called at all.

(3)
In the standard configuration, I believe UserProcessBundledParallelTasks and UserCompleteBundledParallelTask implicitly target the default dispatch queue, so that can't actually be called at the "same" time. Note that this is NOT in fact a meaningful performance bottleneck, as the implementation of both methods is simple enough that they shouldn't have any meaningful effect on each other.

That leads to here:

Immediately after both paths eventually return, the DEXT process crashes (Corpse), subsequently triggering a Kernel Panic.

Crashes how? Do you have a crash log for your DEXT? Or is it present in the kernel panic thread capture?

FYI, there's a forum post here and then here that outlines how to fully symbolicate a kernel panic, including all system threads. If your DEXT is still "live", that would show you what it was doing.

  1. What is the exact ownership model for the OSAction * action in Bundled mode? Is the action "consumed" by BundledParallelTaskCompletion, or is the driver still responsible for retain/release management as in legacy mode?

Our own driver implements UserProcessBundledParallelTasks() by building a "local" SCSIUserParallelTask object for each incoming command, then calling into its existing "UserProcessParallelTask" method to handle the actual commands, so it ends up retaining/releasing the completion once for every command that comes in.

Having said that, if you check the actual value you're receiving, I think you'll find that the actual "byte" value you're receiving is always the same, because it's actually a single action that's being reused for every command, so I'm not sure it really matters all that much.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hi Kevin,

Thank you for your valuable insights. Following your advice, we have refactored our driver to use a "wrapper" pattern where UserProcessBundledParallelTasks serves as a high-performance entry point that forwards commands to our core dispatch logic.

To eliminate potential race conditions, we have also moved our interrupt handling to the Default Dispatch Queue using kIOServiceDefaultQueueName. This ensures that command submission and completion are strictly serialized.

Here are the key implementation details showing how we unified the dispatch logic:

1. Legacy Entry Point (Single Task):

We extracted our core logic into a helper method, DispatchTaskInternal, passing 0xFFFF as a placeholder for the Slot Index.

kern_return_t MyDriver::UserProcessParallelTask_Impl(
    SCSIUserParallelTask parallelRequest,
    uint32_t *response,
    OSAction *completion)
{
    // Forward to unified internal dispatcher with no slot index (Legacy Mode)
    return DispatchTaskInternal(parallelRequest, response, completion, 0xFFFF);
}

2. Bundled Submission Path:

In our UserProcessBundledParallelTasks implementation, we iterate through the indices, create a local object copy of the request (as you suggested), and call the same internal dispatcher.

void MyDriver::UserProcessBundledParallelTasks_Impl(
    const uint16_t parallelRequestSlotIndices[32],
    uint16_t parallelRequestSlotIndicesCount,
    OSAction * completion)
{
    for (uint16_t i = 0; i < parallelRequestSlotIndicesCount; i++) {
        uint16_t slotIndex = parallelRequestSlotIndices[i];
        const SCSIUserParallelTask& sharedReq = ivars->fCommandBuffers[slotIndex];
        
        // Create a local copy to utilize existing processing logic
        SCSIUserParallelTask localReq = sharedReq;
        uint32_t response;
        
        // DispatchTaskInternal performs a completion->retain() for EVERY command,
        // ensuring the action stays alive regardless of the bundled count.
        DispatchTaskInternal(localReq, &response, completion, slotIndex);
    }
}

3. Unified Asynchronous Completion Path:

Regardless of the mode, we now use the stable ParallelTaskCompletion within the interrupt handler (ISR), followed by the corresponding release().

// Inside the Asynchronous Interrupt Handler (serialized on Default Queue)
SCSIUserParallelResponse taskResponse = {};
// ... Populate status, TaskID, and Version ...

// Signal completion to the framework
ParallelTaskCompletion(completion, taskResponse);

// Balance the reference count performed during submission
completion->release();

Observations and Current Bottleneck:

If we intercept a command (like TEST UNIT READY) and call the completion logic "synchronously" within the UserProcessBundledParallelTasks scope, the system works perfectly and immediately issues the next command (e.g., INQUIRY).

As soon as the completion is triggered from the asynchronous interrupt path—even though it is serialized on the same Default Queue—invoking ParallelTaskCompletion (or BundledParallelTaskCompletion) causes the DEXT to crash immediately (Corpse / Address size fault).

Questions:

Since we have ensured an independent retain/release cycle for every command and used a serialized queue, why does the OSAction handle appear to become invalid once it enters an asynchronous callback context in Bundled mode? Is there a hidden thread-context or lifecycle restriction on the OSAction provided during a bundled call that differs from legacy mode?

We have verified that the legacy mode remains 100% stable under the same serialized queue configuration. We would appreciate any further guidance on why the "Sync succeeds, Async crashes" behavior persists in the high-performance path.

Best Regards,

Charles

To eliminate potential race conditions, we have also moved our interrupt handling to the Default Dispatch Queue using kIOServiceDefaultQueueName.

I'm not sure you want to do this, as I think it might end up causing BundledParallelTaskCompletion() to deadlock itself.

As soon as the completion is triggered from the asynchronous interrupt path—even though it is serialized on the same Default Queue—invoking ParallelTaskCompletion (or BundledParallelTaskCompletion) causes the DEXT to crash immediately (Corpse / Address size fault).

Can you post the crash log for this?

We extracted our core logic into a helper method, DispatchTaskInternal, passing 0xFFFF as a placeholder for the Slot Index.

Is your "DispatchTaskInternal" method just a simple helper method/function or is it actually running on a different queue? If it's on a different queue, then I would strongly suggest that you eliminate that entirely and just do "everything" on the same thread.

The issue here is that, in practice, most of what your driver actually "does" takes VERY little time (short enough intervals that even measuring it is somewhat difficult), which can easily create a situation where the process of moving work between threads is actually taking FAR more time than the work itself. The net result is both slower and more complicated, without any real benefit.

In our UserProcessBundledParallelTasks implementation, we iterate through the indices, create a local object copy of the request (as you suggested), and call the same internal dispatcher.

That all looks reasonable.

Regardless of the mode, we now use the stable ParallelTaskCompletion within the interrupt handler (ISR).

That's incorrect. You MUST use BundledParallelTaskCompletion with the bundled I/O path. Note that, just like work arrives in "clusters", it's expected that your interrupt handler will be able to process multiple commands every time it fires, so that BundledParallelTaskCompletion can clear multiple slots at a time.

As soon as the completion is triggered from the asynchronous interrupt path—even though it is serialized on the same Default Queue—invoking ParallelTaskCompletion (or BundledParallelTaskCompletion) causes the DEXT to crash immediately (Corpse / Address size fault).

So, I spent some more time looking at our driver that implements BundledParallelTaskCompletion and I'm increasingly convinced that it's unintentionally over-retaining the action handler. So, three requests/suggestions:

  1. Please file a bug asking us to clarify how BundledParallelTaskCompletion should be handled and post the bug number back here.

  2. Check that value you're receiving for "completion" in UserProcessBundledParallelTasks(). I think you'll find that you're always receiving the same value, but I'd like to confirm that.

  3. For basic testing purposes, try over-retaining completion and retest. If you want to maintain your existing retain/release code, just add an extra retain at the start of UserProcessBundledParallelTasks().

That should sort out whether or not the management of the action itself is directly involved here.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hi Kevin,

Thank you for your detailed feedback and for spending time looking at your internal driver implementation. I have followed your suggestions and filed a formal bug report.

1. Bug Report Filed

I have filed a bug report via Feedback Assistant.

Feedback ID: FB21636775

I have attached the original and symbolicated crash logs, along with reduced code snippets showing our implementation.

2. OSAction Pointer Confirmation

I have verified the value of the completion pointer received in UserProcessBundledParallelTasks. As you suspected, the pointer address is identical for all commands within a single bundle.

3. Crash Log Insights

The symbolicated crash log confirms that the panic is triggered by an __assert_rtn inside OSMetaClassBase::QueueForObject during the call to completion.

Specifically, when we attempted the "Unified Path" (using legacy ParallelTaskCompletion for bundled commands) as a stability test, it triggered an immediate panic, which confirms your point that we MUST use BundledParallelTaskCompletion for the bundled path.

Crashed Thread Backtrace (Summary):

Thread 4 Crashed:: Dispatch queue: DriverKitAcxxx-Default
...
4   DriverKit                     0x19e255fb8 __assert_rtn + 88
5   DriverKit                     0x19e25629c OSMetaClassBase::QueueForObject(...)
6   DriverKit                     0x19e2267bc OSMetaClassBase::QueueForObject(...)
7   DriverKit                     0x19e226fac OSMetaClassBase::Invoke(IORPC) + 476
8   SCSIControllerDriverKit       0x19e373364 IOUserSCSIParallelInterfaceController::ParallelTaskCompletion(...)
9   <BundleID>                0x1003156a4 MyDriver::InterruptHandler(...)

Next Steps:

I am now proceeding with the "over-retaining" experiment you suggested (adding an extra retain at the start of the bundled submission loop) to see if it resolves the asynchronous crash when using BundledParallelTaskCompletion. I will update you with those results shortly.

Best Regards,

Charles

Hi Kevin,

Thank you for the suggestion. We have performed the "over-retaining" experiment by adding an extra retain() to the OSAction at the start of the UserProcessBundledParallelTasks loop.

Unfortunately, the result remains the same: the DEXT still triggers a Corpse crash immediately upon calling the completion signal in the asynchronous path.

As the legacy mode (UserProcessParallelTask) remains 100% stable under our newly serialized queue configuration, we have decided to revert to the legacy path for now to continue our product development while we wait for investigation on the bug report (FB21636775).

Any further insights into why the asynchronous completion fails specifically in Bundled mode would be greatly appreciated.

Best Regards,

Charles
