System Panic with IOUserSCSIParallelInterfaceController during Dispatch Queue Configuration

Hello everyone,

We are in the process of migrating a high-performance storage KEXT to DriverKit. During our initial validation phase, we noticed a performance gap between the DEXT and the KEXT, which prompted us to try to optimize our I/O handling process.

Background and Motivation:

Our test hardware is a RAID 0 array of two HDDs. According to AJA System Test, our legacy KEXT achieves a write speed of about 645 MB/s on this hardware, whereas the new DEXT reaches about 565 MB/s. We suspect the primary reason for this performance gap might be that the DEXT, by default, uses a serial work-loop to submit I/O commands, which fails to fully leverage the parallelism of the hardware array.

Therefore, to eliminate this bottleneck and improve performance, we configured a dedicated parallel dispatch queue (MyParallelIOQueue) for the UserProcessParallelTask method.

However, during our implementation attempt, we encountered a critical issue that caused a system-wide crash.

The Operation Causing the Panic:

We configured MyParallelIOQueue using the following combination of methods:

  1. In the .iig file: We appended the QUEUENAME(MyParallelIOQueue) macro after the override keyword of the UserProcessParallelTask method declaration.
  2. In the .cpp file: We manually created a queue with the same name by calling the IODispatchQueue::Create() function within our UserInitializeController method.

The Result:

This results in a macOS kernel panic during the DEXT loading process, forcing the user to perform a hard reboot.

After the reboot, checking with the systemextensionsctl list command reveals the DEXT's status as [activated waiting for user], which indicates that it encountered an unrecoverable, fatal error during its initialization.

Key Code Snippets to Reproduce the Panic:

  1. In .iig file - this was our exact implementation:

    class DRV_MAIN_CLASS_NAME: public IOUserSCSIParallelInterfaceController
    {
    public:
        virtual kern_return_t UserProcessParallelTask(...) override
            QUEUENAME(MyParallelIOQueue);
    };
    
  2. In .h file:

    struct DRV_MAIN_CLASS_NAME_IVars {
        // ...
        IODispatchQueue*    MyParallelIOQueue;
    };
    
  3. In UserInitializeController implementation:

    kern_return_t
    IMPL(DRV_MAIN_CLASS_NAME, UserInitializeController)
    {
        // ...
        // We also included code to manually create the queue.
        kern_return_t ret = IODispatchQueue::Create("MyParallelIOQueue",
                                                    kIODispatchQueueReentrant,
                                                    0,
                                                    &ivars->MyParallelIOQueue);
        if (ret != kIOReturnSuccess) {
            // ... error handling ...
        }
        // ...
        return kIOReturnSuccess;
    }
    

Our Question:

What is the officially recommended and most stable method for configuring UserProcessParallelTask_Impl() to use a parallel I/O queue?

Clarifying this is crucial for all developers pursuing high-performance storage solutions with DriverKit. Any explanation or guidance would be greatly appreciated.

Best Regards,

Charles

Answered by DTS Engineer in 865478022

Therefore, to eliminate this bottleneck and improve performance, we configured a dedicated parallel dispatch queue (MyParallelIOQueue) for the UserProcessParallelTask method.

Yeah... that won't work. UserProcessParallelTask is an OSAction target, which is already targeting a queue. I'd be curious to see how the panic() played out*, but I'm not surprised that you panicked.

*I suspect you deadlocked command submission long enough that the SCSI stack gave up and panicked, but that's purely a guess.

That leads to here:

What is the officially recommended and most stable method for configuring UserProcessParallelTask_Impl() to use a parallel I/O queue?

The answer here is to shift to UserProcessBundledParallelTasks. The architecture is a bit more complex, but it allows you to asynchronously receive and complete tasks in parallel while also reusing command and response buffers to minimize wiring cost. Take a look at the header files from SCSIControllerDriverKit for details on how this architecture works.
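For orientation, a minimal sketch of what the bundled-path overrides might look like in the .iig file, pieced together from the method names and signatures discussed later in this thread. The authoritative declarations (including the exact parameter lists elided here with "...") live in the SCSIControllerDriverKit headers, so treat this as an assumption, not the canonical interface:

```cpp
// Hypothetical sketch only; verify every declaration against the
// SCSIControllerDriverKit headers before use.
class DRV_MAIN_CLASS_NAME : public IOUserSCSIParallelInterfaceController
{
public:
    // Maps the shared command/response buffers into the DEXT's
    // address space (parameters elided; see the framework header).
    virtual kern_return_t
    UserMapBundledParallelTaskCommandAndResponseBuffers(...) override;

    // Receives a batch of slot indices referencing commands already
    // prepared in the shared command buffer.
    virtual void UserProcessBundledParallelTasks(
        const uint16_t parallelRequestSlotIndices[32],
        uint16_t parallelRequestSlotIndicesCount,
        OSAction *completion) override;
};
```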

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware


Hi Kevin,

Thank you for your previous guidance. We have shifted our driver architecture to the UserProcessBundledParallelTasks model with Shared Command/Response Buffers to optimize I/O performance.

However, we are encountering a persistent kernel panic / DEXT crash (corpse) immediately after switching to the Bundled mode, which prevents the target discovery and initialization sequence from completing, while the legacy mode (UserProcessParallelTask) remains rock-solid and stable.

Implementation Details:

  1. Memory Mapping: Successfully implemented UserMapBundledParallelTaskCommandAndResponseBuffers. DEXT correctly obtains the virtual addresses for both buffers.
  2. Dispatching: Inside UserProcessBundledParallelTasks, we iterate through the parallelRequestSlotIndices, reading from the shared command buffer and dispatching tasks to the hardware.
  3. Completion: Upon hardware completion, we populate the shared response buffer in the asynchronous path (ISR/Poll) and invoke BundledParallelTaskCompletion with the corresponding OSAction.

The Critical Race Condition Observed:

Due to the very low latency of our hardware for specific commands (e.g., TEST UNIT READY), logs reveal a severe timing conflict:

  • The interrupt handler (Completion path) is triggered and successfully invokes BundledParallelTaskCompletion, which returns normally.
  • Crucially, at this exact microsecond, the original UserProcessBundledParallelTasks call (Submission path) has not yet finished its loop or returned to the system.
  • Immediately after both paths eventually return, the DEXT process crashes (Corpse), subsequently triggering a Kernel Panic.

Troubleshooting Steps Taken (Issue Persists):

  • Memory Protection: Clamped Sense Data to exactly 18 bytes (SCSI fixed format) to prevent any potential buffer overflows in the shared response buffer.
  • Array Sizing: Ensured that the index array passed to BundledParallelTaskCompletion is a fixed-size uint16_t[32] to align with the .iig declaration and ensure RPC serialization safety.
  • Reference Counting: Tested both calling and omitting release() on the OSAction in the completion path. The crash occurs regardless of manual release.
  • Memory Barriers: Implemented __atomic_thread_fence(__ATOMIC_SEQ_CST) to ensure shared buffer writes are visible before signaling completion.

Questions:

  1. In Bundled mode, does DriverKit support a scenario where the Completion RPC returns before the Submission RPC has finished its execution? Does this cause an internal state-machine conflict within IOUserSCSIParallelInterfaceController?
  2. What is the exact ownership model for the OSAction * action in Bundled mode? Is the action "consumed" by BundledParallelTaskCompletion, or is the driver still responsible for retain/release management as in legacy mode?

Any advice on how to resolve this would be very helpful. Thanks!

Best Regards,

Charles

Due to the very low latency of our hardware for specific commands (e.g., TEST UNIT READY), logs reveal a severe timing conflict:

  • The interrupt handler (Completion path) is triggered and successfully invokes BundledParallelTaskCompletion, which returns normally.

  • Crucially, at this exact microsecond, the original UserProcessBundledParallelTasks call (Submission path) has not yet finished its loop or returned to the system.

  • Immediately after both paths eventually return, the DEXT process crashes (Corpse), subsequently triggering a Kernel Panic.

Something doesn't sound right here. There are actually a few different things that concern me:

(1)
There's a reasonably long "cycle" to "get" to your interrupt handler from the kernel. I'm skeptical that there's any way for an interrupt to fire before UserProcessBundledParallelTasks finishes running unless UserProcessBundledParallelTasks is doing "something" that substantially delays its own processing.

(2)
From the kernel side, how bundled commands actually "work" is that ProcessParallelTask claims a slot for each command, preps the command, then passes that command over to a secondary dispatch "engine" which actually ends up calling into UserProcessBundledParallelTasks(). This lets multiple commands be enqueued in the time gap that delays commands reaching your driver, which is where most of the performance benefit comes from.

However, it also means that the kernel driver is always "done" with any given command WELL before it ever reaches your DEXT. In theory, I think you could immediately complete any given command, directly inside UserProcessBundledParallelTasks, since all of the "bookkeeping" (which UserCompleteBundledParallelTask will manipulate) was actually done BEFORE UserProcessBundledParallelTasks() was called at all.
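To make that concrete, an immediate-completion sketch inside the bundled entry point might look like the following. This is purely illustrative: I'm assuming a BundledParallelTaskCompletion signature that takes the action plus a uint16_t[32] index array and a count, as described elsewhere in this thread, so verify against the SCSIControllerDriverKit headers:

```cpp
// Hypothetical sketch; the completion signature is an assumption.
void DRV_MAIN_CLASS_NAME::UserProcessBundledParallelTasks_Impl(
    const uint16_t parallelRequestSlotIndices[32],
    uint16_t parallelRequestSlotIndicesCount,
    OSAction *completion)
{
    uint16_t completed[32] = {};
    uint16_t completedCount = 0;

    for (uint16_t i = 0; i < parallelRequestSlotIndicesCount; i++) {
        uint16_t slot = parallelRequestSlotIndices[i];
        // Populate the shared response buffer for this slot here, then
        // mark the slot as done. The kernel finished its bookkeeping for
        // this command before this call was made, so completing it
        // immediately should be safe in principle.
        completed[completedCount++] = slot;
    }

    // Assumed parameter order: action, index array, count.
    BundledParallelTaskCompletion(completion, completed, completedCount);
}
```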

(3)
In the standard configuration, I believe UserProcessBundledParallelTasks and UserCompleteBundledParallelTask implicitly target the default dispatch queue, so they can't actually be called at the "same" time. Note that this is NOT in fact a meaningful performance bottleneck, as the implementation of both methods is simple enough that they shouldn't have any meaningful effect on each other.

That leads to here:

Immediately after both paths eventually return, the DEXT process crashes (Corpse), subsequently triggering a Kernel Panic.

Crashes how? Do you have a crash log for your DEXT? Or is it present in the kernel panic thread capture?

FYI, there's a forum post here and then here that outlines how to fully symbolicate a kernel panic, including all system threads. If your DEXT is still "live", that would show you what it was doing.

  1. What is the exact ownership model for the OSAction * action in Bundled mode? Is the action "consumed" by BundledParallelTaskCompletion, or is the driver still responsible for retain/release management as in legacy mode?

Our own driver implements UserProcessBundledParallelTasks() by building a "local" SCSIUserParallelTask object for each incoming command, then calling into its existing "UserProcessParallelTask" method to handle the actual commands, so it ends up retaining/releasing the completion once for every command that comes in.

Having said that, if you check the actual value you're receiving, I think you'll find that the actual "byte" value you're receiving is always the same, because it's actually a single action that's being reused for every command, so I'm not sure it really matters all that much.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hi Kevin,

Thank you for your valuable insights. Following your advice, we have refactored our driver to use a "wrapper" pattern where UserProcessBundledParallelTasks serves as a high-performance entry point that forwards commands to our core dispatch logic.

To eliminate potential race conditions, we have also moved our interrupt handling to the Default Dispatch Queue using kIOServiceDefaultQueueName. This ensures that command submission and completion are strictly serialized.

Here are the key implementation details showing how we unified the dispatch logic:

1. Legacy Entry Point (Single Task):

We extracted our core logic into a helper method, DispatchTaskInternal, passing 0xFFFF as a placeholder for the Slot Index.

kern_return_t MyDriver::UserProcessParallelTask_Impl(
    SCSIUserParallelTask parallelRequest,
    uint32_t *response,
    OSAction *completion)
{
    // Forward to unified internal dispatcher with no slot index (Legacy Mode)
    return DispatchTaskInternal(parallelRequest, response, completion, 0xFFFF);
}

2. Bundled Submission Path:

In our UserProcessBundledParallelTasks implementation, we iterate through the indices, create a local object copy of the request (as you suggested), and call the same internal dispatcher.

void MyDriver::UserProcessBundledParallelTasks_Impl(
    const uint16_t parallelRequestSlotIndices[32],
    uint16_t parallelRequestSlotIndicesCount,
    OSAction * completion)
{
    for (uint16_t i = 0; i < parallelRequestSlotIndicesCount; i++) {
        uint16_t slotIndex = parallelRequestSlotIndices[i];
        const SCSIUserParallelTask& sharedReq = ivars->fCommandBuffers[slotIndex];
        
        // Create a local copy to utilize existing processing logic
        SCSIUserParallelTask localReq = sharedReq;
        uint32_t response;
        
        // DispatchTaskInternal performs a completion->retain() for EVERY command,
        // ensuring the action stays alive regardless of the bundled count.
        DispatchTaskInternal(localReq, &response, completion, slotIndex);
    }
}

3. Unified Asynchronous Completion Path:

Regardless of the mode, we now use the stable ParallelTaskCompletion within the interrupt handler (ISR), followed by the corresponding release().

// Inside the Asynchronous Interrupt Handler (serialized on Default Queue)
SCSIUserParallelResponse taskResponse = {};
// ... Populate status, TaskID, and Version ...

// Signal completion to the framework
ParallelTaskCompletion(completion, taskResponse);

// Balance the reference count performed during submission
completion->release();

Observations and Current Bottleneck:

If we intercept a command (like TEST UNIT READY) and call the completion logic "synchronously" within the UserProcessBundledParallelTasks scope, the system works perfectly and immediately issues the next command (e.g., INQUIRY).

As soon as the completion is triggered from the asynchronous interrupt path—even though it is serialized on the same Default Queue—invoking ParallelTaskCompletion (or BundledParallelTaskCompletion) causes the DEXT to crash immediately (Corpse / Address size fault).

Questions:

Since we have ensured an independent retain/release cycle for every command and used a serialized queue, why does the OSAction handle appear to become invalid once it enters an asynchronous callback context in Bundled mode? Is there a hidden thread-context or lifecycle restriction on the OSAction provided during a bundled call that differs from legacy mode?

We have verified that the legacy mode remains 100% stable under the same serialized queue configuration. We would appreciate any further guidance on why the "Sync succeeds, Async crashes" behavior persists in the high-performance path.

Best Regards,

Charles
