System Panic with IOUserSCSIParallelInterfaceController during Dispatch Queue Configuration

Hello everyone,

We are in the process of migrating a high-performance storage KEXT to DriverKit. During our initial validation phase, we noticed a performance gap between the DEXT and the KEXT, which prompted us to try to optimize our I/O handling path.

Background and Motivation:

Our test hardware is a RAID 0 array of two HDDs. According to AJA System Test, our legacy KEXT achieves a write speed of about 645 MB/s on this hardware, whereas the new DEXT reaches about 565 MB/s. We suspect the primary reason for this performance gap might be that the DEXT, by default, uses a serial work-loop to submit I/O commands, which fails to fully leverage the parallelism of the hardware array.

Therefore, to eliminate this bottleneck and improve performance, we configured a dedicated parallel dispatch queue (MyParallelIOQueue) for the UserProcessParallelTask method.

However, during our implementation attempt, we encountered a critical issue that caused a system-wide crash.

The Operation Causing the Panic:

We configured MyParallelIOQueue using the following combination of methods:

  1. In the .iig file: We appended the QUEUENAME(MyParallelIOQueue) macro after the override keyword of the UserProcessParallelTask method declaration.
  2. In the .cpp file: We manually created a queue with the same name by calling the IODispatchQueue::Create() function within our UserInitializeController method.

The Result:

This results in a macOS kernel panic during the DEXT loading process, forcing the user to perform a hard reboot.

After the reboot, checking with the systemextensionsctl list command shows the DEXT's status as [activated waiting for user], which suggests that the DEXT never completed activation.

Key Code Snippets to Reproduce the Panic:

  1. In .iig file - this was our exact implementation:

    class DRV_MAIN_CLASS_NAME: public IOUserSCSIParallelInterfaceController
    {
    public:
        virtual kern_return_t UserProcessParallelTask(...) override
            QUEUENAME(MyParallelIOQueue);
    };
    
  2. In .h file:

    struct DRV_MAIN_CLASS_NAME_IVars {
        // ...
        IODispatchQueue*    MyParallelIOQueue;
    };
    
  3. In UserInitializeController implementation:

    kern_return_t
    IMPL(DRV_MAIN_CLASS_NAME, UserInitializeController)
    {
        // ...
        // We also included code to manually create the queue.
        kern_return_t ret = IODispatchQueue::Create("MyParallelIOQueue",
                                                    kIODispatchQueueReentrant,
                                                    0,
                                                    &ivars->MyParallelIOQueue);
        if (ret != kIOReturnSuccess) {
            // ... error handling ...
        }
        // ...
        return kIOReturnSuccess;
    }
    

Our Question:

What is the officially recommended and most stable method for configuring UserProcessParallelTask_Impl() to use a parallel I/O queue?

Clarifying this is crucial for all developers pursuing high-performance storage solutions with DriverKit. Any explanation or guidance would be greatly appreciated.

Best Regards,

Charles

Answered by DTS Engineer in 865478022

> Therefore, to eliminate this bottleneck and improve performance, we configured a dedicated parallel dispatch queue (MyParallelIOQueue) for the UserProcessParallelTask method.

Yeah... that won't work. UserProcessParallelTask is an OSAction target, which is already targeting a queue. I'd be curious to see how the panic() played out*, but I'm not surprised that you panicked.

*I suspect you deadlocked command submission long enough that the SCSI stack gave up and panicked, but that's purely a guess.

That leads to here:

> What is the officially recommended and most stable method for configuring UserProcessParallelTask_Impl() to use a parallel I/O queue?

The answer here is to shift to UserProcessBundledParallelTasks. The architecture is a bit more complex, but it allows you to asynchronously receive and complete tasks in parallel while also reusing command and response buffers to minimize wiring cost. Take a look at the header files from SCSIControllerDriverKit for details on how this architecture works.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hi Kevin,

The Bundled Mode implementation is stable now, and we’ve been running some benchmarks based on your suggestions about 16KB page sizes and real-world performance.

We analyzed the count parameter in UserProcessBundledParallelTasks specifically during the 16K Random I/O test session. Across 10,147 processed batches, about 15.6% of the calls had multiple requests bundled together, with a peak of 6 slots in a single RPC. While the majority (84.4%) were still dispatched with count=1 due to the high responsiveness of the DEXT, the batching definitely kicks in as the load increases.
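For anyone who wants to collect the same kind of statistics, here is a minimal sketch of the tally, assuming the count value of each call has already been logged into a vector. The names BatchStats and summarizeBatchCounts are hypothetical and are not part of DriverKit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Summary of how often a bundled call carried more than one request.
struct BatchStats {
    std::size_t calls = 0;   // total number of bundled calls observed
    std::size_t multi = 0;   // calls whose count was greater than 1
    std::uint32_t peak = 0;  // largest count seen in a single call

    // Share of calls that bundled more than one request, in percent.
    double multiPercent() const {
        return calls == 0
            ? 0.0
            : 100.0 * static_cast<double>(multi) / static_cast<double>(calls);
    }
};

// Hypothetical helper: pass in the count recorded for each bundled call.
BatchStats summarizeBatchCounts(const std::vector<std::uint32_t>& counts) {
    BatchStats s;
    for (std::uint32_t c : counts) {
        assert(c >= 1 && "a bundled call should carry at least one request");
        ++s.calls;
        if (c > 1) {
            ++s.multi;
        }
        s.peak = std::max(s.peak, c);
    }
    return s;
}
```

Logging only the count per call keeps the instrumentation cheap, so it should not noticeably disturb the benchmark it is measuring.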

Regarding the throughput—we noticed that the results fluctuate based on the system's cache state and the physical limits of the 4-HDD RAID 5 setup. Since 4K and 16K random writes are smaller than the RAID stripe size, they often trigger Read-Modify-Write (RMW) cycles, making mechanical seek time the primary bottleneck.
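To make the RMW condition concrete, here is a rough sketch of the check, under a simplified model in which a RAID 5 write avoids RMW only when it covers whole full stripes. The 64 KiB chunk size used in the usage note is an assumption for illustration, not our actual configuration, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Full-stripe size of a RAID 5 array: one chunk per data disk,
// with one disk's worth of each stripe holding parity.
std::uint64_t raid5FullStripeBytes(std::uint32_t diskCount,
                                   std::uint64_t chunkBytes) {
    assert(diskCount >= 3 && "RAID 5 needs at least three disks");
    assert(chunkBytes > 0);
    return static_cast<std::uint64_t>(diskCount - 1) * chunkBytes;
}

// Simplified model: a write needs read-modify-write unless it starts on a
// stripe boundary and its length is a whole number of full stripes.
bool needsReadModifyWrite(std::uint64_t offset,
                          std::uint64_t length,
                          std::uint64_t stripeBytes) {
    assert(stripeBytes > 0);
    assert(length > 0);
    return (offset % stripeBytes != 0) || (length % stripeBytes != 0);
}
```

With four disks and an assumed 64 KiB chunk, the full stripe is 192 KiB, so both 4K and 16K writes fall into the RMW case in this model. Real controllers may apply other strategies, so treat this as an approximation.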

Even with these hardware constraints, Bundled Mode is a clear winner for random reads. Here is a summary of the numbers we observed under normal system cache conditions:


| Test Case | Bundled Mode | Legacy Mode | Difference |
|---|---|---|---|
| 4K Random Read | 131 MiB/s | 94.7 MiB/s | +38.4% |
| 4K Random Write | 35.5 MiB/s | 33.1 MiB/s | +7.3% |
| 16K Random Read | 222 MiB/s | 198 MiB/s | +12.1% |
| 16K Random Write | 2098 MiB/s | 2160 MiB/s | Comparable (Cache Hit) |

The 38.4% gain in 4K Random Reads is particularly impressive. It shows that by reducing IPC overhead and context switches, we can get commands to the hardware queue much faster, which significantly improves responsiveness even before the physical disk heads move.
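For reference, the Difference figures above are relative changes against Legacy Mode. A small sketch of that arithmetic follows; percentDifference is a hypothetical helper name. Because the throughput values reported above are already rounded, recomputing from them can differ from the reported percentages in the last digit.

```cpp
#include <cassert>
#include <cmath>

// Relative change of Bundled Mode throughput against Legacy Mode, in percent.
// Positive means Bundled Mode was faster.
double percentDifference(double bundledMiBps, double legacyMiBps) {
    assert(legacyMiBps > 0.0);
    return 100.0 * (bundledMiBps - legacyMiBps) / legacyMiBps;
}
```

For example, percentDifference(222.0, 198.0) evaluates to roughly 12.1, matching the 16K Random Read figure.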

Overall, Bundled Mode consistently handles high-frequency random requests better than Legacy Mode.

Best regards,

Charles

> Since 4K and 16K random writes are smaller than the RAID stripe size, they often trigger Read-Modify-Write (RMW) cycles, making mechanical seek time the primary bottleneck.

How big is your stripe size? Have you tried exporting that size to the higher-level system through UserReportHBAConstraints (probably by setting kIOMinimumSegmentAlignmentByteCountKey)?

I don't know how far it's been pushed, but in theory, the higher-level system is prepared to deal with "arbitrarily" large I/O blocks. I don't know if the performance benefit will justify the memory cost, but that would theoretically eliminate all RMW cycles by simply forcing all I/O to be stripe-aligned.
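To give a feel for the memory cost mentioned above, here is a sketch of the arithmetic behind rounding a request out to stripe boundaries. It only illustrates what "stripe-aligned" means for a byte range; it is not how the system itself applies alignment constraints. ByteRange and alignToStripe are hypothetical names, and the 192 KiB stripe in the usage note is an assumed value.

```cpp
#include <cassert>
#include <cstdint>

// A contiguous byte range on the logical volume.
struct ByteRange {
    std::uint64_t offset;
    std::uint64_t length;
};

// Expand a range so that it starts and ends on stripe boundaries.
ByteRange alignToStripe(ByteRange io, std::uint64_t stripeBytes) {
    assert(stripeBytes > 0);
    assert(io.length > 0);
    // Round the start down to the previous stripe boundary.
    const std::uint64_t start = io.offset - (io.offset % stripeBytes);
    // Round the end up to the next stripe boundary (if not already on one).
    const std::uint64_t end = io.offset + io.length;
    const std::uint64_t rem = end % stripeBytes;
    const std::uint64_t alignedEnd = (rem == 0) ? end : end + (stripeBytes - rem);
    return {start, alignedEnd - start};
}
```

With an assumed 192 KiB stripe, a 16 KiB request at offset 4096 grows to a full 192 KiB transfer, which is the kind of overhead that would have to be weighed against the RMW cycles it avoids.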

Even if it required user-level configuration (to push the configuration "in" to your card), I wouldn't be surprised if the performance benefit was large enough to justify considerable "extra" work.

This thread is getting long, so if you want to talk about how you can do this kind of "runtime preference" configuration, please kick off a new thread.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hi Kevin,

Thank you for all the help here! I’ve moved the discussion regarding HBA constraints and alignment optimization to a new thread as suggested:

Optimizing SCSI HBA Constraints and Alignment for DriverKit on Apple Silicon

Best Regards,

Charles
