System Panic with IOUserSCSIParallelInterfaceController during Dispatch Queue Configuration

Question

Created Nov ’25

Hello everyone,

We are in the process of migrating a high-performance storage KEXT to DriverKit. During our initial validation phase, we noticed a performance gap between the DEXT and the KEXT, which prompted us to try to optimize our I/O handling process.

Background and Motivation:

Our test hardware is a RAID 0 array of two HDDs. According to AJA System Test, our legacy KEXT achieves a write speed of about 645 MB/s on this hardware, whereas the new DEXT reaches about 565 MB/s. We suspect the primary reason for this performance gap might be that the DEXT, by default, uses a serial work-loop to submit I/O commands, which fails to fully leverage the parallelism of the hardware array.

Therefore, to eliminate this bottleneck and improve performance, we configured a dedicated parallel dispatch queue (MyParallelIOQueue) for the UserProcessParallelTask method.

However, during our implementation attempt, we encountered a critical issue that caused a system-wide crash.

The Operation Causing the Panic:

We configured MyParallelIOQueue using the following combination of methods:

In the .iig file: We appended the QUEUENAME(MyParallelIOQueue) macro after the override keyword of the UserProcessParallelTask method declaration.
In the .cpp file: We manually created a queue with the same name by calling the IODispatchQueue::Create() function within our UserInitializeController method.

The Result:

This results in a macOS kernel panic during the DEXT loading process, forcing the user to perform a hard reboot.

After the reboot, checking with the systemextensionsctl list command reveals the DEXT's status as [activated waiting for user], which indicates that it encountered an unrecoverable, fatal error during its initialization.

Key Code Snippets to Reproduce the Panic:

In .iig file - this was our exact implementation:

class DRV_MAIN_CLASS_NAME: public IOUserSCSIParallelInterfaceController
{
public:
    virtual kern_return_t UserProcessParallelTask(...) override
        QUEUENAME(MyParallelIOQueue);
};

In .h file:

struct DRV_MAIN_CLASS_NAME_IVars {
    // ...
    IODispatchQueue*    MyParallelIOQueue;
};

In UserInitializeController implementation:

kern_return_t
IMPL(DRV_MAIN_CLASS_NAME, UserInitializeController)
{
    // ...
    // We also included code to manually create the queue.
    kern_return_t ret = IODispatchQueue::Create("MyParallelIOQueue",
                                                kIODispatchQueueReentrant,
                                                0,
                                                &ivars->MyParallelIOQueue);
    if (ret != kIOReturnSuccess) {
        // ... error handling ...
    }
    // ...
    return kIOReturnSuccess;
}

Our Question:

What is the officially recommended and most stable method for configuring UserProcessParallelTask_Impl() to use a parallel I/O queue?

Clarifying this is crucial for all developers pursuing high-performance storage solutions with DriverKit. Any explanation or guidance would be greatly appreciated.

Best Regards,

Charles

Answer 1

DTS Engineer OP

Apple

Nov ’25

Recommended

Therefore, to eliminate this bottleneck and improve performance, we configured a dedicated parallel dispatch queue (MyParallelIOQueue) for the UserProcessParallelTask method.

Yeah... that won't work. UserProcessParallelTask is an OSAction target, which is already targeting a queue. I'd be curious to see how the panic() played out*, but I'm not surprised that you panicked.

*I suspect you deadlocked command submission long enough that the SCSI stack gave up and panicked, but that's purely a guess.

That leads to here:

What is the officially recommended and most stable method for configuring UserProcessParallelTask_Impl() to use a parallel I/O queue?

The answer here is to shift to UserProcessBundledParallelTasks. The architecture is a bit more complex, but it allows you to asynchronously receive and complete tasks in parallel while also reusing command and response buffers to minimize wiring cost. Take a look at the header files from SCSIControllerDriverKit for details on how this architecture works.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 2

charles.cc OP

Jan ’26

Hi Kevin,

Thank you for your previous guidance. We have shifted our driver architecture to the UserProcessBundledParallelTasks model with Shared Command/Response Buffers to optimize I/O performance.

However, immediately after switching to Bundled mode we are encountering a persistent Kernel Panic / DEXT Crash (Corpse) that prevents the target discovery and initialization sequence from completing. The legacy mode (UserProcessParallelTask), by contrast, is rock-solid.

Implementation Details:

Memory Mapping: Successfully implemented UserMapBundledParallelTaskCommandAndResponseBuffers. DEXT correctly obtains the virtual addresses for both buffers.
Dispatching: Inside UserProcessBundledParallelTasks, we iterate through the parallelRequestSlotIndices, reading from the shared command buffer and dispatching tasks to the hardware.
Completion: Upon hardware completion, we populate the shared response buffer in the asynchronous path (ISR/Poll) and invoke BundledParallelTaskCompletion with the corresponding OSAction.

The Critical Race Condition Observed:

Due to the very low latency of our hardware for specific commands (e.g., TEST UNIT READY), logs reveal a severe timing conflict:

The interrupt handler (Completion path) is triggered and successfully invokes BundledParallelTaskCompletion, which returns normally.
Crucially, at this exact microsecond, the original UserProcessBundledParallelTasks call (Submission path) has not yet finished its loop or returned to the system.
Immediately after both paths eventually return, the DEXT process crashes (Corpse), subsequently triggering a Kernel Panic.

Troubleshooting Steps Taken (Issue Persists):

Memory Protection: Clamped Sense Data to exactly 18 bytes (SCSI fixed format) to prevent any potential buffer overflows in the shared response buffer.
Array Sizing: Ensured that the index array passed to BundledParallelTaskCompletion is a fixed-size uint16_t[32] to align with the .iig declaration and ensure RPC serialization safety.
Reference Counting: Tested both calling and omitting release() on the OSAction in the completion path. The crash occurs regardless of manual release.
Memory Barriers: Implemented __atomic_thread_fence(__ATOMIC_SEQ_CST) to ensure shared buffer writes are visible before signaling completion.
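
The "clamp the sense data, write the response, fence, then signal" ordering from the steps above can be sketched in portable C++. This is a stand-alone illustration rather than DriverKit code: ResponseSlot, PublishResponse, and TryConsumeResponse are hypothetical names, and a std::atomic flag stands in for the real completion signal.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr size_t kMaxSenseBytes = 18;  // SCSI fixed-format sense data size

// Hypothetical stand-in for one entry of a shared response buffer.
struct ResponseSlot {
    uint8_t senseData[kMaxSenseBytes] = {};
    uint8_t senseLength = 0;
    uint32_t status = 0;
    std::atomic<bool> ready{false};    // stands in for the completion signal
};

// Producer side: fill the slot first, publish it last.
inline void PublishResponse(ResponseSlot &slot, uint32_t status,
                            const uint8_t *sense, size_t senseLen)
{
    // Clamp so an oversized sense payload can never overrun the slot.
    const size_t n = senseLen < kMaxSenseBytes ? senseLen : kMaxSenseBytes;
    if (sense != nullptr && n > 0) {
        std::memcpy(slot.senseData, sense, n);
    }
    slot.senseLength = static_cast<uint8_t>(n);
    slot.status = status;

    // Every write above must be visible before the consumer can see ready == true.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    slot.ready.store(true, std::memory_order_release);
}

// Consumer side: reads the status only after the slot has been published.
inline bool TryConsumeResponse(ResponseSlot &slot, uint32_t &statusOut)
{
    if (!slot.ready.load(std::memory_order_acquire)) {
        return false;
    }
    statusOut = slot.status;
    return true;
}
```

The sketch only shows that the fence has to sit between the last buffer write and the signal; it says nothing about whether ordering is the cause of the crash described here.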

Questions:

In Bundled mode, does DriverKit support a scenario where the Completion RPC returns before the Submission RPC has finished its execution? Does this cause an internal state-machine conflict within IOUserSCSIParallelInterfaceController?
What is the exact ownership model for the OSAction * action in Bundled mode? Is the action "consumed" by BundledParallelTaskCompletion, or is the driver still responsible for retain/release management as in legacy mode?

Any advice on how to resolve this would be very helpful. Thanks!

Best Regards,

Charles

Answer 3

DTS Engineer OP

Apple

Jan ’26

Due to the very low latency of our hardware for specific commands (e.g., TEST UNIT READY), logs reveal a severe timing conflict:

The interrupt handler (Completion path) is triggered and successfully invokes BundledParallelTaskCompletion, which returns normally.

Crucially, at this exact microsecond, the original UserProcessBundledParallelTasks call (Submission path) has not yet finished its loop or returned to the system.

Immediately after both paths eventually return, the DEXT process crashes (Corpse), subsequently triggering a Kernel Panic.

Something doesn't sound right here. There are actually a few different things that concern me:

(1)
There's a reasonably long "cycle" to "get" to your interrupt handler from the kernel. I'm skeptical that there's any way for an interrupt to fire before UserProcessBundledParallelTasks finishes running unless UserProcessBundledParallelTasks is doing "something" that substantially delays its own process.

(2)
From the kernel side, how bundled commands actually "work" is that ProcessParallelTask claims a slot for each command, preps the command, then passes that command over to a secondary dispatch "engine" which actually ends up calling into UserProcessBundledParallelTasks(). This lets multiple commands be enqueued in the time gap that delays commands reaching your driver, which is where most of the performance benefit comes from.

However, it also means that the kernel driver is always "done" with any given command WELL before it ever reaches your DEXT. In theory, I think you could immediately complete any given command, directly inside UserProcessBundledParallelTasks, since all of the "bookkeeping" (which UserCompleteBundledParallelTask will manipulate) was actually done BEFORE UserProcessBundledParallelTasks() was called at all.

(3)
In the standard configuration, I believe UserProcessBundledParallelTasks and UserCompleteBundledParallelTask implicitly target the default dispatch queue, so they can't actually be called at the "same" time. Note that this is NOT in fact a meaningful performance bottleneck, as the implementation of both methods is simple enough that they shouldn't have any meaningful effect on each other.
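
The flow described in (2) can be pictured with a toy model. It only illustrates the description above and is not the kernel's implementation: BundleModel, Claim, NextBundle, and Complete are hypothetical names, the slot count of 64 is an assumption, and the bundle limit of 32 is taken from the uint16_t[32] index array mentioned earlier in the thread.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

constexpr size_t kSlotCount = 64;   // assumption: number of shared command slots
constexpr size_t kMaxBundle = 32;   // matches the uint16_t[32] index array

class BundleModel {
public:
    // Step 1: claim a free slot and finish all bookkeeping for one command.
    std::optional<uint16_t> Claim() {
        for (uint16_t i = 0; i < kSlotCount; ++i) {
            if (!inUse_[i]) {
                inUse_[i] = true;
                pending_.push_back(i);   // queued until the engine runs
                return i;
            }
        }
        return std::nullopt;             // every slot is busy
    }

    // Step 2: the dispatch "engine" hands everything queued so far
    // (up to the bundle limit) to the driver in a single call.
    std::vector<uint16_t> NextBundle() {
        std::vector<uint16_t> bundle;
        while (!pending_.empty() && bundle.size() < kMaxBundle) {
            bundle.push_back(pending_.front());
            pending_.pop_front();
        }
        return bundle;
    }

    // Step 3: completing a command frees its slot for reuse.
    void Complete(uint16_t slot) {
        if (slot < kSlotCount) {
            inUse_[slot] = false;
        }
    }

private:
    std::array<bool, kSlotCount> inUse_{};
    std::deque<uint16_t> pending_;
};
```

In this model a command's bookkeeping is finished inside Claim(), before NextBundle() ever exposes it to the driver, which is why completing a command from inside the submission call is harmless here.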

That leads to here:

Immediately after both paths eventually return, the DEXT process crashes (Corpse), subsequently triggering a Kernel Panic.

Crashes how? Do you have a crash log for your DEXT? Or is it present in the kernel panic thread capture?

FYI, there's a forum post here and then here that outlines how to fully symbolicate a kernel panic, including all system threads. If your DEXT is still "live", that would show you what it was doing.

What is the exact ownership model for the OSAction * action in Bundled mode? Is the action "consumed" by BundledParallelTaskCompletion, or is the driver still responsible for retain/release management as in legacy mode?

Our own driver implements UserProcessBundledParallelTasks() by building a "local" SCSIUserParallelTask object for each incoming command, then calling into its existing "UserProcessParallelTask" method to handle the actual commands, so it ends up retaining/releasing the completion once for every command that comes in.

Having said that, if you check the actual value you're receiving, I think you'll find that the actual "byte" value you're receiving is always the same, because it's actually a single action that's being reused for every command, so I'm not sure it really matters all that much.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 4

charles.cc OP

Jan ’26

Hi Kevin,

Thank you for your valuable insights. Following your advice, we have refactored our driver to use a "wrapper" pattern where UserProcessBundledParallelTasks serves as a high-performance entry point that forwards commands to our core dispatch logic.

To eliminate potential race conditions, we have also moved our interrupt handling to the Default Dispatch Queue using kIOServiceDefaultQueueName. This ensures that command submission and completion are strictly serialized.

Here are the key implementation details showing how we unified the dispatch logic:

1. Legacy Entry Point (Single Task):

We extracted our core logic into a helper method, DispatchTaskInternal, passing 0xFFFF as a placeholder for the Slot Index.

kern_return_t MyDriver::UserProcessParallelTask_Impl(
    SCSIUserParallelTask parallelRequest,
    uint32_t *response,
    OSAction *completion)
{
    // Forward to unified internal dispatcher with no slot index (Legacy Mode)
    return DispatchTaskInternal(parallelRequest, response, completion, 0xFFFF);
}

2. Bundled Submission Path:

In our UserProcessBundledParallelTasks implementation, we iterate through the indices, create a local object copy of the request (as you suggested), and call the same internal dispatcher.

void MyDriver::UserProcessBundledParallelTasks_Impl(
    const uint16_t parallelRequestSlotIndices[32],
    uint16_t parallelRequestSlotIndicesCount,
    OSAction * completion)
{
    for (uint16_t i = 0; i < parallelRequestSlotIndicesCount; i++) {
        uint16_t slotIndex = parallelRequestSlotIndices[i];
        const SCSIUserParallelTask& sharedReq = ivars->fCommandBuffers[slotIndex];
        
        // Create a local copy to utilize existing processing logic
        SCSIUserParallelTask localReq = sharedReq;
        uint32_t response = 0;  // defined value even if the dispatcher never writes it
        
        // DispatchTaskInternal performs a completion->retain() for EVERY command,
        // ensuring the action stays alive regardless of the bundled count.
        DispatchTaskInternal(localReq, &response, completion, slotIndex);
    }
}

3. Unified Asynchronous Completion Path:

Regardless of the mode, we now use the stable ParallelTaskCompletion within the interrupt handler (ISR), followed by the corresponding release().

// Inside the Asynchronous Interrupt Handler (serialized on Default Queue)
SCSIUserParallelResponse taskResponse = {};
// ... Populate status, TaskID, and Version ...

// Signal completion to the framework
ParallelTaskCompletion(completion, taskResponse);

// Balance the reference count performed during submission
completion->release();

Observations and Current Bottleneck:

If we intercept a command (like TEST UNIT READY) and call the completion logic "synchronously" within the UserProcessBundledParallelTasks scope, the system works perfectly and immediately issues the next command (e.g., INQUIRY).

As soon as the completion is triggered from the asynchronous interrupt path—even though it is serialized on the same Default Queue—invoking ParallelTaskCompletion (or BundledParallelTaskCompletion) causes the DEXT to crash immediately (Corpse / Address size fault).

Questions:

Since we have ensured an independent retain/release cycle for every command and used a serialized queue, why does the OSAction handle appear to become invalid once it enters an asynchronous callback context in Bundled mode? Is there a hidden thread-context or lifecycle restriction on the OSAction provided during a bundled call that differs from legacy mode?

We have verified that the legacy mode remains 100% stable under the same serialized queue configuration. We would appreciate any further guidance on why the "Sync succeeds, Async crashes" behavior persists in the high-performance path.

Best Regards,

Charles

Answer 5

DTS Engineer OP

Apple

Jan ’26

To eliminate potential race conditions, we have also moved our interrupt handling to the Default Dispatch Queue using kIOServiceDefaultQueueName.

I'm not sure you want to do this, as I think it might end up causing BundledParallelTaskCompletion() to deadlock itself.

As soon as the completion is triggered from the asynchronous interrupt path—even though it is serialized on the same Default Queue—invoking ParallelTaskCompletion (or BundledParallelTaskCompletion) causes the DEXT to crash immediately (Corpse / Address size fault).

Can you post the crash log for this?

We extracted our core logic into a helper method, DispatchTaskInternal, passing 0xFFFF as a placeholder for the Slot Index.

Is your "DispatchTaskInternal" method just a simple helper method/function or is it actually running on a different queue? If it's on a different queue, then I would strongly suggest that you eliminate that entirely and just do "everything" on the same thread.

The issue here is that, in practice, most of what your driver actually "does" takes VERY little time (short enough intervals that even measuring it is somewhat difficult), which can easily create a situation where the process of moving work between threads is actually taking FAR more time than the work itself. The net result is both slower and more complicated, without any real benefit.

In our UserProcessBundledParallelTasks implementation, we iterate through the indices, create a local object copy of the request (as you suggested), and call the same internal dispatcher.

That all looks reasonable.

Regardless of the mode, we now use the stable ParallelTaskCompletion within the interrupt handler (ISR).

That's incorrect. You MUST use BundledParallelTaskCompletion with the bundled I/O path. Note that, just like work arrives in "clusters", it's expected that your interrupt handler will be able to process multiple commands every time it fires, so that BundledParallelTaskCompletion can clear multiple slots at a time.
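
As a rough sketch of "clear multiple slots at a time", an interrupt handler could drain every finished slot into batches and make one completion call per batch. The code below is a stand-alone model, not the SCSIControllerDriverKit API: DrainCompletions and CompleteBatchFn are hypothetical names, the callback merely stands in for the bundled completion call, and the limit of 32 comes from the uint16_t[32] index array mentioned earlier in the thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>

constexpr size_t kMaxBundle = 32;  // matches the uint16_t[32] index array

// Hypothetical callback standing in for the bundled completion call.
using CompleteBatchFn =
    std::function<void(const uint16_t *slots, uint16_t count)>;

// Drains every slot the hardware reported as finished and completes them
// in batches. Returns the number of completion calls that were made.
inline size_t DrainCompletions(std::deque<uint16_t> &finishedSlots,
                               const CompleteBatchFn &completeBatch)
{
    size_t calls = 0;
    while (!finishedSlots.empty()) {
        uint16_t batch[kMaxBundle];
        uint16_t count = 0;
        // Gather as many finished slots as fit into one batch.
        while (!finishedSlots.empty() && count < kMaxBundle) {
            batch[count++] = finishedSlots.front();
            finishedSlots.pop_front();
        }
        completeBatch(batch, count);   // one call clears `count` slots
        ++calls;
    }
    return calls;
}
```

With 35 finished commands, this model makes two completion calls (32 + 3) instead of 35.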

As soon as the completion is triggered from the asynchronous interrupt path—even though it is serialized on the same Default Queue—invoking ParallelTaskCompletion (or BundledParallelTaskCompletion) causes the DEXT to crash immediately (Corpse / Address size fault).

So, I spent some more time looking at our driver that implements BundledParallelTaskCompletion and I'm increasingly convinced that it's unintentionally over-retaining the action handler. So, three requests/suggestions:

1. Please file a bug asking us to clarify how BundledParallelTaskCompletion should be handled and post the bug number back here.
2. Check the value you're receiving for "completion" in UserProcessBundledParallelTasks(). I think you'll find that you're always receiving the same value, but I'd like to confirm that.
3. For basic testing purposes, try over-retaining completion and retest. If you want to maintain your existing retain/release code, just add an extra retain at the start of UserProcessBundledParallelTasks().

That should sort out whether or not the management of action itself is directly involved here.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 6

charles.cc OP

Jan ’26

Hi Kevin,

Thank you for your detailed feedback and for spending time looking at your internal driver implementation. I have followed your suggestions and filed a formal bug report.

1. Bug Report Filed

I have filed a bug report via Feedback Assistant.

Feedback ID: FB21636775

I have attached the original and symbolicated crash logs, along with reduced code snippets showing our implementation.

2. OSAction Pointer Confirmation

I have verified the value of the completion pointer received in UserProcessBundledParallelTasks. As you suspected, the pointer address is identical for all commands within a single bundle.

3. Crash Log Insights

The symbolicated crash log confirms that the panic is triggered by an __assert_rtn inside OSMetaClassBase::QueueForObject during the call to completion.

Specifically, when we attempted the "Unified Path" (using legacy ParallelTaskCompletion for bundled commands) as a stability test, it triggered an immediate panic, which confirms your point that we MUST use BundledParallelTaskCompletion for the bundled path.

Crashed Thread Backtrace (Summary):

Thread 4 Crashed:: Dispatch queue: DriverKitAcxxx-Default
...
4   DriverKit                     0x19e255fb8 __assert_rtn + 88
5   DriverKit                     0x19e25629c OSMetaClassBase::QueueForObject(...)
6   DriverKit                     0x19e2267bc OSMetaClassBase::QueueForObject(...)
7   DriverKit                     0x19e226fac OSMetaClassBase::Invoke(IORPC) + 476
8   SCSIControllerDriverKit       0x19e373364 IOUserSCSIParallelInterfaceController::ParallelTaskCompletion(...)
9   <BundleID>                0x1003156a4 MyDriver::InterruptHandler(...)

Next Steps:

I am now proceeding with the "over-retaining" experiment you suggested (adding an extra retain at the start of the bundled submission loop) to see if it resolves the asynchronous crash when using BundledParallelTaskCompletion. I will update you with those results shortly.

Best Regards,

Charles

Answer 7

charles.cc OP

Jan ’26

Hi Kevin,

Thank you for the suggestion. We have performed the "over-retaining" experiment by adding an extra retain() to the OSAction at the start of the UserProcessBundledParallelTasks loop.

Unfortunately, the result remains the same: the DEXT still triggers a Corpse crash immediately upon calling the completion signal in the asynchronous path.

As the legacy mode (UserProcessParallelTask) remains 100% stable under our newly serialized queue configuration, we have decided to revert to the legacy path for now to continue our product development while we wait for investigation on the bug report (FB21636775).

Any further insights into why the asynchronous completion fails specifically in Bundled mode would be greatly appreciated.

Best Regards,

Charles

Answer 8

DTS Engineer OP

Apple

Jan ’26

Bug Report Filed

I have filed a bug report via Feedback Assistant.

Clarifying things, what I was specifically looking for here was a bug asking us to document exactly how the action for BundledParallelTaskCompletion should be managed. The issue here is that its usage semantics are somewhat... weird. You're given a single action pointer, but you're expected to use that action pointer multiple times (since individual commands won't complete at the same time), which makes its lifetime somewhat... odd.

In practice, I don't think this actually matters. As you've noted, you always receive the same pointer so as long as you don't release it too many times everything will be fine. For "maximum" correctness, you could do something like this:

1. Retain it once (or a few times) the first time you receive it in UserProcessBundledParallelTasks().
2. Release it at some late point in the teardown process when you "know" all I/O is "done".
3. Add an assert in UserProcessBundledParallelTasks() to validate that it never changes (which would be a major shift in behavior).

...but, again, EXACTLY how you do this doesn't actually matter all that much, as long as you don't over release it. For example, if you don't do #2 then the system won't "technically" release the action... which just means it'll be destroyed slightly later when your process is destroyed at the end of DEXT teardown.

In other words, we need to document what you "should" do, but that doesn't mean this is the source of your actual problem.
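
The retain-once / assert-unchanged / release-at-teardown pattern above might look roughly like this. It is a minimal sketch built on a mock reference-counted type, not on OSAction itself: MockAction and BundledActionHolder are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a reference-counted action object.
struct MockAction {
    int32_t refs = 1;
    void retain()  { ++refs; }
    void release() { --refs; }
};

class BundledActionHolder {
public:
    // Call at the top of every bundled submission.
    // Retains the action the first time it is seen; afterwards it only
    // checks that the same pointer keeps arriving.
    // Returns false for a null or unexpectedly different pointer.
    bool Observe(MockAction *action) {
        if (action == nullptr) {
            return false;
        }
        if (held_ == nullptr) {
            action->retain();     // the single long-lived retain
            held_ = action;
            return true;
        }
        return held_ == action;   // must never change
    }

    MockAction *Get() const { return held_; }

    // Call late in teardown, once all I/O is known to be done.
    void Teardown() {
        if (held_ != nullptr) {
            held_->release();     // balances the retain in Observe()
            held_ = nullptr;
        }
    }

private:
    MockAction *held_ = nullptr;
};
```

A submission path would then start with something like assert(holder.Observe(completion)); and the completion path would use holder.Get() without any per-command retain or release.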

Shifting to the bug you did file:

Feedback ID: FB21636775

All the crash logs in that bug show you calling ParallelTaskCompletion, not BundledParallelTaskCompletion, which is simply wrong. Please upload crash logs showing the failure when you call BundledParallelTaskCompletion.

That leads to here:

Specifically, when we attempted the "Unified Path" (using legacy ParallelTaskCompletion for bundled commands),

I think you read more than I intended into my original post. My point there was not that you can/should just reuse all of your same code, it was specifically noting that it was probably possible to reuse your UserProcessParallelTask method to process each of the commands you receive through UserProcessBundledParallelTasks. That does NOT mean that the entire I/O chain is identical, only that the initial process happens to be the same/very similar.

Which leads to here:

Any further insights into why the asynchronous completion fails specifically in Bundled mode would be greatly appreciated.

I can't be certain of this, but the backtrace you're seeing seems like the kind of failure that would happen if you passed an action of one type ("BundledParallelTaskCompletion") into a different completion caller ("ParallelTaskCompletion").

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 9

charles.cc OP

Jan ’26

Hi Kevin,

Thank you for your candid feedback. You are absolutely correct—the crash log I previously shared was from an experimental attempt where we mistakenly used the legacy ParallelTaskCompletion API for bundled commands. I apologize for the confusion.

I have now strictly followed your guidance, and here is the latest update:

1. Corrected Bug Report & Logs

I have updated the Bug Report (FB21636775) with a new symbolicated crash log.

This log definitively captures the failure while invoking the correct BundledParallelTaskCompletion API (as shown in Frame 8 of the trace).

Even with the correct API, the system still triggers an __assert_rtn followed by an Address size fault (ESR: 0x56000080).

2. Over-retaining Experiment Results (Hard Reset)

I attempted the "over-retaining" experiment you suggested (adding an extra retain() at the start of the loop and omitting release() in the ISR).

The result was critical: it triggered an immediate Kernel Panic / Hard Reset every time.

Because the system resets so rapidly, no .ips crash log is generated for this specific test case, and NVRAM is cleared. This is a much more severe failure compared to the DEXT-only crash we saw previously.

3. The "Sync vs. Async" Behavior

A key observation remains: if we invoke BundledParallelTaskCompletion synchronously within the UserProcessBundledParallelTasks submission loop (intercepting the TUR command), it works perfectly and the kernel immediately issues the next command.

The failure only occurs when the completion is triggered from the asynchronous interrupt handler, despite the fact that our queues are now serialized on the Default Dispatch Queue.

We have reverted to the Legacy path for now to maintain stability for other development tasks while we wait for your further guidance on these findings.

Best Regards,

Charles

Answer 10

DTS Engineer OP

Apple

Jan ’26

I have updated the Bug Report (FB21636775) with a new symbolicated crash log.

OK. I have a totally new theory. Is it possible you're having an interrupt fire before you've completed full initialization, so you end up shuffling uninitialized data into the system? The crash log shows that you're in "IOUserSCSIParallelInterfaceController::UserCreateTargetForID", so this is still very early in the startup process.

I ask this, because an issue like this would explain both of these:

(1)

I attempted the "over-retaining" experiment you suggested (adding an extra retain() at the start of the loop and omitting release() in the ISR).

Over retaining a valid object shouldn't actually "do" anything "meaningful", as all you're doing is incrementing a simple integer counter. Eventually you might cause the counter to overrun, but that's not something that would happen "immediately". It certainly shouldn't cause a kernel panic...

The result was critical: it triggered an immediate Kernel Panic / Hard Reset every time.

...unless the thing you called "retain" on WASN'T a valid object and/or wasn't the object you thought it was, at which point you're now taking a long walk off a short vtable.

(2)

A key observation remains: if we invoke BundledParallelTaskCompletion synchronously within the UserProcessBundledParallelTasks submission loop (intercepting the TUR command), it works perfectly and the kernel immediately issues the next command.

This is the one that really got my attention. The difference here isn't just that it's synchronous, it's that having the call originate from UserProcessBundledParallelTasks means that, by definition, you'll have completed initialization and have a valid action object, since UserProcessBundledParallelTasks is where you GET that object in the first place.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 11

charles.cc OP

Jan ’26

Hi Kevin,

Thank you for your deep insight. Your theory matches our observed facts perfectly!

I have just updated FB21636775 with the latest symbolicated crash log captured today (Jan 19).

1. Evidence of Concurrent Execution Conflict

From the symbolicated backtrace, it is clear that at the moment of the crash:

Thread 1 (AuxiliaryQueue): Is in the middle of executing UserCreateTargetForID (Frame 12). This RPC call has not yet returned to our DEXT.
Thread 4 (Crashed Thread): Has already received the hardware interrupt for the first command (TEST UNIT READY) and is attempting to invoke the completion API.

2. Crash Characteristic Analysis

The system encountered an Address size fault (Null dereference) at address 0x0000000000000008. This confirms your deduction: because UserCreateTargetForID is still pending on the AuxiliaryQueue and has not returned, the target-related objects or OSAction metadata in the kernel are not yet fully initialized. Attempting to invoke the action from an asynchronous thread leads to invalid memory access.

This also explains why our "synchronous test" succeeded: within the scope of UserProcessBundledParallelTasks, the system ensures that the target is already fully established and the action pointer is valid.

Questions:

In the DriverKit framework, what is the recommended best practice for handling this specific race where hardware completion arrives before the target creation RPC returns? Should the kernel allow probe commands to be dispatched to hardware before UserCreateTargetForID has officially returned success to the DEXT?

We look forward to your guidance, as resolving this is critical for the system stability of our high-performance path.

Best Regards,

Charles

Answer 12

DTS Engineer OP

Apple

Jan ’26

So, I actually think this is what you need to take a much closer look at:

Thread 4 (Crashed Thread): Has already received the hardware interrupt for the first command (TEST UNIT READY) and is attempting to invoke the completion API.

More specifically, why EXACTLY did that hardware interrupt occur? The assumption you seem to be making is that the system called "UserProcessBundledParallelTasks" and then wasn't "ready" to process the completion when your interrupt fired. However, I think what actually happened is slightly different and that "UserProcessBundledParallelTasks" wasn't actually called at all, and that the interrupt handler fired for other reasons.

That leads to my comment here:

This also explains why our "synchronous test" succeeded: within the scope of UserProcessBundledParallelTasks, the system ensures that the target is already fully established and the action pointer is valid.

Strictly speaking, that's true but not in the way you're thinking. What's actually going on here is that the system first does all of its "work", THEN calls "UserProcessBundledParallelTasks". There's nothing specifically magic/special about being "inside" UserProcessBundledParallelTasks, it simply guarantees you're "after" the point where the system was "done" with the task.

Similarly, the analysis here:

The system encountered an Address size fault (Null dereference) at address 0x0000000000000008. This confirms your deduction: because UserCreateTargetForID is still pending on the AuxiliaryQueue and has not returned, the target-related objects or OSAction metadata in the kernel are not yet fully initialized.

This isn't about kernel-level initialization, it's about your OWN structure's initialization. If you count the bytes of the structure you're using to track this data, I think you'll find that your OSAction is located 8 bytes into that structure, meaning the structure pointer itself was "null".
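
To make the "8 bytes into that structure" reasoning concrete: reading a member through a null structure pointer touches the address equal to that member's offset. The layout below is purely hypothetical (the real tracking structure is not shown in this thread); TaskContext and FaultAddressForNullContext are illustrative names.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical per-command tracking record.
struct TaskContext {
    uint64_t taskID;       // offset 0
    void    *completion;   // offset 8: the field read at completion time
};

// Address that a load of ctx->completion would touch if ctx were null:
// 0 (the null base) plus the member's offset within the structure.
constexpr uintptr_t FaultAddressForNullContext() {
    return static_cast<uintptr_t>(offsetof(TaskContext, completion));
}
```

With this layout the computed address is 0x8, the same value as the fault address in the crash log, which is what points at a null structure pointer rather than a bad action object.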

In the DriverKit framework, what is the recommended best practice for handling this specific race where hardware completion arrives before the target creation RPC returns?

So, a few different points here:

1. I don't think this I/O request is coming from UserProcessBundledParallelTasks(...). I don't know where it IS coming from, but I don't think it's coming from UserProcessBundledParallelTasks(...).
2. By definition, you can't call BundledParallelTaskCompletion on work that didn't arrive through UserProcessBundledParallelTasks(), since you can't/won't have the necessary "data" to process the request.
3. Looking at our driver as a reference, it uses a boolean to enable/disable bundled I/O. It disables bundled I/O in "start", then enables it once it has finished all of its work and is returning from UserMapBundledParallelTaskCommandAndResponseBuffers(...). I think this ends up providing the same "guarantee" as #2 above since, by definition, we can't call UserProcessBundledParallelTasks until we've called UserMapBundledParallelTaskCommandAndResponseBuffers. Before that point (that is, before bundled I/O is enabled), it completes commands through the standard ParallelTaskCompletion().

My overall point here is that the correct way to handle a given I/O request depends on how that I/O arrived. If the I/O request that's failing arrived through UserProcessBundledParallelTasks, then that's a serious issue that I'd need to take a much closer look at. However, I don't think that's what actually happened. I think what actually happened is that the I/O arrived through a different route (possibly UserProcessParallelTask, possibly something your DEXT "directly" did) and you're incorrectly completing it through BundledParallelTaskCompletion().
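One way to picture "complete it the way it arrived" is to record the arrival path on each command and branch on it at completion time. The sketch below is plain C++ with hypothetical names (`ArrivalPath`, `PendingCommand`, `CompletionCounters`); the counters only stand in for the real ParallelTaskCompletion() and BundledParallelTaskCompletion() calls, which are not invoked here:

```cpp
#include <cstdint>

// Hypothetical record of how a command reached the driver.
enum class ArrivalPath { Legacy, Bundled };

// Hypothetical stand-ins for the two completion calls; they only count.
struct CompletionCounters {
    int legacyCompletions  = 0;  // would be ParallelTaskCompletion()
    int bundledCompletions = 0;  // would be BundledParallelTaskCompletion()
};

struct PendingCommand {
    uint64_t    taskIdentifier;
    ArrivalPath path;  // recorded when the command is received
};

// Complete each command through the path it arrived on.
inline void CompleteCommand(const PendingCommand &cmd, CompletionCounters &out)
{
    switch (cmd.path) {
        case ArrivalPath::Bundled: ++out.bundledCompletions; break;
        case ArrivalPath::Legacy:  ++out.legacyCompletions;  break;
    }
}
```

The point of the design is that the interrupt handler never has to guess: the path is decided once, at receipt, and carried with the command.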

Finally, on this point:

Should the kernel allow probe commands to be dispatched to hardware before UserCreateTargetForID has officially returned success to the DEXT?

I believe that it's actually guaranteed that it WILL send commands before UserCreateTargetForID() returns. The documentation for UserCreateTargetForID covers the serialization issues between UserCreateTargetForID and UserInitializeTargetForID:

"As part of the UserCreateTargetForID call, the kernel calls several APIs like UserInitializeTargetForID which run on the default dispatch queue of the device. Synchronously calling UserCreateTargetForID from the default dispatch queue blocks the default dispatch queue until UserCreateTargetForID finishes. Subsequent calls from the kernel like UserInitializeTargetForID won’t have a chance to run on the default queue, leading to a deadlock."

And UserInitializeTargetForID says:

"The host bus adapter (HBA) can use this method to probe the target or do anything else necessary before IOKit registers the device object for matching."

...which would presumably involve some amount of I/O.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 13

charles.cc OP

Feb ’26

Hi Kevin,

During the implementation of batch task buffer mapping, we observed an inconsistency between API return values and actual memory addresses, which leads to an immediate system Panic (Kernel Data Abort).

Inside the implementation of UserMapBundledParallelTaskCommandAndResponseBuffers, we performed the following operations on the Response Buffer:

We invoked CreateMapping(0, 0, 0, 0, 0, &ivars->fResponseMap) on the provided parallelResponseIOMemoryDescriptor.
CreateMapping returns kIOReturnSuccess, and the resulting ivars->fResponseMap object pointer is non-null.
However, a subsequent call to ivars->fResponseMap->GetAddress() returns NULL (0x0).

After the DEXT returns from UserMapBundledParallelTaskCommandAndResponseBuffers with this NULL address, the system immediately triggers a Panic when the kernel attempts to process subsequent discovery commands (e.g., Inquiry).

Panic Type: Kernel data abort
Fault Address (FAR): 0x0000000000000000
Exception Class (ESR): 0x96000006 (Translation Fault, Level 2)
Faulting Module: com.apple.iokit.IOSCSIArchitectureModelFamily
Impact: The system enters a reboot loop until the hardware link is physically disconnected.

Questions

Recommended Handling for Success with NULL Address: What is the correct error-handling mechanism when CreateMapping succeeds but GetAddress() returns NULL? Does this indicate a specific alignment restriction, memory fragmentation, or a resource conflict on Apple Silicon?
Synchronization for Memory Visibility: To ensure that the virtual address mapping established in UserMap... is globally visible across all CPU cores and execution contexts before the kernel issues the first Bundled Task, are there recommended synchronization primitives or Memory Barrier calls?
OSAction Reference Counting in Bundled Mode: Since multiple task slots share a single OSAction in Bundled Mode, should the DEXT perform manual retain/release operations on this shared action to align with the framework's lifecycle management?
Implementation Strategy for Discovery Transition: Is there a recommended pattern for transitioning from the Discovery Phase (Inquiry/Report LUNs) to Bundled Mode? If the DEXT returns an error during UserMap... due to addresses not being ready, what is the impact on the possibility of re-entering Bundled Mode later?

Best Regards,

Charles

Attached Supporting Data

Please find the complete Panic Log attached for your review.

Attachment: FB21636775_Panic_Report_Verbose_Mode.txt


Answer 14

DTS Engineer OP

Apple

Feb ’26

Part 1...

However, a subsequent call to ivars->fResponseMap->GetAddress() returns NULL (0x0).

Subsequent call "when"? The expectation (and what our driver does) is that you'd immediately call GetAddress() and then basically "never" look at the map again. In one of our drivers, that "never" is quite literal. The code is basically:


IOMemoryMap               *memoryMap      = NULL;
if ( parallelCommandIOMemoryDescriptor->CreateMapping(0, 0, 0, 0, 0, &memoryMap) == kIOReturnSuccess ) {
	ivars->fCommandAddress = memoryMap->GetAddress();
}
...
if ( parallelResponseIOMemoryDescriptor->CreateMapping(0, 0, 0, 0, 0, &memoryMap) == kIOReturnSuccess ) {
	ivars->fParallelResponseAddress = memoryMap->GetAddress();
}

...and, yes, that code leaks two IOMemoryMaps. I don't know what the exact thinking was, but I suspect they realized that the only reason that mapping would ever become invalid/useless was because the driver was being torn down, so freeing doesn't really matter. I'll admit, there is a certain charm to that, in that you don't need to figure out the "right" place to free memory if you never free it.

Following on from there, just in case this wasn't clear, both of those shared buffers are basically accessed as shared arrays of structs. So the look up for the command request looks like:

parallelRequest = ( SCSIUserParallelTask * ) ( ( uint8_t * ) ( ivars->fCommandAddress ) + sizeof( SCSIUserParallelTask ) * slotIndex );

In the case of fCommandAddress, it's only used in two places: once in UserMapBundledParallelTaskCommandAndResponseBuffers when it's mapped, and again in UserProcessBundledParallelTasks when it's used to retrieve the new task. fParallelResponseAddress works exactly the same way, though you may end up using it in a few more places beyond direct command completion (for example, if you’re flushing commands as part of handling sleep).
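As a self-contained illustration of that "array of structs" lookup, here is a sketch that does the same pointer arithmetic outside DriverKit. `FakeTaskSlot` is a hypothetical stand-in for SCSIUserParallelTask (its size and fields are made up), and the base address is assumed to be stored as a 64-bit integer:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size slot; stands in for SCSIUserParallelTask.
struct FakeTaskSlot {
    uint64_t taskIdentifier;
    uint8_t  payload[56];
};

// Return the address of slot `slotIndex` in a buffer laid out as a packed
// array of FakeTaskSlot that starts at `baseAddress`.
inline FakeTaskSlot *SlotAt(uint64_t baseAddress, uint16_t slotIndex)
{
    return reinterpret_cast<FakeTaskSlot *>(
        reinterpret_cast<uint8_t *>(static_cast<uintptr_t>(baseAddress))
        + sizeof(FakeTaskSlot) * slotIndex);
}
```

The same helper shape works for the response buffer; only the slot type changes.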

Moving to here:

Recommended Handling for Success with NULL Address: What is the correct error-handling mechanism when CreateMapping succeeds but GetAddress() returns NULL?

Do you call "GetAddress" inside UserMapBundledParallelTaskCommandAndResponseBuffers? I'm not sure of exactly what's going on, but I suspect it's related to this comment in IOMemoryMap:

"An IOMemoryMap object doesn’t own the memory it references, and you must not attempt to free that memory."

And this comment from the IOMemoryMap.iig header:

* To allocate memory for I/O or sharing, use IOBufferMemoryDescriptor::Create()
* Methods in this class are used for memory that was supplied as a parameter.

I think what's actually going on here is that the underlying IOMemoryDescriptor you receive in the argument list is what ultimately "owns" the memory you received, and the handling of "your" memory is actually tied to that object. The problem here is that you "lost" access to that object when you returned from UserMapBundledParallelTaskCommandAndResponseBuffers without retaining either of the descriptors you received. When those objects were destroyed, that also wiped out your IOMemoryMap (this may also be why our code retained them).

Summarizing all that:

If you want to formally "hold" memory, you need to hold on to both the IOMemoryDescriptor AND the IOMemoryMap. Just holding the IOMemoryMap won't necessarily "work".
In this PARTICULAR case, I'd ignore this whole issue by calling GetAddress once and using it forever.

Expanding on that last point, the access pattern of the bundled I/O system means you end up entangled with the kernel. That is, the kernel COULD invalidate the buffer at any time, which, theoretically, would crash your driver. However, it can't/won't do that until it "knows" it's not going to send you any more I/O. Similarly, your driver accessing that buffer after I/O ended would crash your driver... except you already promised not to do that:

https://developer.apple.com/documentation/scsicontrollerdriverkit/iouserscsiparallelinterfacecontroller/usermapbundledparalleltaskcommandandresponsebuffers

The framework owns the shared buffer slots for both commands and responses until it passes ownership to the dext in the UserProcessBundledParallelTasks call. From that point, the dext has ownership of these buffer slots until it returns ownership back to the framework in BundledParallelTaskCompletion. Don’t access a command or response buffer slot until the framework passes ownership to your dext.
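The ownership hand-off described in that documentation can be modeled as a tiny per-slot state machine. This is a debugging aid sketch only: `SlotLedger`, `SlotOwner`, and the slot count of 64 are hypothetical, and nothing here calls DriverKit.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

enum class SlotOwner : uint8_t { Framework, Dext };

constexpr size_t kSlotCount = 64;  // hypothetical; use your real slot count

class SlotLedger {
public:
    // Call for each index received in UserProcessBundledParallelTasks.
    bool TakeOwnership(uint16_t slot) {
        if (slot >= kSlotCount || owners_[slot] != SlotOwner::Framework) return false;
        owners_[slot] = SlotOwner::Dext;
        return true;
    }
    // Call right before handing the slot back via BundledParallelTaskCompletion.
    bool ReturnOwnership(uint16_t slot) {
        if (slot >= kSlotCount || owners_[slot] != SlotOwner::Dext) return false;
        owners_[slot] = SlotOwner::Framework;
        return true;
    }
    // True only while the dext is allowed to touch the slot's buffers.
    bool MayAccess(uint16_t slot) const {
        return slot < kSlotCount && owners_[slot] == SlotOwner::Dext;
    }
private:
    std::array<SlotOwner, kSlotCount> owners_{};  // all slots start framework-owned
};
```

A `false` return from either transition would flag a double completion or an access to a slot the framework still owns. In a real driver these calls run on different queues, so the ledger would need its own synchronization.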

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 15

DTS Engineer OP

Apple

Feb ’26

Part 2...

Does this indicate a specific alignment restriction, memory fragmentation, or a resource conflict on Apple Silicon?

No, I don't think anything like that is going on.

Synchronization for Memory Visibility: To ensure that the virtual address mapping established in UserMap... is globally visible across all CPU cores and execution contexts before the kernel issues the first Bundled Task, are there recommended synchronization primitives or Memory Barrier calls?

No, I don't think any special synchronization is required. If you're tracking the buffer through a direct pointer, then the relative gap between UserMapBundledParallelTaskCommandAndResponseBuffers and the first possible call to BundledParallelTaskCompletion is so large that any kind of synchronization is unnecessary. This is also a fine reason to not bother holding/freeing the IOMemoryMap— if you never free your buffer pointer, then its values will only EVER be "NULL" (uninitialized) or its valid value (post initialized). The more I think about this "leak", the more I like it...

OSAction Reference Counting in Bundled Mode: Since multiple task slots share a single OSAction in Bundled Mode, should the DEXT perform manual retain/release operations on this shared action to align with the framework's lifecycle management?

I believe you filed a bug asking about this earlier, and I have my own bug on this (r.169737319). It fell off my radar for a bit, but I've asked the team for some guidance on the "right" way to handle this today. However, the technical answer is that as long as you don't over-release the action, it doesn't matter.

Implementation Strategy for Discovery Transition: Is there a recommended pattern for transitioning from the Discovery Phase (Inquiry/Report LUNs) to Bundled Mode?

I think the basic answer here is that your DEXT should complete each command through the same flow it arrived on. In practice, what will actually happen is that you may receive some commands through UserProcessParallelTask and then everything will shift to UserProcessBundledParallelTasks, but I don't think it's that difficult to design a driver that could (theoretically) mix the two paths arbitrarily.

However:

If the DEXT returns an error during UserMap... due to addresses not being ready,

In practice, I don't think UserMapBundledParallelTaskCommandAndResponseBuffers should ever fail. I think the GetAddress issue you ran into above is a somewhat odd edge case that will never occur with correct usage patterns, particularly given the very simple flow I outlined above. Certainly "GetAddress" returning NULL is NOT normal.

what is the impact on the possibility of re-entering Bundled Mode later?

I don't think this is possible. The bundled I/O transition happens as part of the initial configuration inside of start(), and there isn't any way to trigger it again.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 16

DTS Engineer OP

Apple

Feb ’26

I believe you filed a bug asking about this earlier, and I have my own bug on this (r.169737319). It fell off my radar for a bit, but I've asked the team for some guidance on the "right" way to handle this today.

Following up on myself after talking with the team, here is how I would suggest handling this:

In the first call to UserProcessBundledParallelTasks, you should:

1. Store the OSAction you receive into your DEXT's own ivars.
2. Intentionally retain() that OSAction. This retain will NOT be balanced, so you're intentionally over-retaining the OSAction (it will be destroyed when your DEXT is). You can actually retain it a few times if you want.
3. On all future calls to UserProcessBundledParallelTasks, assert that the OSAction you receive is the same as the action you received in #1, intentionally crashing if it changes.

Note that the point of #3 is NOT to detect a valid state you should anticipate or "handle". It's purely there as an overall "safety" check that will either never trigger or will trigger years from now for some totally unrelated reason. On that second point, I'll also note that this is an EXCELLENT place to add extended comments to save the poor fellow[1] who is lost trying to figure out why you did this years from now.

In any case, if that assert ever triggers, that either indicates that we've changed something fundamental to the overall architecture or a bug in your DEXT has damaged/altered the action you're supposed to be using. Either way, it's better to crash at that point instead of attempting to function in a totally unexpected state.

At some point I hope we'll address and document this (I've suggested making the action a singleton object that disables retain/release) but I'd expect anything we do here to work fine with the flow above.
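For readers who want to see the shape of that flow, here is a minimal sketch in plain C++. `FakeAction` is a hypothetical ref-counted stand-in for OSAction and `DriverState` stands in for the DEXT's ivars. The function returns `false` where a real driver would assert and crash, purely so the behavior can be exercised in a test:

```cpp
// Hypothetical ref-counted stand-in for OSAction.
struct FakeAction {
    int retainCount = 1;
    void retain() { ++retainCount; }
};

// Hypothetical stand-in for the DEXT's ivars.
struct DriverState {
    FakeAction *bundledAction = nullptr;
};

// Returns true when the invariant holds. A real driver would assert
// (and intentionally crash) instead of returning false.
inline bool CheckBundledAction(DriverState &state, FakeAction *received)
{
    if (state.bundledAction == nullptr) {
        // First call: pin the action with a deliberately unbalanced retain.
        // It is released only when the whole driver goes away.
        received->retain();
        state.bundledAction = received;
        return true;
    }
    // Later calls: the action must never change.
    return state.bundledAction == received;
}
```

This is also where the extended comment belongs, explaining that the extra retain is intentional and why a mismatch is treated as fatal.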

[1] Always remember, the poor fellow you save might just be you.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 17

charles.cc OP

Feb ’26

Hi Kevin,

Thank you for the guidance. Based on your suggestions, we have implemented architectural modifications. Below is the status of implementation and the current issue.

We store and retain IOMemoryDescriptor and IOMemoryMap objects in ivars. The ISR accesses the shared buffer address, which resolved the issue where GetAddress() returned NULL.

The ISR differentiates between command sources. In Bundled mode, the DEXT calls BundledParallelTaskCompletion without calling release(). In Legacy mode, it calls ParallelTaskCompletion followed by release(). These changes eliminated the 0x92000006 Kernel Panic and DEXT Corpse crashes.

The kernel dispatches Bundled commands immediately after UserInitializeTargetForID returns, but before UserCreateTargetForID completes. We found that reporting command completion while UserCreateTargetForID is still executing causes the UserCreateTargetForID call to hang.

Based on this behavior, we infer a re-entrancy deadlock. The registration thread waits for the SAM target object to be fully instantiated, while the DEXT simultaneously reports an I/O completion for that same target. We believe this causes a locking conflict within the kernel state machine. Since UserCreateTargetForID does not return, the serial discovery queue is blocked, stopping all subsequent target registrations.

Is our understanding correct? How should we resolve the issue we are currently facing?

Best Regards,

Charles

Answer 18

charles.cc OP

Feb ’26

Hi Kevin,

Following up on my previous update regarding the registration hang and SAM layer panic. We performed further experiments using the Selection Timeout (SERVICE_DELIVERY_FAILURE) approach as you suggested. Below are the results:

1. Selection Timeout Experiment Results

We modified the DEXT to report SERVICE_DELIVERY_FAILURE immediately for Bundled commands arriving before the registration returns. We confirmed fControllerTaskIdentifier matches the request.

Stability Improvement: With this change, any attempt to unplug the hardware or deactivate the DEXT no longer triggers a Kernel Panic. Resource lifecycle management (retaining/releasing descriptors) is now functioning correctly.
Persistent Hang: Despite reporting the timeout, UserCreateTargetForID remains hung indefinitely and never returns on its own.

2. Log Evidence: The "Unlock" Mechanism

The logs show that the kernel registration thread is blocked until a termination signal is received.

Log A: Hang after Selection Timeout

default 14:00:07.773080  kernel  [AsyncCreateTargetForID_Impl] Calling UserCreateTargetForID for LUN 0
default 14:00:07.773519  kernel  [UserInitializeTargetForID_Impl] Target 0 callback success.

// DEXT reports Selection Timeout via Bundled API
default 14:00:07.774781  kernel  [UserProcessBundledParallelTasks_Impl] [Guard] Target 0 not ready. Reporting Selection Timeout.
default 14:00:07.774789  kernel  [UserProcessBundledParallelTasks_Impl] // --- } UserProcess Return

// DEADLOCK: No further output. AsyncCreateTargetForID_Impl execution is interrupted.

Log B: Stop Sequence Unblocking the Queue

Upon issuing the Deactivate command, the original UserCreateTargetForID for LUN 0 returns success immediately after Stop() starts. The serial queue then proceeds to subsequent LUNs.

default 14:02:33.557678  kernel  [Stop_Impl] // { ---
default 14:02:33.558307  kernel  [Stop_Impl_block_invoke] All cancels finished. Calling super::Stop.

// Registration for LUN 0 finally returns success
default 14:02:33.558402  kernel  [AsyncCreateTargetForID_Impl] Target 0 fully registered.
default 14:02:33.558406  kernel  [AsyncCreateTargetForID_Impl] // --- } LUN 0 Exit

// Serial queue proceeds; LUN 1 and LUN 2 fail with kIOReturnAborted during termination
default 14:02:33.558460  kernel  [AsyncCreateTargetForID_Impl] Calling UserCreateTargetForID for LUN 1
default 14:02:33.558484  kernel  [AsyncCreateTargetForID_Impl] Target 1 registration failed: 0xe00002bc
default 14:02:33.558559  kernel  [AsyncCreateTargetForID_Impl] Calling UserCreateTargetForID for LUN 2
default 14:02:33.558634  kernel  [AsyncCreateTargetForID_Impl] Target 2 registration failed: 0xe00002bc

3. Request for Guidance

Our data confirms that reporting either SUCCESS or FAILURE during registration does not allow UserCreateTargetForID to return under normal conditions. It only returns when the DEXT enters its termination phase.

Is there a specific signal, barrier, or status required in Bundled mode to notify the SAM layer that the probe is complete and allow UserCreateTargetForID to return normally? Or should we handle these initial commands differently to avoid this deadlock?

Best Regards,

Charles

Answer 19

DTS Engineer OP

Apple

Feb ’26

First off, revisiting my own comments:

No, I don't think any special synchronization is required. If you're tracking the buffer through a direct pointer, then the relative gap between

Please check your email and the bug site for some specific feedback from the engineering team, as they have some comments specific to your code they wanted to pass back. I don't think they'll change the immediate issue but they're worth noting and integrating.

Is there a specific signal, barrier, or status required in Bundled mode to notify the SAM layer that the probe is complete and allow UserCreateTargetForID to return normally?

No. You're creating the deadlock, not the SAM layer.

Or should we handle these initial commands differently to avoid this deadlock?

What else is attached to the queue UserCreateTargetForID is called on? The answer should be "nothing", as my guess is that you're deadlocking on whatever call looped "back" to UserCreateTargetForID().

In terms of the expected call chain, looking at our code, I'd expect:

1. UserInitializeTargetForID
2. UserDoesHBASupportMultiPathing
3. SAM stack issues INQUIRY

In terms of #3, our code has multiple retries (~8) on most of the code paths I've seen, so if you're only seeing one call, then I think the issue is actually that you're unable to complete the command, presumably because you deadlocked yourself.

The logs show that the kernel registration thread is blocked until a termination signal is received.

This is somewhat misleading. I think what's actually happening here is that kernel side tear down is severing your DEXT connections to the kernel, which basically ends up failing "everything".

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 20

charles.cc OP

Feb ’26

Hi Kevin,

Thank you for your feedback and the feedback from the engineering team. We have integrated all suggestions from your forum posts (ID: 875288022, 875587022) and the Bug Report, and have conducted a full round of testing. Below is our current status.

We now perform a memset on SCSIUserParallelResponse in the ISR and correctly populate the version, fControllerTaskIdentifier, and fBytesTransferred fields. The ISR now differentiates between Bundled/Legacy modes, calling BundledParallelTaskCompletion (without release()) and ParallelTaskCompletion (with release()) accordingly.

The aforementioned fixes have resolved all 0x92000006 Panics and DEXT Corpse crashes. Unplugging the hardware or deactivating the DEXT while the driver is in a hung state no longer triggers a panic.

We are in a logical deadlock. The kernel dispatches a probe command before UserCreateTargetForID returns, and both of our methods for handling this command result in a permanent hang of the registration process:

Scenario A (Reporting SUCCESS): We mark the target as "Ready" during the UserInitializeTargetForID phase, allowing the command to be processed by the hardware and completed as SUCCESS by the ISR (with a fully initialized data structure).

Result: UserCreateTargetForID hangs indefinitely. Logs confirm the ISR successfully calls BundledParallelTaskCompletion, but the registration call never returns.

Scenario B (Reporting Selection Timeout): We mark the target as "Ready" only after UserCreateTargetForID returns and intercept the preemptive command in UserProcess... to report SERVICE_DELIVERY_FAILURE.

Result: UserCreateTargetForID also hangs indefinitely.

In your latest reply, you suggested the deadlock might be caused by our queue configuration. We have reviewed our architecture:

Our UserCreateTargetForID runs on a dedicated serial queue (AuxiliaryQueue).
All I/O completions (BundledParallelTaskCompletion) occur on the InterruptQueue or DefaultQueue.
There are no shared IOLocks between these execution paths.

Based on this, a queue deadlock within the DEXT seems unlikely.

In both Scenarios A and B, we observed the exact same behavior: the hung UserCreateTargetForID call returns success immediately only when we manually deactivate the driver, triggering the Stop() sequence.

It appears the kernel's registration thread is waiting for a signal that the DEXT has not yet sent, and this signal is only triggered when the driver terminates.

We have now ruled out data structure corruption, API misuse, and DEXT-level queue deadlocks. Is there anything else we should be aware of that we might have missed?

Best Regards,

Charles

Answer 21

DTS Engineer OP

Apple

Feb ’26

First off, I want to start with a clarification here:

We are in a logical deadlock. The kernel dispatches a probe command before UserCreateTargetForID returns, and both of our methods for handling this command result in a permanent hang of the registration process:

Calling "UserCreateTargetForID" means "please create the storage stack for this target". Returning from it means "I've finished creating the storage stack for this target". I'm not sure how far up the stack you'll actually get, but it's conceivable that we'd get all the way through partition map interpretation and (possibly) volume format detection BEFORE UserCreateTargetForID returns. You're basically guaranteed to get I/O requests before UserCreateTargetForID returns.

[1] I think the upper levels of the SAM stack prevent this by returning from state before calling registerForService on their IOStorage family nubs, but there's no technical reason why they'd HAVE to work this way.

That leads to here:

We mark the target as "Ready" during the UserInitializeTargetForID phase, allowing the command to be processed by the hardware and completed as SUCCESS by the ISR (with a fully initialized data structure).

I'm still confused by this. Calling "UserCreateTargetForID" means "I'm ready for this device to start working", which means your entire I/O "chain" for that device should be "ready" before you call it. In any case, the point here is that you shouldn't call UserCreateTargetForID until you’re ready to handle "arbitrary" I/O requests, just like you would once the controller is fully active.

Moving to here:

Our UserCreateTargetForID runs on a dedicated serial queue (AuxiliaryQueue).

Just to clarify, how does your larger code "around" UserCreateTargetForID actually work?

For reference, our controller does some basic configuration in UserStartController(), calls into AsyncEventHandler, then immediately returns. That handler does some synchronous I/O and DMA configuration, eventually calling UserCreateTargetForID(). However, the critical point here is that "nothing" else is happening on the default queue (where UserStartController() was called) once the creation process for UserCreateTargetForID starts.

Also, just to be clear, ONLY two methods should be tied to that queue. The first is the declaration of UserCreateTargetForID, which you inherit:

virtual kern_return_t
UserCreateTargetForID ( SCSIDeviceIdentifier    targetID,
						OSDictionary *          targetDict ) QUEUENAME ( AuxiliaryQueue );

And the HandleAsyncEvent method you set up to call it through:

virtual void
AsyncEventHandler     ( OSAction *					action TARGET,
						kern_return_t				status ) = 0;

virtual void
HandleAsyncEvent ( OSAction *				action,
				   kern_return_t			status )
				   TYPE ( ExampleSCSIDext::AsyncEventHandler ) QUEUENAME ( AuxiliaryQueue );

Covering the other details, the setup code for this looks like this:

kern_return_t
IMPL ( ExampleSCSIDext, UserInitializeController )
{
	kern_return_t ret = kIOReturnSuccess;
...	
	ret = IODispatchQueue::Create ( "FCQ", 0, 0, &ivars->fQueue1 );
	assert(kIOReturnSuccess == ret);
	
	ret = SetDispatchQueue ( "AuxiliaryQueue", ivars->fQueue1 );
	assert ( kIOReturnSuccess == ret );
		
	ret = CreateActionHandleAsyncEvent ( sizeof ( void * ), &ivars->fAsyncEventHandler );
	assert ( kIOReturnSuccess == ret );
...
}

And the call to it like this:

kern_return_t
IMPL ( ExampleSCSIDext, UserStartController )
{
	kern_return_t ret = kIOReturnSuccess;
...

	AsyncEventHandler ( ivars->fAsyncEventHandler, kIOReturnSuccess );

	return ret;
}
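To see why the hand-off to a separate queue matters, here is a deliberately simplified, single-threaded toy model. It does not use DriverKit; `ToySerialQueue` and `ToyCreateTarget` are hypothetical, and "would block forever" is modeled as a `false` return rather than an actual hang. The only rule it encodes is that a serial queue cannot run a second block while one is already in progress:

```cpp
#include <functional>

// Toy model of a serial dispatch queue: one block at a time. A block that
// needs the queue while it is occupied cannot run.
struct ToySerialQueue {
    bool busy = false;

    // Try to run `work` right now; fails if the queue is already occupied.
    bool TryRun(const std::function<void()> &work) {
        if (busy) return false;
        busy = true;
        work();
        busy = false;
        return true;
    }
};

// Toy stand-in for UserCreateTargetForID: it cannot finish until a callback
// (think UserInitializeTargetForID) has run on the default queue.
inline bool ToyCreateTarget(ToySerialQueue &defaultQueue)
{
    bool initialized = false;
    bool ran = defaultQueue.TryRun([&] { initialized = true; });
    return ran && initialized;  // false models "would block forever"
}
```

Calling `ToyCreateTarget` from a block that is itself running on the default queue fails in this model, while calling it from a block on a separate queue succeeds because the default queue is idle and free to service the callback.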

It appears the kernel's registration thread is waiting for a signal that the DEXT has not yet sent, and this signal is only triggered when the driver terminates.

Sort of. I think what's actually happening is that the kernel is issuing an I/O request, then blocking because that I/O request isn't returning properly. In terms of direct investigation, the main thing to look at here is spindump while your DEXT is hung:

sudo spindump -o <destination path>

The spindump file will include the kernel frames which should, in theory, show what the kernel is actually hung on. However, if you want me to look into this, then please do the following:

Make sure your DEXT is logging as much as it possibly can. I'd log every method entry and exit, along with additional logging every time you call into the kernel. Basically, you want your DEXT to be as visible as possible in the system console.
Reproduce the hang, then trigger a sysdiagnose.
Pull the device to trigger driver teardown, let everything finish, then trigger another sysdiagnose.

Label both sysdiagnoses so I know which is which, then upload both of them to your bug. FYI, I'm asking for two sysdiagnoses (instead of just doing one after the test finishes) because I want to see the spindump and registry state at the point of the hang plus the log data from the hang. I don't necessarily "need" the second sysdiagnose, but it's possible that logging from teardown will show me something interesting.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 22

charles.cc OP

2w

Hi Kevin,

Thank you for your previous guidance regarding memory ownership and OSAction lifecycle management. The Bundled Mode implementation is now stable, and the driver operates without crashes.

We have conducted performance stress tests using an LSI 3108 RAID controller via Thunderbolt 3, configured with 4 HDDs in RAID 5. The results from AJA System Test Lite (4GB, 10bit RGB) and fio (BS=1M, iodepth=32, Direct=1) show that I/O throughput in Bundled Mode remains at approximately 800 MB/s, which is identical to the performance measured in Legacy Mode.

Analysis of the driver logs during these tests reveals that the count parameter in UserProcessBundledParallelTasks is consistently 1, even under high queue depth. The batching mechanism intended to reduce RPC and context switch overhead does not appear to be active.

Our implementation for reporting device capacity and task limits follows the standard IOUserSCSIParallelInterfaceController specifications. UserReportHighestSupportedDeviceID returns an ID of 63, and UserReportMaximumTaskCount is set to 64 as shown below:

kern_return_t IMPL(DRV_MAIN_CLASS_NAME, UserReportHighestSupportedDeviceID)
{
    *id = 63;
    return kIOReturnSuccess;
}

kern_return_t IMPL(DRV_MAIN_CLASS_NAME, UserReportMaximumTaskCount)
{
    *count = 64;
    return kIOReturnSuccess;
}

The processing entry point consistently reports only a single slot per call:

void 
DRV_MAIN_CLASS_NAME::UserProcessBundledParallelTasks_Impl(
        const uint16_t indices[32],
        uint16_t count,
        OSAction * completion)
{
    // Log consistently shows 'count' is 1
    uLog("[BUNDLED] Mode enabled, processing %u slots", count);
    
    if (!ivars->useBundledMode || !ivars->fCommandBuffers || !ivars->fResponseBuffers) {
        return;
    }
    
    for (uint16_t i = 0; i < count; i++) {
        uint16_t slotIndex = indices[i];
        // Dispatch logic...
    }
}

Example of the console log output during fio testing with iodepth=32:

default 20:45:33.098031+0800 kernel [UserProcessBundledParallelTasks_Impl] [BUNDLED] Mode enabled, processing 1 slots

Given these observations, we would like to understand the specific conditions or thresholds required for the kernel's block layer to batch multiple SCSI commands into a single RPC call. We are interested in whether the kernel's bundling logic prioritizes high IOPS (Random I/O) over high bandwidth (Sequential I/O), and if the 1MB block size used in our tests might cause the kernel to dispatch commands immediately to minimize latency instead of waiting to bundle them.

Furthermore, we would like to verify if there are other reported parameters beyond UserReportMaximumTaskCount that influence the frequency of batching, or if there is a recommended method to determine if the macOS Block Layer is attempting to bundle commands before they reach the DEXT layer.

Thank you for your expertise and time.

Best Regards,

Charles

Answer 23

DTS Engineer OP

Apple

2w

We have conducted performance stress tests using an LSI 3108 RAID controller via Thunderbolt 3, configured with 4 HDDs in RAID 5. The results from AJA System Test Lite (4GB, 10-bit RGB) and fio (BS=1M, iodepth=32, Direct=1) show that I/O throughput in Bundled Mode remains at approximately 800 MB/s, which is identical to the performance measured in Legacy Mode.

Do either of these tests generate parallel I/O (meaning, multiple threads are generating I/O to the same target)? More specifically, I'd look at:

Parallel I/O within the test itself.
I/O targeting different partitions on the same target.
I/O targeting unrelated targets on the same controller.

Also, what does your overall I/O flow look like? Is your card receiving commands fast enough that it's processing them in parallel, even though they arrived as separate commands? What's your peak simultaneous task count?

Analysis of the driver logs during these tests reveals that the count parameter in UserProcessBundledParallelTasks is consistently 1, even under high queue depth. The batching mechanism intended to reduce RPC and context switch overhead does not appear to be active.

One quick note here is that the batching mechanism is actually trying to address two different issues simultaneously:

Batching commands so as to reduce the number of IPC calls between your DEXT and the kernel.
Preallocating and reusing I/O buffers to reduce the overhead of copying data in/out of the kernel.

Note that the first one in particular primarily benefits cases where the system is flooded with LOTS of very small I/O requests (e.g., ~4k). Otherwise, the I/O cost itself becomes the primary performance factor.

We are interested in whether the kernel's bundling logic prioritizes high IOPS (Random I/O) over high bandwidth (Sequential I/O), and if the 1MB block size used in our tests might cause the kernel to dispatch commands immediately to minimize latency instead of waiting to bundle them.

I think "bundling logic" overstates what's actually going on here. By the time the command reaches the SCSI controller layer, the system isn't really trying/willing to "hold" any request, so it's going to try and send it to your driver "as soon as possible". The way this actually works is that:

The kernel driver receives and processes requests on one thread (preparing them for your DEXT).
Once the command is prepped, it's transferred to another thread (which actually sends the command to your DEXT).

If the total command volume is relatively low, that process is effectively "serial": the send queue is always idle when the prep queue has a command ready, so it sends the command "immediately". As command volume increases, you'll eventually reach the point where the send queue IS busy (because it's sending a command to your DEXT), so the newly prepped command has to wait in the queue.

Where bundled I/O matters is in how those queued commands are handled - bundled I/O means that the kernel can send that "backlog" in a single bundled command instead of sending individual commands for each entry.
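
The prep/send behavior described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the kernel's actual implementation: simulateBundles and its parameters are hypothetical, commands are assumed to be prepped at a fixed interval, and each send is assumed to occupy the send stage for a fixed time regardless of how many commands it carries.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Toy model only; it does not reproduce the kernel's actual implementation.
// Command i becomes "prepped" at time i * arrivalInterval. Each send keeps the
// send stage busy for sendCost time units, no matter how many commands it
// carries. Returns the number of commands carried by each send.
std::vector<uint32_t> simulateBundles(uint32_t commandCount,
                                      uint32_t arrivalInterval,
                                      uint32_t sendCost,
                                      uint32_t maxBundle)
{
    std::vector<uint32_t> bundles;
    if (maxBundle == 0) {
        return bundles;  // nothing can ever be sent
    }

    uint64_t senderFreeAt = 0;  // time at which the send stage is idle again
    uint32_t next = 0;          // index of the oldest unsent command

    while (next < commandCount) {
        // The send starts once the sender is idle and a command is waiting.
        const uint64_t arrival = static_cast<uint64_t>(next) * arrivalInterval;
        const uint64_t start = std::max(senderFreeAt, arrival);

        // Everything prepped by 'start' is the backlog for this send.
        uint32_t ready = 0;
        while (ready < maxBundle &&
               next + ready < commandCount &&
               static_cast<uint64_t>(next + ready) * arrivalInterval <= start) {
            ready++;
        }

        bundles.push_back(ready);
        next += ready;
        senderFreeAt = start + sendCost;
    }
    return bundles;
}
```

When commands arrive more slowly than they can be sent, simulateBundles(4, 10, 5, 32) yields {1, 1, 1, 1}. When they arrive faster, simulateBundles(9, 1, 4, 32) yields {1, 4, 4}: the first command goes out alone, and each later send picks up whatever backlog built up while the previous send was in flight.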

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 24

charles.cc OP

2w

Hi Kevin,

Thank you for your guidance on the UserProcessBundledParallelTasks architecture and IPC optimization. After several adjustments, our DEXT driver is now running stably. We have completed a comprehensive performance comparison between Bundled Mode and Legacy Mode. Below is a summary of our findings.

Following your suggestion, we analyzed the console logs during 4K Random I/O stress tests (iodepth=32). Across the 6,363 calls logged, we confirmed that the kernel actively batches commands:

Results:

Slots per Call (Count)	Frequency	Observation
1 Slot	5,654	Standard/Low load; kernel dispatches immediately.
2 Slots	628	Active Batching initiated under increased pressure.
3 - 4 Slots	75	Sustained high-concurrency batching.
5+ Slots	6	High-load peaks (Max observed: 11 slots per call).

This data confirms that the batching mechanism effectively reduces the number of RPC/IPC calls to the DEXT when the system is under pressure.
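
Because two of the rows are ranges, the exact number of commands delivered cannot be recovered from the table, but it can be bounded. The sketch below does that arithmetic; Bin, Bounds, and summarize are hypothetical names, and the upper bound assumes every call in the "5+" row carried the maximum observed 11 slots.

```cpp
#include <cstddef>
#include <cstdint>

// One row of the table: calls that carried between minSlots and maxSlots.
struct Bin {
    uint32_t minSlots;
    uint32_t maxSlots;
    uint64_t calls;
};

struct Bounds {
    uint64_t calls;        // total calls into the DEXT
    uint64_t minCommands;  // fewest commands those calls could have carried
    uint64_t maxCommands;  // most commands those calls could have carried
};

// Values copied from the table above.
static const Bin kObserved[] = {
    {1, 1, 5654},
    {2, 2, 628},
    {3, 4, 75},
    {5, 11, 6},  // assumption: 11 (max observed) used as the upper edge
};

Bounds summarize(const Bin* bins, size_t binCount)
{
    Bounds b = {0, 0, 0};
    for (size_t i = 0; i < binCount; i++) {
        b.calls += bins[i].calls;
        b.minCommands += bins[i].calls * bins[i].minSlots;
        b.maxCommands += bins[i].calls * bins[i].maxSlots;
    }
    return b;
}
```

For the table above, summarize(kObserved, 4) gives 6,363 calls carrying between 7,165 and 7,276 commands, i.e. between 802 and 913 fewer calls than a one-command-per-call model would have needed for this run.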

We conducted benchmarks using fio on a RAID 5 array (4 HDDs) connected via Thunderbolt 3 (4K Random R/W, Buffered I/O, 4 Jobs, Queue Depth 32).

Key Performance:

Mixed Random (70% Read / 30% Write):
- Read Throughput: 278 MiB/s (Legacy) → 536 MiB/s (Bundled) [~1.93x improvement]
- Write Throughput: 120 MiB/s (Legacy) → 231 MiB/s (Bundled) [~1.92x improvement]
Random Read (4K):
- Throughput: 121 MiB/s (Legacy) → 142 MiB/s (Bundled) [~17% improvement]
Random Write (4K):
- Performance was comparable (~35-36 MiB/s), limited by the physical seek characteristics of the HDDs.

As you anticipated, the advantages of Bundled Mode are most significant in mixed random workloads, where we observed nearly a 2x increase in throughput. This validates that reducing IPC overhead by utilizing shared command/response buffers significantly boosts performance when the system is saturated with a large volume of small I/O requests.

Thank you again for your expert assistance and professional advice throughout this process.

Best regards,

Charles

Answer 25

DTS Engineer OP

Apple

2w

A few minor comments:

As you anticipated, the advantages of Bundled Mode are most significant in mixed random workloads, where we observed nearly a 2x increase in throughput. This validates that reducing IPC overhead by utilizing shared command/response buffers significantly boosts performance when the system is saturated with a large volume of small I/O requests.

I know I suggested 4k, but I'd actually retest with 16k or, better yet, small multiples of 16k. That's the new page size and I'd expect the system’s "natural" I/O pattern to be multiples of 16k. Subpage I/O is also going to have some amount of performance impact (though I'm not sure how large), so I'd be curious what a more "natural" looking test looks like.
My intuition is that real-world performance impact is significantly larger than would be "obvious"[1] in synthetic benchmarks. The reality is that what generally pushes modern I/O systems is primarily "broad" demand from a variety of different sources, not simply high demand from a single source.

[1] I suspect this is also why the bundled I/O model was added later instead of being included in the initial release. The bottleneck in the original approach isn't clear until you subject the system to the kind of random I/O arrival patterns that real-world usage generates.

Thank you again for your expert assistance and professional advice throughout this process.

As always, you're very welcome!

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware