Hello everyone,
We are migrating our KEXT for a Thunderbolt storage device to a DEXT based on IOUserSCSIParallelInterfaceController.
We've run into a fundamental issue where the driver's behavior splits based on the I/O source: high-level I/O from the file system (e.g., Finder, cp) is mostly functional (with a minor ls -al sorting issue for Traditional Chinese filenames), while low-level I/O directly to the block device (e.g., diskutil) fails or acts unreliably. Basic read/write with dd appears to be mostly functional.
We suspect that our DEXT is failing to correctly register its full device "personality" with the I/O Kit framework, unlike its KEXT counterpart. As a result, low-level I/O requests with special attributes (like cache synchronization) sent by diskutil are not being handled correctly by the IOUserSCSIParallelInterfaceController framework of our DEXT.
Actions Performed & Relevant Logs
1. Discrepancy: diskutil info Shows Different Device Identities for DEXT vs. KEXT
For the exact same hardware, the KEXT and DEXT are identified by the system as two different protocols.
KEXT Environment:
Device Identifier: disk5
Protocol: Fibre Channel Interface
...
Disk Size: 66.0 TB
Device Block Size: 512 Bytes
DEXT Environment:
Device Identifier: disk5
Protocol: SCSI
SCSI Domain ID: 2
SCSI Target ID: 0
...
Disk Size: 66.0 TB
Device Block Size: 512 Bytes
2. Divergent I/O Behavior: Partial Success with Finder/cp vs. Failure with diskutil
- High-Level I/O (Partially Successful): In the DEXT environment, if we operate on an existing volume (e.g., /Volumes/MyVolume), file copy operations using Finder or cp succeed. Furthermore, the logs we've placed in our single I/O entry point, UserProcessParallelTask_Impl, are triggered.
  - Side Effect: However, running ls -al on such a volume shows an incorrect sorting order for files with Traditional Chinese names (they appear before . and ..).
- Low-Level I/O (Contradictory Behavior): In the DEXT environment, when we operate directly on the raw block device (/dev/disk5):
  - diskutil partitionDisk ... -> Fails 100% of the time with the error: Error: -69825: Wiping volume data to prevent future accidental probing failed.
  - dd command -> Basic read/write operations appear to work correctly (a write can be immediately followed by a read within the same DEXT session, and the data is correct).
3. Evidence of Cache Synchronization Failure (Non-deterministic Behavior)
The success of the dd command is not deterministic. Cross-environment tests prove that its write operations are unreliable:
- First Test:
  - In the DEXT environment, write a file with random data to /dev/disk5 using dd.
  - Reboot into the KEXT environment.
  - Read the data back from /dev/disk5 using dd. The result is a file filled with all zeros.
  - Conclusion: The write operation only went to the hardware cache, and the data was lost upon reboot.
- Second Test:
  - In the DEXT environment, write the same random file to /dev/disk5 using dd.
  - Key Variable: Immediately after, still within the DEXT environment, read the data back once for verification. The content is correct!
  - Reboot into the KEXT environment.
  - Read the data back from /dev/disk5. This time, the content is correct!
  - Conclusion: The additional read operation in the second test unintentionally triggered a hardware cache flush. This proves that the dd (in our DEXT) write operation by itself does not guarantee synchronization, making its behavior unreliable.
Our Problem
Based on the observations above, we have reached the following conclusion:
- High-Level Path (triggered by Finder/cp): When an I/O request originates from the high-level file system, the framework seems to enter a fully-featured mode. In this mode, all SCSI commands, including READ/WRITE, INQUIRY, and SYNCHRONIZE CACHE, are correctly packaged and dispatched to our UserProcessParallelTask_Impl entry point. Therefore, Finder operations are mostly functional.
- Low-Level Path (triggered by dd/diskutil): When an I/O request originates from the low-level raw block device layer:
  - The most basic READ/WRITE commands can be dispatched (which is why dd appears to work).
  - However, critical management commands, such as INQUIRY and SYNCHRONIZE CACHE, are not being correctly dispatched or handled. This leads to the incorrect device identification in diskutil info and the failure of diskutil partitionDisk due to its inability to confirm cache synchronization.
We would greatly appreciate any guidance, suggestions, or insights on how to resolve this discrepancy. Specifically, what is the recommended approach within DriverKit to ensure that a DEXT based on IOUserSCSIParallelInterfaceController can properly declare its capabilities and handle both high-level and low-level I/O requests uniformly?
Thank you.
Charles
> Based on the setProperty calls from our KEXT's source code and the properties from the .ioreg analysis, we have implemented the following four-part configuration in our DEXT:
I'm confused. Above I told you that the problem was:
- You've failed to define kIOMaximumSegmentByteCount* keys, as UserReportHBAConstraints basically "requires".
Have you defined that key?
More specifically, UserReportHBAConstraints() has a list of required keys:
Key: Required:
kIOMaximumSegmentCountReadKey = Yes
kIOMaximumSegmentCountWriteKey = Yes
kIOMaximumSegmentByteCountReadKey = Yes
kIOMaximumSegmentByteCountWriteKey = Yes
kIOMinimumSegmentAlignmentByteCountKey = Yes
kIOMaximumSegmentAddressableBitCountKey = Yes
kIOMinimumHBADataAlignmentMaskKey = Yes
Your DEXT has not defined all of them and, as they are required, you should not expect your DEXT to function properly until you've defined all of them.
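As a rough illustration of the "all seven or nothing" point, here is a minimal host-side sketch of a completeness check. It is not DriverKit code: the helper names are hypothetical, and the keys are written as their constant names for readability rather than as the string values those constants expand to.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// The seven keys listed above, spelled as their constant names.
// Assumption for illustration only: the real constants expand to
// different string values inside the kernel headers.
inline const std::vector<std::string> &RequiredConstraintKeys()
{
    static const std::vector<std::string> keys = {
        "kIOMaximumSegmentCountReadKey",
        "kIOMaximumSegmentCountWriteKey",
        "kIOMaximumSegmentByteCountReadKey",
        "kIOMaximumSegmentByteCountWriteKey",
        "kIOMinimumSegmentAlignmentByteCountKey",
        "kIOMaximumSegmentAddressableBitCountKey",
        "kIOMinimumHBADataAlignmentMaskKey",
    };
    return keys;
}

// Returns every required key that is absent from `defined`,
// in the same order as the list above.
inline std::vector<std::string> MissingConstraintKeys(const std::set<std::string> &defined)
{
    std::vector<std::string> missing;
    for (const auto &key : RequiredConstraintKeys()) {
        if (defined.count(key) == 0) {
            missing.push_back(key);
        }
    }
    return missing;
}
```

A driver author could run a check like this against whatever set of keys the DEXT currently reports and treat any non-empty result as a configuration bug.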
Similarly:
> Low-Level Sync (UserGetDMASpecification): We also set maxTransferSize to 512KB to ensure consistency with the HBA layer.
I haven't gotten explicit confirmation from the engineering team, but at this point I'm fairly convinced that maxTransferSize MUST satisfy either:
maxTransferSize >= kIOMaximumSegmentCountReadKey * kIOMaximumSegmentByteCountReadKey
OR
maxTransferSize >= kIOMaximumSegmentCountWriteKey * kIOMaximumSegmentByteCountWriteKey
...whichever of the two is larger. maxTransferSize basically defines "the largest possible transfer your controller could EVER handle", which would obviously be your segment count * segment size. Critically, using a smaller maxTransferSize won't cause an immediate failure, only a later failure if/when the kernel actually tries to "give you" a large enough transfer.
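To make that relationship concrete, here is a small sketch of the arithmetic. It is plain C++ rather than DriverKit code, the struct and function names are hypothetical, and it encodes only the rule as stated above (which, as noted, has not been confirmed by engineering).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical container for the four segment limits a DEXT reports.
// The field names are illustrative and are not DriverKit API.
struct SegmentLimits {
    uint64_t maxSegmentCountRead;
    uint64_t maxSegmentByteCountRead;
    uint64_t maxSegmentCountWrite;
    uint64_t maxSegmentByteCountWrite;
};

// Smallest maxTransferSize consistent with the rule above:
// the larger of (segment count * segment byte count) for reads and writes.
inline uint64_t MinimumMaxTransferSize(const SegmentLimits &limits)
{
    return std::max(limits.maxSegmentCountRead * limits.maxSegmentByteCountRead,
                    limits.maxSegmentCountWrite * limits.maxSegmentByteCountWrite);
}

// True when the chosen maxTransferSize covers the largest transfer
// the segment limits would allow the kernel to build.
inline bool IsTransferSizeConsistent(uint64_t maxTransferSize, const SegmentLimits &limits)
{
    return maxTransferSize >= MinimumMaxTransferSize(limits);
}
```

For example, with 128 read segments of 64 KB each, the minimum comes out to 8 MB, so a maxTransferSize of 512 KB would not be consistent under this rule.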
I've actually just filed a bug on this (r.164177660) asking that IOUserSCSIParallelInterfaceController fail completely if its configuration is in any way incomplete.
> (We have confirmed that IORegistryExplorer shows the Protocol Characteristics were set successfully, but IOMaximumBlockCountWrite remains 0xffff.)
Yes. Your DEXT cannot set kIOMaximumByteCountWriteKey, which means it will return to its default value of 0xffff.
> The failure of bs=768k, however, shows that IOKit's IOBreaker splitting functionality has not been successfully activated. If it had been, we should have at least seen the first 512KB sub-request in our DEXT's log, but in fact, we saw nothing.
No, that's NOT what it shows. Your DEXT is going to fail ANY transfer larger than 512KB because that's what YOU defined maxTransferSize as:
> The failure of bs=768k, however, shows that IOKit's IOBreaker splitting functionality has not been successfully activated.
You're right, it's not activating. That's because you're currently telling the kernel that you can handle a transfer up to:
IOMaximumBlockCountRead = 0xffff -> 31 MB
...so any transfer smaller than 31 MB will be passed directly down the I/O stack. However, IOUserSCSIParallelInterfaceController is going to fail any transfer larger than maxTransferSize.
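The numbers behind this can be sketched as follows. This is only a model of the behavior described in this reply, not framework code; the function names are hypothetical, and the 512-byte block size is taken from the diskutil output earlier in the thread. With those inputs, 0xffff blocks works out to 33,553,920 bytes, just under 32 MiB.

```cpp
#include <cassert>
#include <cstdint>

// Bytes covered by a block-count limit at a given block size.
inline uint64_t BlockLimitToBytes(uint64_t maxBlockCount, uint64_t blockSizeBytes)
{
    return maxBlockCount * blockSizeBytes;
}

// Models the gap described above: a request small enough that the upper
// layers pass it down unsplit, yet larger than the driver's maxTransferSize,
// so it is rejected before it reaches the DEXT's I/O entry point.
inline bool PassesUnsplitButFails(uint64_t requestBytes,
                                  uint64_t advertisedLimitBytes,
                                  uint64_t maxTransferSize)
{
    return requestBytes <= advertisedLimitBytes && requestBytes > maxTransferSize;
}
```

Under this model, a bs=768k request against a 512 KB maxTransferSize falls squarely in that gap, while a 256 KB request does not.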
Again, the way to fix this is to:
1. Define the two kIOMaximumSegmentCount keys, so that IOBreaker will divide the requests into something "reasonable".
2. Set maxTransferSize to a large enough value that it can handle the configuration you're creating in #1.
> Our plan for the next step is to manually write a "second-layer I/O splitter" from scratch within our UserProcessParallelTask_Impl (for example, to split a received 512KB request into eight 64KB hardware commands).
That depends on what you mean by "splitter". The expected implementation here is that you'll take the value fBufferIOVMAddr and use basic math to divide it up into smaller chunks, each of which will be a scatter gather entry you pass over to your card.
Historically, I believe this was done by subdividing the IOMemoryDescriptor and generating individual IODMACommands; however, the nature of the DART means that this is somewhat silly and unnecessary, so it's done with a single IODMACommand over the entire descriptor.
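The "basic math" here could look something like the sketch below. It is a host-side illustration only: SGEntry and SplitIntoChunks are hypothetical names, the real scatter/gather layout is whatever your hardware defines, and inside a DEXT you would more likely fill a fixed-size hardware descriptor array than a std::vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical scatter/gather entry; real hardware defines its own layout.
struct SGEntry {
    uint64_t address;  // I/O virtual address of this chunk
    uint64_t length;   // chunk length in bytes
};

// Divides the range [baseAddr, baseAddr + totalLength) into entries of at
// most chunkSize bytes. The final entry is shorter when totalLength is not
// a multiple of chunkSize.
inline std::vector<SGEntry> SplitIntoChunks(uint64_t baseAddr,
                                            uint64_t totalLength,
                                            uint64_t chunkSize)
{
    assert(chunkSize > 0);
    std::vector<SGEntry> entries;
    for (uint64_t offset = 0; offset < totalLength; offset += chunkSize) {
        const uint64_t length = std::min(chunkSize, totalLength - offset);
        entries.push_back({baseAddr + offset, length});
    }
    return entries;
}
```

Here baseAddr stands in for the fBufferIOVMAddr value of the task, so a 512 KB request split with a 64 KB chunk size yields eight contiguous entries.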
Finally, making sure this is as clear as possible, this does mean that maxTransferSize is likely to be much, potentially MUCH, larger than you'd "expect" in the older architecture. I don't have the data at hand to validate the exact number, but I believe there is a Fibre Channel DEXT which is setting maxTransferSize to ~1 GB.
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware