DEXT (IOUserSCSIParallelInterfaceController): Direct I/O Succeeds, but Buffered I/O Fails with Data Corruption on Large File Copies

Hi all,

We are migrating a SCSI HBA driver from KEXT to DriverKit (DEXT), with our DEXT inheriting from IOUserSCSIParallelInterfaceController. We've encountered a data corruption issue that is reliably reproducible under specific conditions and are hoping for some assistance from the community.

Hardware and Driver Configuration:

  • Controller: LSI 3108
  • DEXT Configuration: We are reporting our hardware limitations to the framework via the UserReportHBAConstraints function, with the following key settings:
    // Inside our UserReportHBAConstraints override (simplified):
    constraints->setObject(kIOMaximumSegmentAddressableBitCountKey,
                           OSNumber::withNumber(32ull, 64)); // 32-bit addressing
    constraints->setObject(kIOMaximumSegmentCountWriteKey,
                           OSNumber::withNumber(129ull, 64));
    constraints->setObject(kIOMaximumByteCountWriteKey,
                           OSNumber::withNumber(0x80000ull, 64)); // 512KB
    

Observed Behavior: Direct I/O vs. Buffered I/O

We've observed that the I/O behavior differs drastically depending on whether it goes through the system file cache:

1. Direct I/O (Bypassing System Cache) -> 100% Successful

When we use fio with the direct=1 flag, our read/write and data verification tests pass perfectly for all file sizes, including 20GB+.

2. Buffered I/O (Using System Cache) -> 100% Failure at >128MB

Whether we use the standard cp command or fio with the direct=1 option removed to simulate buffered I/O, we observe the exact same, clear failure threshold:

  • Test Results:

    • File sizes ≤ 128MB: Success. Data checksums match perfectly.
    • File sizes ≥ 256MB: Failure. Checksums do not match, and the destination file is corrupted.
  • Evidence of failure reproduced with fio (buffered_integrity_test.fio, with direct=1 removed):

    • fio --size=128M buffered_integrity_test.fio -> Test Succeeded (err=0).
    • fio --size=256M buffered_integrity_test.fio -> Test Failed (err=92), reporting the following error, which proves a data mismatch during the verification phase:
      verify: bad header ... at file ... offset 1048576, length 1048576
      fio: ... error=Illegal byte sequence
      

Our Analysis and Hypothesis

The phenomenon of "Direct I/O succeeding while Buffered I/O fails" suggests the problem may be related to the cache synchronization mechanism at the end of the I/O process:

  1. Our UserProcessParallelTask_Impl function correctly handles READ and WRITE commands.
  2. When cp or fio (buffered) runs, the WRITE commands are successfully written to the LSI 3108 controller's onboard DRAM cache, and success is reported up the stack.
  3. At the end of the operation, to ensure data is flushed to disk, the macOS file system issues an fsync, which is ultimately translated into a SYNCHRONIZE CACHE SCSI command (Opcode 0x35 or 0x91) and sent to our UserProcessParallelTask_Impl.
  4. We hypothesize that our code may not be correctly identifying or handling this SYNCHRONIZE CACHE opcode. It might be reporting "success" up the stack without actually commanding the hardware to flush its cache to the physical disk.
  5. The OS receives this "success" status and assumes the operation is safely complete.
  6. In reality, however, the last batch of data remains only in the controller's volatile DRAM cache and is eventually lost.
  7. This results in an incomplete or incorrect file tail, and while the file size may be correct, the data checksum will inevitably fail.
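To make the hypothesis in step 4 concrete, here is a minimal sketch of the opcode check we suspect may be missing. The helper name and the surrounding logic are our illustration only, not actual code from our DEXT; the opcodes are the standard SBC values:

```cpp
#include <cstdint>

// SYNCHRONIZE CACHE opcodes from the SBC specification.
constexpr uint8_t kSYNCHRONIZE_CACHE_10 = 0x35;
constexpr uint8_t kSYNCHRONIZE_CACHE_16 = 0x91;

// Hypothetical helper: decide whether a CDB is a cache-flush command that
// must be forwarded to the controller rather than completed as a no-op.
constexpr bool IsSynchronizeCache(const uint8_t *cdb) {
    return cdb[0] == kSYNCHRONIZE_CACHE_10 || cdb[0] == kSYNCHRONIZE_CACHE_16;
}
```

If UserProcessParallelTask_Impl never takes the "forward to hardware" branch for these opcodes, the flush would silently become a no-op, which matches the symptom we describe above.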

Summary

Our DEXT driver performs correctly when handling Direct I/O but consistently fails with data corruption when handling Buffered I/O for files larger than 128MB. We can reliably reproduce this issue using fio with the direct=1 option removed.

The root cause is very likely the improper handling of the SYNCHRONIZE CACHE command within our UserProcessParallelTask. P.S. This issue did not exist in the original KEXT version of the driver.

We would appreciate any advice or guidance on this issue.

Thank you.

We've observed that the I/O behavior differs drastically depending on whether it goes through the system file cache:

Quick question— how are you validating what the actual issue is? More specifically, are you pulling, unmounting the device, and testing with a known good driver? Or are you testing with your development DEXT? That's crucial because testing through your DEXT means you don't know whether this is a write or a read issue.

That leads to here:

At the end of the operation, to ensure data is flushed to disk, the macOS file system issues an fsync, which is ultimately translated into a SYNCHRONIZE CACHE SCSI command (Opcode 0x35 or 0x91) and sent to our UserProcessParallelTask_Impl.

Are you sure about that? How have you validated that? I haven't tried to validate the entire I/O path, but I’m fairly sure that copyfile() (what cp calls) does not call fsync().

FYI, the history here is somewhat complicated and "ugly," but in general, if the system were specifically trying to flush data, it would probably have called F_FULLFSYNC. However, the cost of that is fairly high, so it isn't something the system routinely does.

That leads me to here:

verify: bad header ... at file ... offset 1048576, length 1048576

That's an oddly specific offset, as it's exactly 1 MB. I don't see how that would track with your current theory.

In terms of next steps, I think I'd start by making sure you know EXACTLY what happened. On the testing side, my suggestion would be to do something like this:

  1. Zero out the entire drive.

  2. Partition the drive into "small-ish" volumes (so you minimize how much of the device the system will target).

  3. Format that partition with the simplest possible file system. I'd start with FAT or possibly unencrypted, non-journaled HFS+. I definitely would NOT use APFS. Too much complexity/noise.

  4. Reproduce the issue in the most controlled way possible.

  5. Read the drive back with a trustworthy controller so you can see EXACTLY what happened.

On the driver side, I'd focus on trying to figure out exactly what's being sent to the device. I'm not sure the kernel log will handle it, but I'd be tempted to build your own user client and use that to log out EVERY command that goes in/out of your DEXT. You won't be able to see the data itself, but you don't need to see the data as that offset should be "enough" to follow what's going on.

Related to that point, I'd focus your attention on unmount, NOT the copy itself. Unmount is the only point the system ACTUALLY promises to flush "everything", so that's the point you "know" the data had to reach your driver.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hi Kevin,

TL;DR (Core Issues)

  1. 100% Reproducible: We used two completely independent test methods (cp + sync and fio buffered mode), and both consistently reproduced the data corruption issue under Buffered I/O.
  2. Clear Failure Threshold: Both tests point to the exact same failure boundary: file sizes 128MB (inclusive) and below pass, but 256MB (inclusive) and above consistently fail.
  3. Key Clue Confirmed: The fio test again reported the exact same error as our initial finding: data corruption starts at the very precise offset of 1048576 bytes (1MB). You mentioned last time that this offset was "odd"; now we can confirm this is not a coincidence, but a core characteristic of the problem.

Test 1: System-Level File Copy Integrity Test (cp + sync)

This test aims to simulate standard user operations to rule out tool-specific issues.

  • Methodology: We used a shell script to automate cp for file copying, used sync to force the system to flush I/O caches, and finally compared the SHA256 hashes of the source and destination files.
  • Results:
    • 128MB File: Success. Hashes match perfectly.
    • 256MB File: Failure. Hash mismatch, confirming data corruption during the copy process.
  • Conclusion: This proves that the issue lies within the system I/O stack, and our DEXT has a defect when handling standard buffered writes.

Test 2: fio Buffered I/O Integrity Test

This test leverages fio's robust verification capabilities to obtain lower-level error details.

Settings & Expected Behavior

  • Settings (buffered_integrity_test.fio):

    • We removed direct=1 to force fio to use the OS file cache, consistent with the behavior of cp.
    • We set do_verify=1 and verify=sha256 to require data integrity checks after I/O operations.
  • fio Mechanism: The verification process consists of two independent phases (it does not verify-as-it-writes):

    1. Write Phase: fio writes the entire 256MB data (each 1MB block carries a unique header for verification) to the file.
    2. Verify Phase: After writing concludes, fio reads back the entire 256MB starting from the beginning (offset 0), checking the headers block by block.

Test Results

  • The results match the cp test exactly: 128MB passes, 256MB fails.
  • During the 256MB failure, we received the following error message:
    [*] Testing file size: 256M
    ...
    verify: bad header ... at file ... offset 1048576, length 1048576
    fio: ... error=Illegal byte sequence
    ...
    [!!!] TEST FAILED at file size: 256M
    

Error Message Analysis

The fio error report indicates the following:

  1. verify: bad header: This means the data block read back by fio during the "Verify Phase" has a corrupted header.

  2. at file ... offset 1048576: This tells us:

    • fio successfully verified the first 1MB (offset 0 to 1048575).
    • The error occurred when fio attempted to verify the second 1MB data block (starting at offset 1,048,576).
    • This indicates that our driver or controller starts failing when handling I/O requests that cross this 1MB address boundary.
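For clarity, "crossing the 1MB boundary" can be checked mechanically. Below is a small helper (ours, purely for illustration) that reports whether a transfer spans an aligned boundary; we plan to run incoming request offsets/lengths through a check like this in our logging:

```cpp
#include <cstdint>

// Does a transfer of byteLength bytes starting at byteOffset cross an
// aligned boundary of boundaryBytes (e.g. 1 MB = 1048576)?
constexpr bool CrossesBoundary(uint64_t byteOffset, uint64_t byteLength,
                               uint64_t boundaryBytes) {
    if (byteLength == 0) return false;
    uint64_t firstChunk = byteOffset / boundaryBytes;
    uint64_t lastChunk  = (byteOffset + byteLength - 1) / boundaryBytes;
    return firstChunk != lastChunk;
}
```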

Synthesis & Next Steps

We suspect there is a bug in our DEXT driver when handling Buffered I/O writes. The trigger conditions are:

  • The I/O must be Buffered I/O.
  • The total I/O for a single file must exceed a threshold (> 128MB).
  • When these conditions are met, the driver or hardware errs when processing write requests that cross the 1MB byte boundary, causing corruption for all subsequent data.

This is likely related to the processing flow of the SYNCHRONIZE CACHE command, and how I/O requests are segmented or addressed in memory.

Our Next Plan: Following your suggestion, we will implement detailed logging within the DEXT. Our goal is to capture all SCSI commands and their parameters (specifically LBA and transfer length) entering and exiting the driver during the failed 256MB fio test. We will focus on observing exactly what happens when I/O requests touch and cross this critical 1MB location.
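As a sketch of the LBA/length extraction we plan to log (the helper is ours; the field layout follows the SBC definition of READ(10)/WRITE(10), opcodes 0x28/0x2A, where the LBA is big-endian in CDB bytes 2-5 and the transfer length, in blocks, is in bytes 7-8):

```cpp
#include <cstdint>

struct Rw10Info {
    uint32_t lba;     // logical block address
    uint16_t blocks;  // transfer length in blocks
};

// Decode LBA and transfer length from a READ(10)/WRITE(10) CDB.
// Multi-byte SCSI CDB fields are big-endian.
static Rw10Info DecodeRw10(const uint8_t cdb[10]) {
    Rw10Info info;
    info.lba = (uint32_t(cdb[2]) << 24) | (uint32_t(cdb[3]) << 16) |
               (uint32_t(cdb[4]) << 8)  |  uint32_t(cdb[5]);
    info.blocks = uint16_t((uint32_t(cdb[7]) << 8) | cdb[8]);
    return info;
}
```

With 512-byte blocks, LBA 2048 corresponds exactly to the 1MB byte offset where the corruption starts, so these two decoded values should let us spot the failing request directly in the log.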

Best Regards,

Charles


(Attachments include the test scripts and full logs for both cp and fio.)

The error occurred when fio attempted to verify the second 1MB data block (starting at offset 1,048,576).

So what IS there? And what should have been? These tests use hashes for verification because it's the quickest way to validate the data, but the question I'm interested in here is what your driver actually "put" on disk.

Part of my concern here is that unless you artificially cut power to the device, we shouldn't have needed any explicit cache sync as part of the copy. The system should have flushed all of its buffers to your DEXT as part of unmount, and you should have flushed them to disk shortly after that. If this is really about cache flushing, that's the failure case I'd really be concerned* about here, not the individual copy.

*Failure to flush a file loses file data. Failure to flush file system blocks loses volumes.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hi Kevin,

The data corruption issue has been resolved!

Special thanks for asking the key question: "So what IS there?" This prompted us to shift our focus from high-level hash verification to inspecting the raw bytes written to the disk. This investigation revealed that the root cause was not related to cache flushing, but rather a hardware limitation regarding single transfer lengths.

Here is a summary of our findings and the solution:

By using a Python script to verify the data on disk byte-by-byte, we discovered:

  • When macOS coalesced writes into 2MB chunks (as we had previously set maxTransferSize to 128MB), data corruption consistently began exactly at Offset 1MB + 16KB within the command (manifesting as 0x00 or garbage data).
  • The LSI 3108 controller/firmware appears unable to correctly handle a single SCSI command with a data length exceeding 1MB in the DEXT environment.

We implemented a two-layer fix:

  • In UserGetDMASpecification, we explicitly set maxTransferSize to 1MB (1,048,576 bytes). This forces macOS to split large I/O requests into smaller chunks that the hardware can safely digest.
  • To align with SCSI best practices and ensure maximum stability, we implemented logic within the driver to further split these 1MB buffers into multiple 64KB SGL Segments when populating hardware descriptors.
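The second layer of the fix can be illustrated with a simple splitter. This is a sketch that assumes a flat, physically contiguous range for readability; our real code walks the DMA segments supplied by the framework rather than a raw base/length pair:

```cpp
#include <cstdint>
#include <vector>

struct SglSegment {
    uint64_t address;  // segment start address
    uint32_t length;   // segment length in bytes
};

// Split a contiguous [base, base+length) range into SGL segments of at
// most 64 KB each, as described above.
static std::vector<SglSegment> SplitInto64KSegments(uint64_t base,
                                                    uint64_t length) {
    constexpr uint64_t kSegmentSize = 64 * 1024;
    std::vector<SglSegment> sgl;
    for (uint64_t offset = 0; offset < length; offset += kSegmentSize) {
        uint64_t chunk = (length - offset < kSegmentSize) ? (length - offset)
                                                          : kSegmentSize;
        sgl.push_back({base + offset, uint32_t(chunk)});
    }
    return sgl;
}
```

A 1MB transfer therefore becomes 16 descriptors of 64KB each, comfortably inside the 129-segment hardware limit.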

With these fixes applied, all previously failing test scenarios (including cp + sync and fio, with file sizes ranging from 256MB up to 5GB) now PASS 100%. The checksums match perfectly.

Thanks again for your guidance — it pointed us directly to the core issue!

Best Regards,

Charles

Special thanks for asking the key question: "So what’s there?" This prompted us to shift our focus from high-level hash verification to inspecting the raw bytes written to the disk. This investigation revealed that the root cause was not related to cache flushing, but rather a hardware limitation regarding single transfer lengths.

Good, I'm glad that was helpful. One of the things I've learned in DTS is that it's very easy for a bug investigation to be derailed by jumping straight to "what went wrong" without really having looked closely at "what actually happened". We're so used to our code "doing what we think it does" that we skip right past the possibility that it simply ISN'T doing what we think it is.

With these fixes applied, all previously failing test scenarios (including cp + sync and fio, with file sizes ranging from 256MB up to 5GB) now PASS 100%. The checksums match perfectly.

Fabulous and congratulations!

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hi Kevin,

Regarding your previous analysis of the KEXT configuration, specifically where you pointed out in this post that IOMaximumByteCount is actually 0x80000 (512KB).

We observed the runtime logs of the legacy KEXT, and the results align perfectly with your observation. Even when copying a 2GB file, the KEXT actually splits the request into 512KB (0x80000) chunks per transfer.

[ProcessParallelTask] [KEXT_CHECK] WRITE (0x2a) | T:0 L:0 | Size: 524288 bytes (512 KB)

This confirms that our current 1MB limit in the DEXT actually exceeds the behavior of the legacy driver. Since the current setting has passed all data integrity verifications, we consider the transfer size issue resolved and will not pursue larger transfer sizes at this stage.

As per your recommendation, we are currently implementing the UserProcessBundledParallelTasks architecture to strictly follow DriverKit best practices and optimize performance (it is currently causing the DEXT to crash, so we are in the middle of debugging it).

Thanks again for your precise insights!

Best Regards,

Charles

Hi Kevin,

Apologies, but it seems our previous celebration was premature. I must retract the conclusion from my last post regarding the issue being resolved.

Here is the current situation:

While our fixes regarding maxTransferSize (1MB) and SGL splitting have indeed achieved a 100% pass rate for CLI tools (cp and fio) across all file sizes, we have discovered that Finder still fails with Error -36 during file copies.

Based on our latest findings, I would appreciate your insights on the following:

  • CLI (cp, fio): Read/Write operations are 100% successful, even with large files (e.g., 4GB).
  • Finder:
    • Copying "Pre-existing files" (Cold Data): Success for files < 64MB.
    • Copying "Freshly created files" (Dirty Data/Hot Cache): Immediate failure (Error -36) for files > 1MB.

We suspected that "freshly created files" were highly fragmented in memory, causing the DEXT to construct a Scatter/Gather List (SGL) that exceeded our hardware limit (129 Segments).

To verify this, we implemented "Trap Logs" inside the DEXT to monitor the SGCount. The logs proved this hypothesis wrong. Finder is not sending requests that exceed the limit.

At the exact moment Finder reports the error, the DEXT is receiving a series of extremely small and clean requests, not fragmented large I/O:

  • Request sizes are mostly 4KB (0x1000) or 16KB (0x4000).
  • The SGCount for each request is only 0 or 1.
  • This is far below our hardware limit of 129.

This leaves us with a confusing contradiction:

  • CLI (fio/cp): Sends larger chunks (e.g., 512KB), SGCount = 1 (likely due to contiguous allocation by fio). Result: Success.
  • Finder: Sends tiny chunks (4KB/16KB), SGCount = 1. Result: Failure (Error -36).

To rule out configuration variables, we have set the DEXT's UserReportHBAConstraints to match our stable Legacy KEXT exactly:

  • IOMaximumByteCount = 512 KB (0x80000)
  • IOMaximumSegmentCount = 129

Since the logs prove that Finder's requests fully comply with our reported Constraints, why would DriverKit or macOS determine a failure?

Does Finder's I/O behavior trigger specific SCSI command sequences or timing issues that CLI tools do not?

We observed that Finder issues multiple TEST UNIT READY and SYNCHRONIZE CACHE commands immediately before the write failure.

Does this imply that some error handling mechanism is being triggered prematurely?

Thank you again for your time and assistance.

Best Regards,

Charles
