Architectural Performance Difference in External Storage I/O Between Intel and Apple Silicon Macs

Hi everyone,

We are in the process of migrating a legacy KEXT for our external multi-disk RAID enclosure to the modern DriverKit framework. During performance validation of the KEXT, we observed a large and consistent difference in maximum throughput between Intel-based Macs and Apple Silicon Macs. We would like to share our findings and hear from others in the community who have had similar experiences that could confirm or correct our understanding.

The Observation: A Consistent Performance Gap

When using the exact same external RAID hardware (an 8-HDD RAID 5 array), driven by our mature KEXT, we see the following results in high-throughput benchmarks (AJA System Test, large sequential writes):

  • On a 2020 Intel-based Mac: We consistently achieve a throughput of ~2500 MB/s.
  • On modern M-series Macs (from M1 to M4): The throughput is consistently capped at ~1500 MB/s.

This performance difference of nearly 40% is significant and is present across the entire Apple Silicon product line.

Our Hypothesis: A Shift in Architectural Design Philosophy

Since the KEXT and external hardware are identical in both tests, we believe this performance difference is not a bug but a fundamental platform architecture distinction. Our hypothesis is as follows:

1. The Intel Mac Era ("Dedicated Throughput") The Intel-based Macs we tested use a dedicated, discrete Intel Thunderbolt controller chip. This chip has its own dedicated PCIe lanes and resources, and its design appears to be singularly focused on maximizing raw, sustained data throughput for external peripherals.

2. The Apple Silicon Era ("Integrated Efficiency") In contrast, M-series Macs use a deeply integrated I/O controller inside the SoC. This controller must share resources, such as the total unified memory bandwidth and the chip's overall power budget, with all other functional units (CPU, GPU, etc.).

We speculate that the design priority for this integrated I/O controller has shifted from "maximizing single-task raw throughput" to "maximizing overall system efficiency, multi-task responsiveness, and low latency." As a result, in a pure, single-task storage benchmark, its performance ceiling may be lower than that of the older, dedicated-chip architecture.

Our Question to the Community:

Is our understanding correct? Have other developers of high-performance storage drivers or peripherals also observed a similar performance ceiling for external storage on Apple Silicon Macs, when compared to high-end Intel Macs?

We believe that understanding this as a deliberate architectural trade-off is crucial for setting realistic performance targets for our DEXT. Our current goal has been adjusted to have our DEXT match the KEXT's ~1500 MB/s on the M-series platform.

Any insights, confirmations, or corrections from the community or Apple engineers would be greatly appreciated.

Thank you very much!

Charles

Is our understanding correct? Have other developers of high-performance storage drivers or peripherals also observed a similar performance ceiling for external storage on Apple Silicon Macs, when compared to high-end Intel Macs?

So, I have two responses to this:

  1. If you haven't already, please file a bug on this and post the bug number back here. I'd like to discuss this with the engineering team, and it's helpful to have an external bug to anchor that conversation.

  2. If you're using the original UserProcessParallelTask architecture, then switching to UserProcessBundledParallelTasks will probably provide significant performance benefits.
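
To make the difference concrete, here's a minimal, schematic sketch of why bundling helps. The names in it (Task, CrossIntoKernelOnce, SubmitToHardware) are hypothetical placeholders, not the actual SCSIControllerDriverKit entry points; the only point it illustrates is that the per-task style pays one user/kernel crossing per I/O, while the bundled style amortizes that crossing over a whole batch.

```cpp
// Schematic sketch only: Task, CrossIntoKernelOnce, and SubmitToHardware are
// hypothetical stand-ins, not the real SCSIControllerDriverKit API. The point
// being illustrated is IPC amortization, not the exact entry points.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Task { uint64_t lba; uint32_t byteCount; };

static unsigned long gKernelCrossings = 0;            // counts user<->kernel round trips
static void CrossIntoKernelOnce() { ++gKernelCrossings; }
static void SubmitToHardware(const Task &) { /* queue the transfer on the HBA */ }

// "Per-task" style: every I/O pays its own user<->kernel crossing.
static void ProcessTasksIndividually(const std::vector<Task> &tasks)
{
    for (const Task &t : tasks) {
        CrossIntoKernelOnce();
        SubmitToHardware(t);
    }
}

// "Bundled" style: one crossing covers a whole batch of I/Os.
static void ProcessTasksBundled(const std::vector<Task> &tasks)
{
    CrossIntoKernelOnce();
    for (const Task &t : tasks) {
        SubmitToHardware(t);
    }
}

int main()
{
    std::vector<Task> batch(32, Task{0, 1u << 20});    // 32 x 1 MiB I/Os

    gKernelCrossings = 0;
    ProcessTasksIndividually(batch);
    std::printf("per-task style: %lu crossings\n", gKernelCrossings);

    gKernelCrossings = 0;
    ProcessTasksBundled(batch);
    std::printf("bundled style:  %lu crossings\n", gKernelCrossings);
    return 0;
}
```

At ~1500 MB/s with 1 MB I/Os, that's on the order of 1500 crossings per second if every task travels alone, so the amortization is not a small effect.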

Finally, one other comment here:

Since the KEXT and external hardware are identical in both tests, we believe this performance difference is not a bug but a fundamental platform architecture distinction.

One thing I'm not sure about here is how the configuration described by UserReportHBAConstraints and UserGetDMASpecification affects overall performance. Conceptually, I think there are two "styles" of DMA management in common use:

  1. The "legacy" flow, which treats wired memory as a scarce resource and tries to (relatively) minimize the number of wired bytes.

  2. The "modern" flow, which assumes the presence of the DART and relies on being able to basically wires arbitrarily large chunks of memory.

Realistically, your KEXT is almost certainly written around #1. I don't think the advantages of #2 were all that obvious when the 64-bit transition first occurred and, more importantly, I'm not sure how big those advantages actually were under Intel. However, my intuition is that #2 has a MUCH larger impact on ARM than it did on Intel.
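
For what it's worth, here's a rough sketch of the structural difference between the two flows. WireForDMA/UnwireForDMA are hypothetical stand-ins for whatever your driver does around memory preparation and segment generation; the contrast that matters is whether that work sits on the per-I/O hot path or happens once at setup.

```cpp
// Conceptual sketch only -- WireForDMA/UnwireForDMA/IssueTransfer are
// hypothetical placeholders for your driver's actual wiring and mapping calls.
// The contrast being shown is *when* memory gets wired, not how.
#include <cstddef>
#include <vector>

struct IOBuffer { void *vaddr = nullptr; size_t length = 0; };

static void WireForDMA(IOBuffer &)   { /* pin pages, build DMA segments */ }
static void UnwireForDMA(IOBuffer &) { /* release pinned pages */ }
static void IssueTransfer(const IOBuffer &) { /* program the hardware */ }

// Style #1 ("legacy"): treat wired memory as scarce -- wire each I/O's buffer
// just before the transfer and unwire it immediately afterwards. Every I/O
// pays the wire/unwire and mapping cost on the hot path.
static void LegacyFlow(std::vector<IOBuffer> &ioBuffers)
{
    for (IOBuffer &buf : ioBuffers) {
        WireForDMA(buf);
        IssueTransfer(buf);
        UnwireForDMA(buf);
    }
}

// Style #2 ("modern"): assume a DART is present and wire a large pool of
// buffers once at setup, then recycle them. The steady-state I/O path
// contains no wiring or remapping work at all.
static void ModernFlow(std::vector<IOBuffer> &pool)
{
    for (IOBuffer &buf : pool) { WireForDMA(buf); }     // one-time setup
    for (IOBuffer &buf : pool) { IssueTransfer(buf); }  // steady-state I/O path
    for (IOBuffer &buf : pool) { UnwireForDMA(buf); }   // teardown
}

int main()
{
    std::vector<IOBuffer> buffers(64);
    LegacyFlow(buffers);
    ModernFlow(buffers);
    return 0;
}
```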

That leads to here:

Our current goal has been adjusted to have our DEXT match the KEXT's ~1500 MB/s on the M-series platform.

The DEXT transition also introduces its own set of performance bottlenecks, due to the higher IPC cost of DEXT-to-kernel communication. My concern here is that you're actually comparing implementations that are each being artificially slowed by unrelated bottlenecks. That is, legacy DMA management is slowing down your KEXT, while serial (rather than parallel) task processing is slowing down your DEXT.
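
As a back-of-the-envelope illustration of that concern (the numbers below are assumptions chosen for illustration, not measurements of your hardware), a fixed per-I/O IPC cost only hurts you if each task has to wait for it serially; once tasks are overlapped, that cost mostly disappears behind in-flight transfers.

```cpp
// Back-of-the-envelope model with made-up numbers: the 400 us device time and
// 150 us per-I/O IPC cost are illustrative assumptions, not measurements.
// It shows why a fixed per-I/O cost matters far more when I/Os are serialized.
#include <cstdio>

int main()
{
    const double ioSizeMB     = 1.0;     // 1 MB sequential I/Os
    const double deviceTimeUs = 400.0;   // assumed time the array needs per I/O
    const double ipcPerIoUs   = 150.0;   // assumed DEXT<->kernel round-trip cost

    // Serial processing: each I/O waits for device time *plus* IPC overhead.
    double serialMBps = ioSizeMB / ((deviceTimeUs + ipcPerIoUs) / 1e6);

    // Overlapped (parallel/bundled) processing: IPC overhead is hidden behind
    // in-flight transfers, so throughput is limited by device time alone.
    double overlappedMBps = ioSizeMB / (deviceTimeUs / 1e6);

    std::printf("serial:     ~%.0f MB/s\n", serialMBps);      // ~1818 MB/s
    std::printf("overlapped: ~%.0f MB/s\n", overlappedMBps);  // ~2500 MB/s
    return 0;
}
```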

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
