Latency critical DMA read via PCIe

Dear All,

I am currently developing a high-throughput audio system which operates via PCIe tunneled over a USB4 interface. It includes custom FPGA-based hardware and a custom AudioDriverKit driver.

While performing read operations via the hardware DMA (that is, Host to Device transfers), I am noticing sporadic latency spikes in the read transfers. Specifically, 4KB operations (which I assume include MRd + CplD) normally take from 5us to 40us to complete, which is perfectly fine for my case. However, on some rare occasions, they can take up to 400us, which causes overruns. The measurements were taken on the FPGA and include the overall request and transfer time.

While trying to tackle the problem, I'm investigating the possible power saving options and performance constraint methods at my disposal. I currently use the following calls to mitigate the problem:

ChangePowerState(kIOServicePowerCapabilityOn);
SetPowerOverride(true);
RequireMaxBusStall(kIOMaxBusStall25usec);
CreatePMAssertion(kIOServicePMAssertionCPUBit | kIOServicePMAssertionForceFullWakeupBit, &ivars->PMAssertionID, false);

The buffers are currently about 16MB, single segment, 16KB aligned and, of course, "prepared" for DMA.

The system ran for 3 hours without any overrun, but I'm still not fully convinced of its reliability. Could someone comment on this? Are there profiling tools that I can use?

Feel free to ask me for any further details. The test system is a MacBook Pro M2 Pro.

Many Thanks and Best Regards

Francesco

While performing read operations via the hw DMA (that is, a Host to Device transfer), I am noticing sparse latency spikes into the read transfers. Specifically, 4KB operations (which I assume include MRd + CplD) take normally from 5us to 40us to be completed, perfectly fine for my case. However, in some rare occasions, they can end up to 400us, which causes me overruns.

How rare is "rare"? The system is complicated enough that, given enough time/work/complexity, "something" is all but guaranteed to go wrong. If you can narrow the failure down to some set of specific conditions, then a deeper investigation could be useful, but without that context, it's hard to guess about what happened or even whether it was a true problem.
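One way to put a number on "rare" is to summarize the per-transfer latencies captured on the FPGA. The following is only a minimal sketch: `LatencySummary`, `percentilePermille`, and `summarizeLatencies` are hypothetical names, the samples are assumed to be whole microseconds exported from the hardware, and the budget (for example the 40us upper end of the normal range) is an assumption you would choose yourself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Summary of a batch of per-transfer latencies, in microseconds.
struct LatencySummary {
    std::size_t   count;       // number of samples
    std::size_t   overBudget;  // samples strictly above the budget
    double        spikeRate;   // overBudget / count
    std::uint32_t p50Us;       // median (nearest-rank)
    std::uint32_t p999Us;      // 99.9th percentile (nearest-rank)
    std::uint32_t maxUs;       // worst case observed
};

// Nearest-rank percentile on a sorted, non-empty vector.
// The percentile is given in per-mille (500 = median, 999 = 99.9%)
// so that the rank is computed with exact integer arithmetic.
static std::uint32_t percentilePermille(const std::vector<std::uint32_t>& sorted,
                                        std::size_t permille) {
    assert(!sorted.empty() && permille >= 1 && permille <= 1000);
    const std::size_t rank = (permille * sorted.size() + 999) / 1000;  // ceiling
    return sorted[rank - 1];
}

// Takes the samples by value because it sorts its own copy.
LatencySummary summarizeLatencies(std::vector<std::uint32_t> samplesUs,
                                  std::uint32_t budgetUs) {
    assert(!samplesUs.empty());
    std::sort(samplesUs.begin(), samplesUs.end());
    const std::size_t over = static_cast<std::size_t>(
        std::count_if(samplesUs.begin(), samplesUs.end(),
                      [budgetUs](std::uint32_t v) { return v > budgetUs; }));
    const std::size_t n = samplesUs.size();
    return LatencySummary{n,
                          over,
                          static_cast<double>(over) / static_cast<double>(n),
                          percentilePermille(samplesUs, 500),
                          percentilePermille(samplesUs, 999),
                          samplesUs.back()};
}
```

For instance, calling `summarizeLatencies({5, 10, 20, 40, 400}, 40)` reports one over-budget sample out of five, a median of 20us, and a worst case of 400us. Recording the time of each over-budget sample alongside this summary may help correlate spikes with specific system conditions.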

Having said that, the "4KB operations" did jump out at me. Is that your hardware's normal work unit? Are you specifically preparing 4KB "chunks" as independent memory operations? If you are, then you might try operating on 16KB chunks, as that's the system's natural page size, and sub-page mapping is more complicated for the DART to manage.
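To check how a given transfer lines up with 16KB pages, a few lines of address arithmetic are enough. This is a hedged sketch of that arithmetic only, not of anything the DART does internally: `kPageSize`, `isPageGranular`, `pagesTouched`, and `roundUpToPage` are hypothetical helper names, and offsets are assumed to be measured from the start of a 16KB-aligned buffer.

```cpp
#include <cstdint>

// 16KB page size, as mentioned in the reply above.
constexpr std::uint64_t kPageSize = 16 * 1024;

// True if the range starts on a page boundary and covers whole pages only.
constexpr bool isPageGranular(std::uint64_t offset, std::uint64_t length) {
    return length != 0 && offset % kPageSize == 0 && length % kPageSize == 0;
}

// Number of 16KB pages that the byte range [offset, offset + length) overlaps.
constexpr std::uint64_t pagesTouched(std::uint64_t offset, std::uint64_t length) {
    return length == 0
               ? 0
               : (offset + length - 1) / kPageSize - offset / kPageSize + 1;
}

// Smallest multiple of the page size that is >= length.
constexpr std::uint64_t roundUpToPage(std::uint64_t length) {
    return (length + kPageSize - 1) / kPageSize * kPageSize;
}
```

With these helpers, a 4KB read at offset 4096 is not page granular and touches one page, a 4KB read at offset 14336 straddles two pages, and `roundUpToPage(4096)` gives 16384, the chunk size suggested above.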

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
