Latency critical DMA read via PCIe

Question

Created Jun ’26

Replies 9

Participants 2

Dear All,

I am currently developing a high throughput audio system which operates via PCIe tunneled into a USB4 interface. This includes custom FPGA-based hardware and a custom AudioDriverKit driver.

While performing read operations via the hardware DMA (that is, a Host to Device transfer), I am noticing sparse latency spikes in the read transfers. Specifically, 4KB operations (which I assume include MRd + CplD) normally take from 5us to 40us to complete, which is perfectly fine for my case. However, on some rare occasions, they can take up to 400us, which causes overruns. The measurements have been carried out from the FPGA and they include the overall request and transfer time.

While trying to tackle the problem, I'm investigating the possible power saving options and performance constraint methods at my disposal. I currently use these methods to mitigate the problem.

ChangePowerState(kIOServicePowerCapabilityOn);
SetPowerOverride(true);
RequireMaxBusStall(kIOMaxBusStall25usec);
CreatePMAssertion(kIOServicePMAssertionCPUBit | kIOServicePMAssertionForceFullWakeupBit, &ivars->PMAssertionID, false);

The buffers are currently about 16MB, single segment, 16KB aligned and, of course, "prepared" for DMA.

The system ran for 3 hours without any overrun, but I'm still not fully convinced about its reliability. Could someone comment on this? Are there profiling tools that I can use?

Feel free to ask me for any further detail. The testing system is a MacBook Pro M2 Pro.

Many Thanks and Best Regards

Francesco

Answer 1

DTS Engineer OP

Apple

Jun ’26

While performing read operations via the hw DMA (that is, a Host to Device transfer), I am noticing sparse latency spikes into the read transfers. Specifically, 4KB operations (which I assume include MRd + CplD) take normally from 5us to 40us to be completed, perfectly fine for my case. However, on some rare occasions, they can end up to 400us, which causes me overruns.

How rare is "rare"? The system is complicated enough that, given enough time/work/complexity, "something" is all but guaranteed to go wrong. If you can narrow the failure down to some set of specific conditions, then a deeper investigation could be useful, but without that context, it's hard to guess about what happened or even whether it was a true problem.

Having said that, the "4KB operations" did jump out at me. Is that your hardware's normal work unit? Are you specifically preparing 4KB "chunks" as independent memory operations? If you are, then you might try operating on 16KB chunks, as that's the system’s natural page size, and sub-page mapping is more complicated for the DART to manage.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 2

frankcesco OP

Jun ’26

Thanks for the reply, Kevin. My apologies for the overly qualitative info. The device prototype has only just been set up and I don't have good enough statistics yet. For now I would like to ensure that all the proper driver technologies have been put in place, and I will then start a long-run session.

Audio Buffers

Let me provide more detail about the system and the tests carried out so far. I will present the methods concerning the D2H path only (the one affected by the latency spike). The write one is, in any case, completely equivalent.

Buffer allocation (in audio device init):

```
OSSharedPtr<IOBufferMemoryDescriptor>   m_input_io_ring_buffer; // held in ivars

IOBufferMemoryDescriptor::Create(kIOMemoryDirectionIn, buffer_size_bytes, 0x4000, ivars->m_input_io_ring_buffer.attach());
```

Buffer memory mapping (in audio device StartIO):

__block OSSharedPtr<IOMemoryDescriptor> input_iomd;

input_iomd->CreateMapping(0, 0, 0, 0, 0, ivars->m_input_memory_map.attach());

In all tests, a buffer of 16384 audio samples has been used. The total size depends on how many channels are interleaved. In particular, I tested a system with 16, 64 and 256 I/O audio channels, 48kHz, 32-bit integer format.
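As a rough sketch of the arithmetic behind these sizes (the helper names are illustrative and do not come from the driver), the ring buffer size in bytes follows from the frame count, the channel count, and the 4-byte sample width. With 256 channels this gives the roughly 16MB figure mentioned in the first post.

```cpp
#include <cstdint>

// Illustrative helper (not from the driver): bytes needed for an interleaved
// ring buffer holding `frames` sample frames.
constexpr std::uint64_t ring_buffer_bytes(std::uint64_t frames,
                                          std::uint64_t channels,
                                          std::uint64_t bytes_per_sample)
{
    return frames * channels * bytes_per_sample;
}

// True when `bytes` is a whole number of 16 KB pages (0x4000 is the
// alignment passed to IOBufferMemoryDescriptor::Create above).
constexpr bool is_whole_pages(std::uint64_t bytes)
{
    return bytes % 0x4000 == 0;
}

// 16384 frames of 32-bit (4-byte) samples for the three tested channel counts.
static_assert(ring_buffer_bytes(16384, 16, 4) == 1ULL * 1024 * 1024, "16 ch -> 1 MiB");
static_assert(ring_buffer_bytes(16384, 64, 4) == 4ULL * 1024 * 1024, "64 ch -> 4 MiB");
static_assert(ring_buffer_bytes(16384, 256, 4) == 16ULL * 1024 * 1024, "256 ch -> 16 MiB");
static_assert(is_whole_pages(ring_buffer_bytes(16384, 16, 4)), "whole pages");
```

Because the functions are constexpr, the checks run at compile time and cost nothing at run time.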

DMA Buffer Preparation


D2HSegmentsN = 1 // Single segment forced (so far)

IODMACommand::Create(ivars->pciDevice, kIODMACommandCreateNoOptions, &dmaSpecification, &dmaCommandD2H);

dmaCommandD2H->PrepareForDMA(kIODMACommandPrepareForDMANoOptions, D2H_memory_buffer_descriptor, 0, virtualD2HSegment.length, &mem_direction_flags, &D2HSegmentsN, physicalD2HSegment);

PCIe Device

I followed the same procedure presented in the official Apple video for DMA bus mastering ("Modernize PCI and SCSI drivers with DriverKit").

// Enable memory space access and bus mastering for DMA
ivars->pciDevice->ConfigurationRead16(kIOPCIConfigurationOffsetCommand, &commandReg);
commandReg |= (kIOPCICommandBusMaster | kIOPCICommandMemorySpace);
ivars->pciDevice->ConfigurationWrite16(kIOPCIConfigurationOffsetCommand, commandReg);

Performed Tests

1. Very first test. No actions for CPU/DART/PCIe power management (all default), 16 channels, a single DMA burst for every audio sample (a 20.8us deadline), that is 64 bytes (very inefficient). Frequent deadline misses (about 1 per minute) in the read operation. This is predictable, since the baseline normally takes about 20-25us -> approach abandoned.
2. Burst increased to 8 audio samples (a 167us deadline) with 16 interleaved channels (512 bytes). Better stability in operation (the read baseline is still about 10 to 40us). However, about once every 30 minutes I noticed a spike in the read exceeding the deadline -> host underrun (bad).
3. Same burst morphology, but with power management and bus characteristic constraints applied. In particular:

pciDevice->EnablePCIPowerManagement(kPCIPMCSPowerStateD0);

pciDevice->SetASPMState(kIOPCILinkControlASPMBitsDisabled);

//This looks very critical <<<<-------
RequireMaxBusStall(kIOMaxBusStall25usec);

plus, into Info.plist:

IOPCITunnelL1Enable NO
IOPMPCISleepLinkDisable NO 
IOPMPCIConfigSpaceVolatile NO
IOPCIRetrainLinkWake YES

Now things are much better, and read deadline misses occurred only about 3 times in a 12-hour test.
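The burst sizes and deadlines quoted in these tests can be reproduced with a short sketch. The helper names are illustrative, and it assumes 4-byte samples at 48kHz as stated earlier.

```cpp
#include <cstdint>

// Illustrative helper: bytes moved by one DMA burst of `frames` sample frames.
constexpr std::uint64_t burst_bytes(std::uint64_t frames,
                                    std::uint64_t channels,
                                    std::uint64_t bytes_per_sample)
{
    return frames * channels * bytes_per_sample;
}

// Time available to complete one burst before the next one is due, in µs.
constexpr double burst_deadline_us(std::uint64_t frames, double sample_rate_hz)
{
    return 1.0e6 * static_cast<double>(frames) / sample_rate_hz;
}

// Test 1: one frame of 16 channels per burst.
static_assert(burst_bytes(1, 16, 4) == 64u, "64 bytes per burst");
static_assert(burst_deadline_us(1, 48000.0) > 20.8 &&
              burst_deadline_us(1, 48000.0) < 20.9, "about 20.8 us");

// Tests 2 and 3: eight frames of 16 channels per burst.
static_assert(burst_bytes(8, 16, 4) == 512u, "512 bytes per burst");
static_assert(burst_deadline_us(8, 48000.0) > 166.6 &&
              burst_deadline_us(8, 48000.0) < 166.7, "about 167 us");
```

With a read baseline of 10 to 40us, test 1 leaves almost no margin against its 20.8us deadline, while the 8-frame burst leaves more than 120us.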

Carried away by my enthusiasm, I tried an extreme test with 256 channels. The burst was 8 or 4 samples, which corresponds to 8KB or 4KB. The outcome seems very similar to case 3. But I’d like to eliminate the possibility of deadline misses entirely, so I went further in investigating power features etc. I ended up adding these requirements before the audio IO operation starts:

ChangePowerState(kIOServicePowerCapabilityOn);
SetPowerOverride(true);
CreatePMAssertion(kIOServicePMAssertionCPUBit | kIOServicePMAssertionForceFullWakeupBit, &ivars->PMAssertionID, false);

After this, over several days, I did not notice any relevant event, and my question is whether the problem has really been solved completely. I should probably try commenting out the called methods one by one to check which one is the game changer. Am I doing anything stupid? Are some of these methods redundant (probably yes)? Are there other relevant methods I'm missing, or some profiling tools on the host system which I can use to track the system in the long term?

All the cited measurements have been carried out by the FPGA itself, so they are reliable in terms of precision.

Concerning your point about the 16KB, I know this is the page size, and I can try to ask my DMA to produce such a burst. However, if I remember correctly, PCIe allows bursts of 4KB maximum, so I don't know if this will help. I can try. It is worth studying further whether such a large request can be issued in a single MRd, or whether a division into sub-chunks is unavoidable.

Thank you very much

Answer 3

DTS Engineer OP

Apple

Jun ’26

Hi,

Concerning your point of the 16KB, I know this is the page size. I can try to ask my DMA to produce such a burst. However, if I remember correctly, PCIe allows bursts of 4KB maximum, so I don't know if this will help. I can try. It’s worth studying better if such a large request can be asked in a MRd, or a division into sub-chunks is unavoidable.

So, the issue here isn't about the PCI bus, it's about how the DART manages mappings. The DART can do sub-page size mapping, but it's generally "easier" and faster when you're working in full page increments. That is, you're better off using 1 16KB page vs 4 4KB pages, even though the final result is exactly the same.

Note that this ISN'T really about what you actually send to your PCI card. If you need to work in 4KB chunks (or any other weird size for that matter), then you can take that 16KB page and use simple math to subdivide the physical offsets. This post is an example of how the performance can vary.
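A minimal sketch of that "simple math", assuming a single prepared segment whose base address and length are the values reported by PrepareForDMA. The helper names and the base address used in the checks are hypothetical.

```cpp
#include <cstdint>

constexpr std::uint64_t kPageBytes  = 16 * 1024;  // system page size
constexpr std::uint64_t kChunkBytes = 4 * 1024;   // hardware work unit

// Number of whole chunks that fit in a prepared segment.
constexpr std::uint64_t chunks_in_segment(std::uint64_t segment_length,
                                          std::uint64_t chunk_bytes)
{
    return segment_length / chunk_bytes;
}

// Device-visible address of chunk `index` inside the segment.
constexpr std::uint64_t chunk_address(std::uint64_t segment_base,
                                      std::uint64_t index,
                                      std::uint64_t chunk_bytes)
{
    return segment_base + index * chunk_bytes;
}

// Hypothetical 16 KB-aligned base address, used only for illustration.
constexpr std::uint64_t kExampleBase = 0x80000000ULL;

static_assert(chunks_in_segment(kPageBytes, kChunkBytes) == 4u, "4 chunks per page");
static_assert(chunk_address(kExampleBase, 0, kChunkBytes) == 0x80000000ULL, "chunk 0");
static_assert(chunk_address(kExampleBase, 3, kChunkBytes) == 0x80003000ULL, "chunk 3");
```

The memory is still prepared once, in whole pages. Only the addresses programmed into the device are subdivided.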

D2HSegmentsN = 1 // Single segment forced (so far)

No, not just so far. IODMACommand.PrepareForDMA() returns a segment count and an array of segments; however, that detail is effectively a vestigial appendage, not a useful feature. I have a post that describes what's going on in more detail here, but you're only going to ever get "1" segment back. I'd actually recommend that you check that segmentsCount==1 and simply terminate your driver if you get anything else, as a different value would imply significant enough architectural changes that "blindly" continuing is unwise.

Note that the underlying behavior here is effectively a fundamental side feature of the DART, not an accident or DriverKit-specific feature. Even within the kernel, it's not entirely clear to me how you'd get IODMACommand to generate multiple segments, and that's with a much broader set of memory descriptor functionality than DriverKit exposes.

Followed the same procedure presented in the official Apple video for DMA bus mastering ("Modernize PCI and SCSI drivers with DriverKit").

That's a good reference, particularly since SCSIControllerDriverKit is basically designed around subdividing I/O buffers. However, I'll also note that the IOPCIFamily's implementation (including DriverKit) is open source, so there is another resource you may find useful.

Carried away by my enthusiasm, I tried an extreme test with 256 channels. The burst was of 8 or 4 samples, which indeed corresponds to 8KB or 4KB. The outcome seems very similar to case 3. But I’d like to eliminate the possibility of deadline misses entirely.

So, the critical factors here are:

1. How fast you're attempting to perform operations.
2. How much data you're trying to transfer.

...but all of the transfers you're describing are sufficiently small that #1 is the primary factor. Note the dynamic here:

The burst was of 8 or 4 samples, which indeed corresponds to 8KB or 4KB. The outcome seems very similar to case 3.

...is exactly what I'd expect. That is, the bus has sufficient bandwidth that I'd expect the behavior to be indistinguishable all the way from 64 bytes -> 16+ KB. That entire range is basically "almost nothing". You can see that same dynamic in the storage post I mentioned earlier— at "bulk" scale, 16KB transfers were faster than 4KB transfers because the actual "time on PCI bus" was the same, but the DART was slower with smaller transfers.

Moving to here:

My question is if the problem has been really solved completely.

The word "completely" here is tricky. macOS is not designed around truly "guaranteed" I/O time, which means it's basically "always" possible to create circumstances where SOME kind of disruption will occur. As the most obvious example, it's hard to guarantee your transfer will occur in time if/when I pile enough other "stuff" on to the same bus. More practically, the "weak" link here tends to be shunting data to and from user space, not the PCI bus or your driver. There's not a lot your driver can do if user space ends up stalled for several minutes.

That reality is why things like real-time threads exist; however, those have their own limits as well. The real-time thread can and will continue firing exactly on schedule, but you'll still lose data if/when it can't shunt data off that thread... because the VM shortage that's stalling the system is exactly the same reason it can't allocate memory.

The practical answer here is to ensure your transfer cadence is long enough that the system can reliably service that cadence. I can't provide hard numbers for that (audio is not my core area and there isn't really a fixed value), but the general guidance is that the less often you transfer data, the better it is.

I should probably try to comment the called method one by one and check what is the game changer. Am I doing some stupidities?

I don't see any obvious issue. Even the issues around preparing memory and the DART are less of an issue if you're reusing memory (which is how audio is typically handled).

Are there other relevant methods I'm missing or some profile tools from the host system which I can use to track the system in the long term?

The default answer here is Instruments, though I'll admit that I haven't actually used it all that much with DriverKit. However, it should be able to show you your interrupt cadence, as well as what other activity is occurring around that window.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 4

frankcesco OP

Jun ’26

Dear Kevin,

I had the opportunity to upgrade my DMA to allow bursting 16KB at each read/write operation on the mentioned, 16KB aligned, buffer. This translates to transfers of 16 samples, 256 audio channels (32bit).

It is not trivial to separate the performance improvement due to the larger burst size (it was 4 or 8 samples earlier) from that of the lucky 16KB number itself, but it seems that the overall read operation time did not change (~30 to 50 us) even though the data amount is now doubled, so I'm happy with the result anyway. MRd to CplD takes about 15-40 us and the data transfer itself 5-10 us. The figure shows the internal AXI transactions (write: yellow, read: cyan), which then become PCIe TLPs, under the mentioned conditions.

I successfully ran the prototype for 12 hours and everything worked like a charm. Then I decided to try another USB-C port on the MacBook, just to exclude possible routing performance discrepancies. So I moved from the single one on the right (close to the HDMI) to the left one, close to the MagSafe. Things went fine for 20 minutes, even though read times were noticeably longer (~40-60us), but then I unfortunately got a 350us spike in the read, which caused an over/underrun. I repeated the test and another one occurred. In the third test, baseline times became shorter and no problem occurred for the next 2 hours. I also tried a reboot, but still saw the same good behavior, as if a self-training mechanism were operating under the hood. I have not been able to reproduce the problem so far.

Thinking a bit about the issue, I noticed that my driver probably includes measures against aggressive CPU and PCIe power management, but nothing covering the USB4 layer at all. Do you have any guidance on this?

Replying to the remaining questions: I clearly see that pushing my device throughput further will force me to come to terms with the user client layer and above. From one point of view, extrapolating some overall numbers could be considered part of the experiment. On the other hand, everything on top of the driver is more the job of the CoreAudio / HAL people. I think that trying to push the performance of my own system (driver/hardware) to the limit is still worthwhile. Applications can eventually go beyond audio itself. Having said that, my target is just the configuration and specs that I already mentioned, no more for the moment.

As for the non-hard-realtime nature of the OS, that is completely clear. I know that the exception, or deadline miss, is just around the corner. Currently, my system is protected against a single deadline miss. If, for instance, the read is not completed within the deadline, the new read remains pending and is completed as soon as the first one finishes. This is OK, but it cannot help when more than one period is exceeded. I will implement protections which will eventually just skip packets and resume a clean stream in catastrophic conditions. But this will, and has to, be implemented at a later stage. For now, let's just try to tackle something that is relatively uncommon, but definitely not extraordinary.

I can provide any other info, test result, if required.

Thank you very much for your time and support.

Answer 5

DTS Engineer OP

Apple

Jun ’26

I had the opportunity to upgrade my DMA to allow bursting 16KB at each read/write operation on the mentioned, 16KB aligned, buffer. This translates to transfers of 16 samples, 256 audio channels (32-bit).

It is not trivial to isolate performance improvement due to the larger bunch size (it was 4 or 8 samples earlier) from the lucky 16KB number itself, but it seems that the overall read operation time did not change (~30 to 50 us) even if the data amount is now doubled, so I'm happy with the result anyway.

That may sound strange, but the rough math says that's only ~300 MB/s, which isn't a lot of data on Thunderbolt. However, I'm also not sure why you need to be doing ~20,000 reads/s.
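The rough math can be sketched as follows, assuming one 16KB read completes about every 50us (the upper end of the range quoted above). The helper names are illustrative.

```cpp
// Illustrative helpers for the back-of-the-envelope numbers above.

// Operations per second when one operation happens every `period_us`.
constexpr double ops_per_second(double period_us)
{
    return 1.0e6 / period_us;
}

// Bytes per microsecond is numerically equal to MB/s (10^6 bytes per second).
constexpr double throughput_mb_per_s(double bytes_per_op, double period_us)
{
    return bytes_per_op / period_us;
}

// One 16 KB read every ~50 us.
static_assert(ops_per_second(50.0) == 20000.0, "20,000 reads per second");
static_assert(throughput_mb_per_s(16384.0, 50.0) > 327.0 &&
              throughput_mb_per_s(16384.0, 50.0) < 328.0, "about 328 MB/s");
```

At the faster end of the quoted range (30us per read) the same formulas give roughly 546 MB/s and about 33,000 reads per second, so the order of magnitude does not change.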

Thinking a bit about the issue, I noticed that I probably have included in my driver measures against aggressive CPU and PCIe power management, but not covering the USB4 layer at all. Have you some indication about this?

So, quick clarification here. USB-C has made the world an exceedingly interesting place because it's basically invented a plug specification that's totally independent of its parent "bus" implementation. Thunderbolt over USB-C is still just... Thunderbolt. That is, the first thing that happened when the USB-C connector was plugged in was that everyone agreed to "talk Thunderbolt", at which point the entire USB-C spec was ignored and the bus became a Thunderbolt bus.

HOWEVER, it's possible this mattered:

...So I moved from the single one on the right (close to the HDMI) to the left one,

Many of our machines have multiple Thunderbolt buses, and the way Thunderbolt channels are shared with video means that there could be big swings in bandwidth depending on what else is on the bus.

That leads to here:

Replying to the remaining questions. I clearly see that pushing my device throughput further will force me to come to terms with the user client layer and upper. From one point of view, the extrapolation of some overall numbers could be considered part of the experiment. On the other hand, everything on top of the driver is more CoreAudio / HAL people’s job.

The "pipeline" that moves data from user space to your driver is going to be the critical bottleneck here, which is what ultimately pushes devices to move more data using fewer transfers. It's possible that the real-time thread MIGHT be able to keep up with your operation count, but if normal threads can’t, then all that means is that the real-time thread ends up spending all its time subdividing larger buffers that are waiting to be processed. There's no reason for your real-time thread to subdivide the data instead of just sending "all" of it... at which point your driver is now spending its time segmenting and sending. And if it's going to do that... why not just send more data using fewer transfers, since that's the best way to improve both reliability and overall efficiency.

Putting this in more concrete terms, how many operations per second do you actually NEED? As one example, the shortest detectable audio latency is normally put in the range of ~5ms[1], which translates to 200 op/s. That's FAR fewer than what 50 us would imply/require.

[1] And I really do mean "shortest". "Acceptable" audio latency is significantly higher, often MUCH higher.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 6

frankcesco OP

Jun ’26

Thunderbolt over USB-C is still just... Thunderbolt. That is, the first thing that happened when the USB-C connector was plugged in was that everyone agreed to "talk Thunderbolt", at which point the entire USB-C spec was ignored and the bus became a Thunderbolt bus.

Ok, that's interesting. I only want to clarify that I'm using an ASMEDIA ASM2464PDX, which is mentioned as a USB4 device (no Thunderbolt Certification Logo). I was wondering if still some link layer features apply. In practical terms, I noticed:

1. LPM policies in the USB4 controller

ioreg -l -p IOService -w 0 | grep -E "LPMPolicy|USB4LPM"
    | |   | | |   "UsbHostControllerUSB4LPMPolicy" = 1
    | |   | | |   "UsbHostControllerUSB4LPMPolicy" = 1
    | |   | | |   "UsbHostControllerUSB4LPMPolicy" = 1

2. Power-state reporting from

ioreg -l -p IOService -w 0 | grep -A 20 "AppleSynopsysUSB40XHCI" | grep -E "kControllerStat|CurrentPowerState|DevicePowerState|LPMPolicy"

which shows a very low percentage of kPowerStateOn for my device and, remarkably, a CurrentPowerState transition which appears to have taken place in the same time window as the deadline miss.

In particular, 2. is probably not negligible and may be worth further investigation.

That may sound strange, but the rough math says that's only ~300 MB/s, which isn't a lot of data on Thunderbolt.

True. This is, as you mentioned, the result of the overhead of the high number of non-posted read operations. Note, instead, that the "posted" write operations are here in the ~3.8GB/s area, even though the transaction rate is the same.

Now let's talk about audio.

However, I'm also not sure why you need to be doing ~20,000 reads/s.

and

the shortest detectable audio latency is normally put in the range of ~5ms[1], which translates to 200 op/s. That's FAR fewer than what 50 us would imply/require. [1] And I really do mean "shortest". "Acceptable" audio latency is significantly higher, often MUCH higher.

Let's clarify this together with the latency-related discussion. First, the specs that you mentioned are well exceeded by modern audio cards. Many high-end brands achieve <2ms roundtrip latency, so a solution to the problem exists. Is this low latency needed? On several occasions, yes. As a musician, as well as an engineer, I can easily tell whether my sound card is running at 2 or 5 ms during a live performance. 10 ms starts to become annoying if the monitoring / PA system is close to the performer. So, I don’t think the aim here is to question the specifications of our project.

Concerning the scheduling and the HAL pipeline: I see your point. The real-time scheduled driver thread is one thing and the HAL thread is another, but I can guarantee, after many years of experience in the field, that CoreAudio and upper-level pro audio applications succeed in sustaining low latencies even in the range of ~1.5ms (that is, e.g., 64 samples at 48kHz) without glitches. You might say, ‘Not at 256 channels.’ That’s probably true. But with dozens of channels, it certainly is. And the high channel count is down more to routing flexibility than to a need for concurrent use. So my goals are demanding but not sci-fi. Furthermore, I probably was not clear enough, but in all these tests I have, of course, employed the entire HAL layer. In fact, data are checked via a custom user application or in a pro DAW working at HAL buffer sizes of 32 or 64. I have never seen glitches reported except in the mentioned events.

Having said that, I think we all agree that larger, less frequent transfers would reduce read overhead, but I cannot increase them too much for the reasons explained above. I can maybe go up to 32, to try matching the minimum HAL buffer, but not higher than that. What I currently do is just match the safety margin.


#define BUFFER_SAFETY_MARGIN (16)
    SetInputSafetyOffset(BUFFER_SAFETY_MARGIN);
    SetOutputSafetyOffset(BUFFER_SAFETY_MARGIN);

This comes from the fact that such a value should be, in my opinion (which may be wrong), equivalent to the synchronization uncertainty between the HAL and DMA buffer pointers.

Wrapping up the results of our last test experiment: 256 I/O, 16-sample burst @192kHz (a deadline of 83.3us). We report that:

Read takes on average about 42% of the deadline and deadline misses are rare (<1e-10). Write time is negligible and stable, even in the case of a spike -> info which can be important in the investigation!
Deadline misses are caused by unpredictable high read-time spikes (>350us).
Throughput and overall performance comply with the project requirements.
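To put these figures side by side, here is a small sketch of the timing budget (helper names are illustrative): the deadline for a 16-sample burst at 192kHz, the share used by an average read, and how many deadline periods a 350us spike spans.

```cpp
// Illustrative helpers for the timing budget of the 192 kHz test.

// Time available for one burst of `frames` sample frames, in µs.
constexpr double deadline_us(double frames, double sample_rate_hz)
{
    return 1.0e6 * frames / sample_rate_hz;
}

// How many deadline periods an operation of `duration_us` spans.
constexpr double periods_spanned(double duration_us, double deadline)
{
    return duration_us / deadline;
}

constexpr double kDeadline = deadline_us(16.0, 192000.0);  // about 83.3 us
constexpr double kAverageRead = 0.42 * kDeadline;          // about 35 us

static_assert(kDeadline > 83.3 && kDeadline < 83.4, "83.3 us deadline");
static_assert(kAverageRead > 34.9 && kAverageRead < 35.1, "about 35 us per read");

// A 350 us spike covers a little more than four periods at this rate.
static_assert(periods_spanned(350.0, kDeadline) > 4.1 &&
              periods_spanned(350.0, kDeadline) < 4.3, "about 4.2 periods");
```

A spike of that length spans several periods, which is why the single-miss protection described earlier in the thread would not absorb it at this sample rate.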

Given that I state that:

We do not need to deploy actions which increase the throughput.
We cannot just increase the transfer size, due to the latency requirement. Such an increase could not even be engineered properly, given the unpredictable duration of the read spikes (we would have no figure for a presumably safe buffering size; even 1M samples might not be sufficient to tackle the problem with such a methodology).

So, the work focuses on these tasks:

1. Deploy failsafe logic in the FPGA (e.g. skip samples in case of deadline misses).
2. Understand the nature of such a spike (DART, power management etc.) and deploy all the features that macOS provides to avoid / minimise it. I mean, if this system works correctly for 99.999% of the time, there is for sure an Apple engineer who can tell me why in that 0.001% my read takes 10 times the usual time. I'm sure the cause can be found and tackled. This is not a cosmic ray bit-flipping my DDR; it is in some way a system decision.

I'm an expert in 1., but I need help with 2., and I can provide whatever code / measurement you require.

Answer 7

DTS Engineer OP

Apple

Jun ’26

OK, that's interesting. I only want to clarify that I'm using an ASMEDIA ASM2464PDX, which is mentioned as a USB4 device (no Thunderbolt Certification Logo).

Assuming you're talking about this product:

https://developer.apple.com/forums/thread/831333?page=1

"ASM2464PDX is a new generation of USB4/Thunderbolt to PCIe/NVMe Accessory controller based on ASMedia in-house designed PHYs."

...then I believe it's capable of functioning in either mode. More to the point, I believe you're building on IOPCIDevice, which means you're using Thunderbolt, not USB.

LPM policies in the USB4 controller

So, as a general comment, I'm not a huge fan of "ioreg", particularly when you start poking at its contents with tools like "grep". The core problem here is that the IORegistry is a tree of objects communicating with each other, which means the structure of the hierarchy is as important, if not more so, than the individual objects. Interacting with it as pure text tends to obscure that structure, making it easy to misunderstand or confuse what's going on.

I'll sometimes extract the full structure as XML using:

ioreg -la > <output file>

...but my actual preference is to use IORegistryExplorer.app, as I think it does a much better job of conveying what's actually going on. See this forum post for download instructions and general guidance on using it.

In terms of the two specific snippets you posted, there isn't enough context to know what you're looking at, but I don't think it's relevant.

First, the specs that you mentioned are well exceeded by modern audio cards. Many high-end brands achieve <2ms roundtrip latency, so the problem solution exists.

Sure. The number I posted was for "full" latency ("my mouth to your ear"), so the intermediate hardware needs to have a latency below that.

The real-time scheduled driver thread is a thing and HAL thread is another, but I can guarantee after many years of experience in the field, that CoreAudio and upper level pro audio applications succeed in sustaining low latencies even in the range of ~1.5ms (that is e.g. 64 samples at 48kHz) without glitches.

The problem here is that you didn't say "1ms", you said "(~30 to 50 μs)". 1ms is 1000μs. Similarly, the "spike" you're describing here:

Deadline misses cause not predictable high read time spikes (>350us).

...is ~1/4 of the 1.5ms CoreAudio latency number you just quoted. Framing all of this in a different way, what's the actual maximum acceptable latency of your entire "system"? My concern here is that you seem to be trying to run at a frequency that's far faster than the larger system, which is going to unnecessarily make the entire system less reliable.

I can maybe go up to 32, to try matching the minimum HAL buffer, but not higher than that.

Why would you want to be smaller than the minimum HAL buffer?

I mean, if this system works correctly for the 99.999% of the time, there is for sure an Apple Engineer which can tell me why in that 0.001% my read takes 10 times the usual time. I'm sure the cause can be found and tackled.

Let me ask another question first. What else was your system doing during your experiment(s)? My concern is that your answer is going to be "not very much", which means there's a big issue you haven't really considered.

The fundamental problem here is that the larger system doesn't actually offer much in the way of "strong" guarantees around scheduling. The very lowest level hardware interface does allow relatively "narrow" timing and the real-time thread does offer the strongest guarantee the system offers, but across ALL levels of the system the basic goal is "do its best to do as much as possible".

On an idle system, that makes this:

I mean, if this system works correctly for the 99.999% of the time

...pretty trivial. That is, the system isn't "doing anything", so as soon as you give it "something" to do, it immediately does it. That works really great until "something" delays things, causing things like this:

...in that 0.001% my read takes 10 times the usual time

It's possible this is true:

there is for sure an Apple Engineer which can tell me why

...but it isn't me and it's also much harder than you might think. There's an enormous amount happening within the system and, as I noted above, the system isn't really trying to organize its work in a way that provides any kind of strong scheduling guarantee. Adding to the fun, conventional debugging tools like logging and even dtrace can be counterproductive, due to the disruption they introduce.

However, the much bigger issue is overall system load, because whatever issues you're having are VERY likely to become MUCH more common once you start to load the system. The specific cause here doesn't matter all that much if this is going to happen 20x more often once you load the system.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 8

frankcesco OP

Jun ’26

Dear Kevin,

Give me a bit of time to think about all your points. I just want to highlight that the numbers I provided do not change even if, for instance, I run a stress test or Geekbench in the meantime. I also tried using the GPU intensively at the same time. That 30-50us baseline is always there. This seems quite reasonable, since the net DDR bandwidth used by the DMA is very small in comparison to its capability (200MBps vs 200GBps). So I would not be too concerned about the HAL pipeline at the moment. If the load becomes too high, then CoreAudio callbacks or whatever will probably overrun in the user application, but I do not believe this will affect the DMA transactions themselves (or the driver, which does not actually do anything with the data, not even a copy).

The problem here is that you didn't say "1ms", you said "(~30 to 50 μs)". 1ms is 1000μs. Similarly, the "spike" you're describing here:

30-50us is the average read duration, not the deadline. The deadline in my case is set by the 16 samples, which is 83us at 192kHz, 333us at 48kHz, etc. Let's consider instead that we want 1ms, that is 192 samples. But the HAL buffer has a minimum of 32, so that will not work: I would have 32, 64, 96 and 128 unavailable by design. Finally, on the question of why I don't use the same size as the HAL buffer, there are different reasons:

Paradoxically, the HAL buffer size is not available until StartIO is launched. And, even then, its real value is available only via the IO operation handler (in_io_buffer_frame_size). So I would have to tune my DMA transactions while they are already operating on the buffer, which is weird.

    IOUserAudioIOOperation in_io_operation,
    uint32_t in_io_buffer_frame_size,
    uint64_t in_sample_time,
    uint64_t in_host_time)

Can I really use the same size as the HAL buffer while keeping synchronization OK? For this I have to do a bit of homework. Look at these diagnostics I have set up in the console:

default	20:29:37.205137+0200	kernel	DiagnosticTimerOccurred_Impl: Host Out - HW: 48.000000, HW - Host In: 112.000000

So, here you see the HAL pointer minus the DMA pointer (for outputs, that is, DMA reads) and the counterpart for inputs. In this example I set a 64-sample HAL buffer + 16 safety margin. 48 is less than 64, which means that, if my DMA fires a 64-sample burst in the near future, it will probably overrun the HAL pointer -> bad. So I’d like to tread a bit more carefully with such an approach. If possible, I will of course implement it.
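The check described here can be sketched as a ring-buffer distance plus a comparison. This is only an interpretation of the diagnostic above: it assumes "Host Out - HW" is the number of frames the host write position is ahead of the DMA read position, and the function names and example positions are hypothetical.

```cpp
#include <cstdint>

// Frames from position `from` forward to position `to` in a ring buffer of
// `ring_frames` frames. Both positions must be smaller than `ring_frames`.
constexpr std::uint64_t ring_distance(std::uint64_t from,
                                      std::uint64_t to,
                                      std::uint64_t ring_frames)
{
    return (to + ring_frames - from) % ring_frames;
}

// A DMA read burst is safe only if it stays within the frames the host has
// already written ahead of the hardware position.
constexpr bool burst_is_safe(std::uint64_t lead_frames,
                             std::uint64_t burst_frames)
{
    return burst_frames <= lead_frames;
}

// Hypothetical positions that wrap around the end of a 16384-frame ring
// and give the 48-frame lead seen in the log line above.
constexpr std::uint64_t kLead = ring_distance(16380, 44, 16384);

static_assert(kLead == 48u, "host is 48 frames ahead of the hardware");
static_assert(!burst_is_safe(kLead, 64), "a 64-frame burst would overrun");
static_assert(burst_is_safe(kLead, 16), "a 16-frame burst fits");
```

Under this reading, a burst equal to the safety margin (16 frames) fits inside the observed lead, while a burst equal to the HAL buffer size (64 frames) does not.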

...then I believe it's capable of functioning in either mode. More to the point, I believe you're building on IOPCIDevice, which means you're using Thunderbolt, not USB.

Yes, the device is exactly the one you mentioned and, true, I'm building on IOPCIDevice, so what you said makes sense. I just want to be sure that my power management directives are correctly propagated to all parent and lower-level components.

Answer 9

frankcesco OP

Jun ’26

So, I have important news, but let me reply in order.

Preamble

IORegistryExplorer.app

Yes, I have used it intensively. It is, at least, a bit easier for navigating the ioreg tree, even if it sometimes crashes.

I think that, in a certain way, we have finally agreed on the audio latency aspects. We are in an R&D phase, so we are experimenting with latencies lower than what the market offers: 0.5 to 1ms roundtrip. But if you want me to give you a spec, I would say 1.8ms roundtrip.

So, as I already mentioned in my previous reply, I don't think that OS scheduling or overall system load matter to a significant extent in this investigation, and measurements under load prove it. Clearly, the concurrent DDR access will cause jitter in read/write operation, but these seem to be inside the margins that I already considered.

Now, the Plot Twist

I decided to extensively log ioreg changes and console output during the issue occurrences and, completely unexpectedly, I discovered these kinds of events.

kernel  [ACIO2:high_speed_lane.c:289] Gen2/3 link error. lane=0, error=83
kernel  [ACIO2:high_speed_lane.c:289] Gen2/3 link error. lane=1, error=83

I therefore tried to improve my statistics by logging for longer, and I discovered that those kinds of errors always preceded missed read deadlines. No missed deadlines, no link errors, and vice versa. This means that the cause is at a much lower level than we expected. Perhaps, ironically, it is the cursed spectre of my previous job as a signal integrity engineer at CERN.

The question now is: does this happen between the FPGA and the USB4 controller, or between the USB4 controller and the Mac? I don't think I saw this high_speed_lane.c code in the open source IOPCIFamily, but I might be wrong. Do you know anything about it? I also didn't design the PCB myself; that will be done in the near future. So I don't really have control over what was done there, or how well. This was just an AliExpress ADT-UT3G adapter that we adopted between a good PCIe FPGA card and the Mac (and, perhaps not insignificantly ^^, with the cable that came free of charge).

I still need more time to investigate, but I noticed that even abruptly cutting off the FPGA does not produce this kind of error. Instead, disconnecting the USB-C cable does. For the moment, I decided to try running with an older USB3.2 cable, which forces negotiation to 20Gbps instead of 40Gbps while still keeping the PCIe tunnelling alive. With that, I did not see such issues anymore. Hence, the problem was probably due to a signal integrity (SI) issue in the cable or between the USB4 controller and the USB-C connector.

It’s been quite a turbulent story, but perhaps we’ve managed to find some answers. I’d like to summarise the whole strategy I used and include it in this post for future reference.