Lock Contention in APFS/Kernel?

Hello! Some colleagues and I work on Jujutsu, a version control system compatible with git, and I think we've uncovered a potential lock contention bug in either APFS or the Darwin kernel. Four factors lead us to think this is related to APFS or the kernel:

  • jj's testsuite uses nextest, a test runner for Rust that spawns each individual test as a separate process.
  • The testsuite slowed down by a factor of ~5x on macOS after jj started using fsync. The slowdown increases as additional cores are allocated.
  • A similar slowdown did not occur on ext4.
  • Similar performance issues were reported in the past by a former Mercurial maintainer: https://gregoryszorc.com/blog/2018/10/29/global-kernel-locks-in-apfs/.

My friend and colleague André has measured the test suite on an M3 Ultra with both a ramdisk and a traditional SSD and produced this graph:

(The most thorough writeup is the discussion on this pull request.)

I know I should file a feedback/bug report, but before I do, I'm struggling with profiling and finding kernel/APFS frames in my profiles so that I can properly attribute the cause of this apparent lock contention. Naively, I ran xctrace record --template 'Time Profiler' --output output.trace --launch /Users/dbarsky/.cargo/bin/cargo-nextest nextest run, and while that detected all processes spawned by nextest, it didn't record all processes as part of the same inspectable profile and didn't really show any frames from the kernel/APFS; I had to select individual processes. So that I don't waste people's time and so that I can point to a frame/smoking gun in the right system, how can I use Instruments to profile where the kernel and/or APFS are spending their time? Do I need to disable SIP?

There's a System Trace template that is a better starting point than Time Profiler for your needs here.

— Ed Ford, DTS Engineer

Let me start with the comparison here:

Similar performance issues were reported in the past by a former Mercurial maintainer: https://gregoryszorc.com/blog/2018/10/29/global-kernel-locks-in-apfs/.

Unfortunately, there are some fairly significant flaws in the analysis there. Picking one of his data points:

$ time ./slow-readdir.py -l 12 ~/src/firefox
ran 12 walks across 12 processes in 172.209s

real    2m52.470s
user    1m54.053s
sys    23m42.808s

Based on his overall article, I believe he thinks that test took ~23m when in fact it took ~3m. That is, the actual test time was:

...in 172.209s -> ~2m52s ->
real    2m52.470s

In general, that's what "real" means in most time output: "how long did this test actually take".

In terms of the difference in system time, that appears to have been due to an accounting issue in the kernel. Per the investigation that was done on the bug, this particular test script ended up having multiple threads blocked in the kernel (waiting on work to complete) and the kernel included those times in sys time when it probably should not have.

The full details of that investigation aren't something I can really summarize, but the rough summary is that some of the difference was due to fundamental design differences between Linux and macOS (things like the sandbox and how much memory the I/O system was allowed to consume), and some of it was due to increased lock contention at higher core counts. Everything found was largely addressed in macOS 10.14 (Mojave).

Moving to your specific concerns:

A similar slowdown did not occur on ext4.

Unfortunately, comparisons between different systems are extremely tricky due to "fundamental design differences." One of macOS's fundamental design choices is that its goal is to be "responsive," NOT "fast." In other words, the system's ability to quickly react to the user's actions is MORE important than the total time it takes to complete any given operation. In the I/O system, that means restricting I/O memory commitments more than Linux does, both to reserve memory for other purposes and to reduce the chance that the system will have to stall waiting for memory to be flushed to disk.

Next here:

The testsuite slowed down by a factor of ~5x on macOS after jj started using fsync.

A few questions:

  • Are you calling "fsync" or are you doing fcntl(F_FULLFSYNC)? The second call will be significantly slower, but if you need the data to actually reach the disk, then you need to use "F_FULLFSYNC" (see man fsync(2) and fcntl(2) for more details). There's a minimal sketch of the two calls after this list.

  • To some extent, either call will exhibit a certain degree of "lock-like" behavior, both to stabilize the state of the file system at a particular "point" and to prevent the data that's actively being "stabilized" from being modified.

  • If all of these tests are interacting with the same dataset, then that obviously exacerbates the issue.
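For concreteness, here's a minimal sketch of the two calls being compared (a sketch using Rust's libc crate, since that's what jj is written in; this is not jj's actual code, and because F_FULLFSYNC is Apple-only it compiles only on Apple targets):

use std::{fs::File, io, os::fd::AsRawFd};

// fsync(2): asks the kernel to push the file's data toward the drive; on
// macOS the drive may still be holding that data in its own write cache.
fn plain_fsync(file: &File) -> io::Result<()> {
    match unsafe { libc::fsync(file.as_raw_fd()) } {
        -1 => Err(io::Error::last_os_error()),
        _ => Ok(()),
    }
}

// fcntl(F_FULLFSYNC): additionally asks the drive itself to flush its cache
// so the data reaches stable storage. Noticeably slower, but durable.
fn full_fsync(file: &File) -> io::Result<()> {
    match unsafe { libc::fcntl(file.as_raw_fd(), libc::F_FULLFSYNC) } {
        -1 => Err(io::Error::last_os_error()),
        _ => Ok(()),
    }
}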

So that I don't waste people's time and so that I can point to a frame/smoking gun in the right system, how can I use Instruments to profile where the kernel and/or APFS are spending their time? Do I need to disable SIP?

Ed's suggestion to use System Trace is a good one; however, the most useful thing you can do is provide tests we can run ourselves. The reality is that even when a system trace clearly demonstrates "a problem", we generally need to reproduce the problem ourselves so that we can confirm any fix actually does what we think it will.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks y'all! Ed, I'll try running the system trace tomorrow morning (I'm in New York City, so it's the end of the day for me).

Kevin:

Everything found was largely addressed in macOS 10.14 (Mojave).

Ah! Good to know that my information is kinda out of date :)

Unfortunately, comparisons between different systems are extremely tricky due to "fundamental design differences." One of macOS's fundamental design choices is that its goal is to be "responsive," NOT "fast." In other words, the system's ability to quickly react to the user's actions is MORE important than the total time it takes to complete any given operation. In the I/O system, that means restricting I/O memory commitments more than Linux does, both to reserve memory for other purposes and to reduce the chance that the system will have to stall waiting for memory to be flushed to disk.

Understood! Makes a lot of sense given the common, respective usages of Linux vs. macOS!

Are you calling "fsync" or are you doing fcntl(F_FULLFSYNC)? The second call will be significantly slower, but if you need the data to actually reach the disk, then you need to use "F_FULLFSYNC" (see man fsync(2) and fcntl(2) for more details).

I'm not sure if we're using fsync or fcntl(F_FULLFSYNC), but I'll follow up and give an answer tomorrow.

If all of these tests are interacting with the same dataset, then that obviously exacerbates the issue.

The tests shouldn't be interacting with the same logical data set; we're just using tmp as an on-disk location.

however, the most useful thing you can do is provide tests we can run ourselves.

It's not too bad:

  1. I'm guessing y'all have clang/the Xcode command line tools installed, but if you don't, you need those!
  2. Install Rust. I recommend using https://rustup.rs/, which will install the Rust compiler and Cargo at ~/.rustup/ and ~/.cargo. It will also add ~/.cargo/bin to your $PATH.
  3. Clone jj: git clone https://github.com/jj-vcs/jj.git
  4. Install the per-process test runner we use, cargo-nextest, by running cargo install cargo-nextest --locked. cargo-nextest should now be in your $PATH, installed at ~/.cargo/bin/cargo-nextest.
  5. Inside the cloned jj repo, run cargo nextest run --no-fail-fast.

The above should be sufficient to replicate the slowness we've observed, but let me know if you run into any issues.

I'm not sure if we're using fsync or fcntl(F_FULLFSYNC), but I'll follow up and give an answer tomorrow.

Based on the general description, I suspect it’s F_FULLFSYNC. F_FULLFSYNC asks the target device to flush its hardware cache, which both takes longer and has "broader" consequences. That doesn't make it "wrong" (the right choice is "the right choice"), but it does have performance consequences.

The above should be sufficient to replicate the slowness we've observed, but let me know if you run into any issues.

So, the next step here would be to file a bug that includes a system trace of the issue, a sysdiagnose taken immediately after you finish your test run, and the reproduction steps above. Please post the bug number back here once it's filed.

However, one note to set "expectations". Changing anything in the file system is very high risk, which means we tend to only use software updates to ship changes that:

  1. Address critical bugs.

  2. Are small changes that provide significant benefit, have minimal risk, and are easy to validate.

Everything else we hold off shipping until the next major release to ensure that those changes have as much testing time as possible. That's the best way to protect our users’ data, but it does mean that it can often look like file system issues take a particularly long time to address.

In terms of addressing your immediate issue:

The tests shouldn't be interacting with the same logical data set; we're just using tmp as an on-disk location.

Have you looked at using separate volumes to further separate the data sets? Perhaps using a DiskImage with multiple APFS volumes as the target?

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Based on the general description, I suspect it’s F_FULLFSYNC. F_FULLFSYNC asks the target device to flush its hardware cache, which both takes longer and has "broader" consequences. That doesn't make it "wrong" (the right choice is "the right choice"), but it does have performance consequences.

Yup, it is calling F_FULLFSYNC.

Tracing through the code, jj starts by calling std::fs::File::sync_data. This method then calls sync_data on its private inner file handle, which, in turn, calls libc::fcntl(fd, libc::F_FULLFSYNC), but only for target triple vendors that are Apple. On a bunch of Unixes, std will instead call fdatasync, which does not appear to be implemented on Darwin; fair enough.
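For reference, here's a rough sketch of that dispatch (my approximation using the libc crate; it's not std's actual source, and it's Unix-only):

use std::fs::File;
use std::io;
use std::os::fd::AsRawFd;

// Approximately what File::sync_data boils down to: F_FULLFSYNC on Apple
// targets, fdatasync on most other Unixes.
fn sync_data_like(file: &File) -> io::Result<()> {
    #[cfg(target_vendor = "apple")]
    let ret = unsafe { libc::fcntl(file.as_raw_fd(), libc::F_FULLFSYNC) };

    #[cfg(all(unix, not(target_vendor = "apple")))]
    let ret = unsafe { libc::fdatasync(file.as_raw_fd()) };

    if ret == -1 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}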

However, one note to set "expectations". [...] Everything else we hold off shipping until the next major release to ensure that those changes have as much testing time as possible. That's the best way to protect our users’ data, but it does mean that it can often look like file system issues take a particularly long time to address.

No worries at all! I don't expect this to be fixed any time soon, and I know I'm potentially asking for a very large change if y'all are able to reproduce this properly.

Have you looked at using separate volumes to further separate the data sets? Perhaps using a DiskImage with multiple APFS volumes as the target?

We haven't, but that's certainly an option to explore, especially since it doesn't require the end-user to approve/run anything! So far, we've found success by having developers on macOS run:

sudo mkdir -p /Volumes/RAMDisk
sudo chmod a+wx /Volumes/RAMDisk
sudo mount_tmpfs -e /Volumes/RAMDisk

...and running tests as env TMPDIR=/Volumes/RAMDisk cargo nextest run, but that doesn't Just Work™ out of the box. On an M3 Ultra, this brings our test runtimes back down to around 42.9s when the test runner is given 12 cores (above 12 cores, test runtimes increase, as you can see on the lower, blue graph in my initial post). As most of us are on M{2, 3, 4} Pros instead of the Max or Ultra variants, this works out surprisingly nicely.
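(For anyone reproducing this: the reason redirecting TMPDIR is enough is that, assuming the test fixtures resolve their scratch directory through std::env::temp_dir() or something equivalent that honors TMPDIR, everything lands on the RAM disk. A tiny illustrative check:

use std::env;

fn main() {
    // On Unix, std::env::temp_dir() returns $TMPDIR if it's set and falls
    // back to /tmp otherwise, so fixtures created this way follow TMPDIR.
    println!("test fixtures would land under {}", env::temp_dir().display());
}

Running it as TMPDIR=/Volumes/RAMDisk cargo run should print the RAM disk path.)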

So, the next step here would be to file a bug that includes a system trace of the issue, a sysdiagnose taken immediately after you finish your test run, and the reproduction steps above. Please post the bug number back here once it's filed.

Can do! That'll probably be a tomorrow morning thing.

I'm sorry for the delay in submitting a feedback item; a cross-country move happened. I've submitted FB21093202.

I'm sorry for the delay in submitting a feedback item; a cross-country move happened.

Not a problem, I've got plenty to keep me busy!

That leads to here:

I've submitted FB21093202.

So, looking over the bug, there are a few things I want to clarify:

  • F_FULLFSYNC is basically "always" going to be relatively slow. The API operates on a file handle, but flushing the volume to a coherent state is a broader operation, and the final I/O command is a device command. All of that makes it a fairly "heavy" operation.

  • On the APFS side, the specific concern I have here is about the performance dip at high core count. That's partly because of the immediate issue and mostly because our core count has been increasing, and we need to be watching for these concurrency bottlenecks.

  • Cross-platform comparisons are very tricky here because the platforms are (ideally) making different design decisions based on the specifics of their particular use case and user base. For example, in an enterprise/server environment, there's an argument that F_FULLFSYNC is largely superfluous. The major data risk comes from sudden power loss destroying the uncommitted data in the drive cache, but that isn't really an issue with a large enough UPS or the "right" drive hardware. However, macOS (particularly in the past) basically "lived" at the other end of that spectrum, with lots of machines having multiple inexpensive drives attached, inevitably leading to data loss.

Expanding on that last point, there's a real danger in these comparisons that comes from assuming that both implementations are directly "equivalent", so any performance divergence comes from correctable issues in the other platform. That's certainly true some of the time, but it's definitely NOT true here.

Finally, a comment on your code side:

Tracing through the code, jj starts by calling std::fs::File::sync_data.

Asking a question that probably should have been asked earlier... why? Why are you doing this at all? Unless you're applying some external force/factor (basically, cutting power to the drive), I think all of these sync calls are really just slowing you down. That's true of ALL platforms, not just macOS. If your code is doing it as part of its own internal logic then that's fine, but if this is actually part of your testing infrastructure then I'm not sure you shouldn't just "stop".

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

F_FULLFSYNC is basically "always" going to be relatively slow. The API operates on a file handle, but flushing the volume to a coherent state is a broader operation, and the final I/O command is a device command. All of that makes it a fairly "heavy" operation.

Gotcha, thanks for the background!

On that note, I did my research—to our mutual peril—and found a comment (https://news.ycombinator.com/item?id=25204202) that cited sqlite's codebase, noting that "fdatasync() on HFS+ doesn't yet flush the file size if it changed correctly". Some folks on the postgres mailing list also appear to be a bit uncertain about the current situation: https://www.postgresql.org/message-id/flat/CA%2BhUKGLv-kvrtA5QEL%3Dp%3DdYK0p9gsMXJaVhUFu%2BA-KyFrFi%3D2g%40mail.gmail.com#fe6d1c5665a381687842758fd5b245d4.

I'm a newbie when it comes to understanding C code, but looking over kern_aio.c, I see two seemingly contradictory comments.

On lines 658–659:

* NOTE - we do not support op O_DSYNC at this point since we do not support the
* fdatasync() call.

Later, on lines 721–727, above aio_return:

/*
 * aio_return - return the return status associated with the async IO
 * request referred to by uap->aiocbp.  The return status is the value
 * that would be returned by corresponding IO request (read, write,
 * fdatasync, or sync).  This is where we release kernel resources
 * held for async IO call associated with the given aiocb pointer.
 */

So I guess I'm wondering: is it... fine to now use fdatasync on APFS? Because if it is now fine (as per sqlite's understanding via the Hacker News comment...), then I think there's a bunch of software that might be relying on outdated documentation/advice, since:

  • man fsync on macOS 26.1 refers to a drive's "platters". To the best of my knowledge, my MacBook Pro does not have any platters!
  • As of 2022, it appears that Apple's patched version of SQLite uses F_BARRIERFSYNC. The wording of the documentation, at least for iOS, suggests that file data size would be synced to disk. (See the sketch after this list.)
  • Foundation's FileHandle (which is, I think, equivalent to Rust's std::fs::File?) uses a plain fsync(_fd), not an F_FULLFSYNC like Rust's (and Go's, for that matter!) standard libraries do.
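For reference on the F_BARRIERFSYNC point, here's a minimal illustration of issuing a barrier rather than a full flush, assuming the libc crate exposes the Apple-only F_BARRIERFSYNC fcntl command (this is my sketch, not Apple's SQLite patch):

use std::{fs::File, io, os::fd::AsRawFd};

// Issue an I/O barrier: writes queued before the barrier won't be reordered
// behind writes issued after it, but the drive's cache is not forced to
// flush the way F_FULLFSYNC forces it to.
#[cfg(target_vendor = "apple")]
fn barrier_fsync(file: &File) -> io::Result<()> {
    match unsafe { libc::fcntl(file.as_raw_fd(), libc::F_BARRIERFSYNC) } {
        -1 => Err(io::Error::last_os_error()),
        _ => Ok(()),
    }
}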

On the APFS side, the specific concern I have here is about the performance dip at high core count. That's partly because of the immediate issue and mostly because our core count has been increasing, and we need to be watching for these concurrency bottlenecks.

Understood. My friend André, who produced the graph on his M3 Ultra Mac Studio, theorized this past weekend that part of the observed performance degradation was due to the interconnect between the M3 Maxes, but this was idle speculation over brunch. He, to your point, also noted that the base Apple Silicon chips went from 8 cores to 12 cores in the span of 5 years, with roughly 20% year-over-year performance improvements! That'll certainly stress design assumptions!

Expanding on that last point, there's a real danger in these comparisons that comes from assuming that both implementations are directly "equivalent", so any performance divergence comes from correctable issues in the other platform. That's certainly true some of the time, but it's definitely NOT true here.

Understood! If the answer from your end is "you're hitting a pathological edge case on APFS doing something it wasn't really designed for", then that's fine! It'd be nice if I could have my cake and eat it too à la ext4, but as you mentioned previously, there's a latency/throughput tradeoff here, and APFS is firmly on the side of "latency".

(And just to wave my credentials/sympathy for APFS' position: I spent four years of my life working on a latency-sensitive reimplementation of the Rust compiler, so I get how fundamental these design tradeoffs are!)

Asking a question that probably should have been asked earlier... why? Why are you doing this at all? Unless you're applying some external force/factor (basically, cutting power to the drive), I think all of these sync calls are really just slowing you down. That's true of ALL platforms, not just macOS. If your code is doing it as part of its own internal logic then that's fine, but if this is actually part of your testing infrastructure then I'm not sure you shouldn't just "stop".

No, that's a great point to clarify! The fsyncs are not part of the tests (at least, not directly) but rather part of the core logic of jj itself. The change was introduced in this commit and has reduced the frequency of people reporting data corruption issues.

(The "op_store" in the linked commit can be thought of as a "write ahead log, but for the files you're keeping under version control". We could probably restore from the op log in the event of data corruption, now that I think of it...)

So I guess I'm wondering: is it... fine to now use fdatasync on APFS? Because if it is now fine (as per sqlite's understanding via the Hacker News comment...), then I think there's a bunch of software that might be relying on outdated documentation/advice, since:

So, the first thing to understand is that what led to "F_FULLFSYNC" wasn't really the kernel's own handling of data. The basic issue was that the kernel itself would send the data all the way "to disk"... at which point the drive itself would stick the data in a write cache and leave it there for an indeterminate but "long" time. Even worse:

  • Some of these drives were intentionally ignoring the existing (and more performant) commands that were "supposed" to flush the specific data.

  • These issues were widespread enough across products and vendors that blaming the hardware wasn't really feasible.

The only solution we found viable was what F_FULLFSYNC does, which is to flush the data and then force the drive to commit it. I'll also note that the issue above was never specific to macOS; we just happened to create an API that was intended to resolve it.

That leads to here:

So I guess I'm wondering: is it... fine to now use fdatasync on APFS? Because if it is now fine

Depends on what you mean by "fine". On the one hand, much of the hardware that caused these issues still exists and is likely still being shipped. On the other hand, there are other factors that certainly mitigate or avoid it.

Shifting to specifics:

man fsync on macOS 26.1 refers to a drive's "platters".

While spinning disks certainly make the problem easier to reproduce, a simple search for "SSD write caching data loss" will show that the problem is not unique to platter media. Unfortunately, the basic incentive that created this situation remains the same. The simplest way to make a device "seem" faster is to stick some memory in front of it and cache writes. The simplest way to make that product cheaper is not to worry about power loss.

To the best of my knowledge, my MacBook Pro does not have any platters!

That's true; however, the problem was never really our local storage but external storage. More to the point, laptops (particularly with only local storage) were never really the issue. At a hardware level, the easiest way to avoid this issue is to attach a battery so that the device never loses power. If your device never loses power, it doesn't matter how long data sits in cache.

Case in point:

As of 2022, it appears that Apple's patched version of SQLite uses F_BARRIERFSYNC. The wording of the documentation, at least for iOS,

...iOS is just the more "extreme" version of the same issue. Kernel panics and hammers aside, the device can't really "lose" power. If you let the battery drain to "empty", then the system will eventually power down normally, flushing all data.

Foundation's FileHandle (which is, I think, equivalent to Rust's std::fs::File?) uses a plain fsync(_fd), not an F_FULLFSYNC like Rust's (and Go's, for that matter!) standard libraries do

Yes, they do. For lots of use cases, even that fsync is unnecessary. I think one thing to understand is that this comment from the fsync man page isn't a throwaway "note":

"For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl."

Lots of apps don't really need to worry about this too much. As the most notable example, our recommended save pattern is to:

  1. Create a new file as your starting point (often this means "copy/clone the original").

  2. Write all changes to that new file.

  3. Atomically exchange the old and the new file.

  4. Delete the original.

In practice, it's fairly hard to ACTUALLY lose data with that sequence. That is, it's pretty unlikely that you'll be able to "get" all the way to #4 without the changes in #1 having been committed to disk. Theoretically, it could happen, but the system's general "incentive" here is to push #1 out as quickly as possible (to free up memory) and to delay committing #3 and #4 (to minimize metadata writes), both of which increase the likelihood that the data will flush.
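As an illustration of that pattern (my sketch, not Apple sample code), here it is in Rust, collapsing steps 3 and 4 into the single atomic rename-over that's the common POSIX idiom; note there's no explicit fsync or F_FULLFSYNC, since the point is that the ordering alone makes outright data loss unlikely:

use std::fs;
use std::io::{self, Write};
use std::path::Path;

// Safe-save sketch: write the new contents to a sibling temporary file, then
// atomically replace the original.
fn safe_save(path: &Path, new_contents: &[u8]) -> io::Result<()> {
    // Hypothetical temp name; a real implementation would pick a unique one.
    let tmp = path.with_extension("tmp-save");

    // Steps 1 and 2: create the new file and write all changes to it.
    let mut file = fs::File::create(&tmp)?;
    file.write_all(new_contents)?;

    // Steps 3 and 4, collapsed: rename(2) atomically replaces the original
    // with the new file on POSIX file systems, including APFS.
    fs::rename(&tmp, path)?;
    Ok(())
}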

Similarly, for many databases, the concern here isn't simple data loss. As long as writes are ordered, all that's really happened is that you "appear" to crash earlier than you actually did. The problem for databases is that the writes AREN'T necessarily ordered:

"The disk drive may also re-order the data so that later writes may be present, while earlier writes are not."

...but that's much less of a factor if you're dealing with "larger" files being written as monolithic "chunks".

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
