Lock Contention in APFS/Kernel?

Hello! Some colleagues and I work on Jujutsu, a version control system compatible with Git, and I think we've uncovered a potential lock contention bug in either APFS or the Darwin kernel. There are four contributing factors that lead us to suspect APFS or the kernel:

  • jj's testsuite uses nextest, a test runner for Rust that spawns each individual test as a separate process.
  • The test suite slowed down by a factor of ~5x on macOS after jj started using fsync, and the slowdown grows as additional cores are allocated.
  • A similar slowdown did not occur on ext4.
  • Similar performance issues were reported in the past by a former Mercurial maintainer: https://gregoryszorc.com/blog/2018/10/29/global-kernel-locks-in-apfs/.

My friend and colleague André has measured the test suite on an M3 Ultra with both a ramdisk and a traditional SSD and produced this graph:

(The most thorough writeup is the discussion on this pull request.)

I know I should file a feedback/bug report, but before I do, I'm struggling with profiling and with finding kernel/APFS frames in my profiles so that I can properly attribute the cause of this apparent lock contention. Naively, I ran xctrace record --template 'Time Profiler' --output output.trace --launch /Users/dbarsky/.cargo/bin/cargo-nextest nextest run. While that detected all processes spawned by nextest, it didn't record them as part of the same inspectable profile (I had to select individual processes), and it didn't really show any frames from the kernel/APFS. So that I don't waste people's time and can point to a frame/smoking gun in the right system: how can I use Instruments to profile where the kernel and/or APFS are spending their time? Do I need to disable SIP?

There's a System Trace template that is a better starting point than Time Profiler for your needs here.

— Ed Ford, DTS Engineer

Let me start with the comparison here:

Similar performance issues were reported in the past by a former Mercurial maintainer: https://gregoryszorc.com/blog/2018/10/29/global-kernel-locks-in-apfs/.

Unfortunately, there are some fairly significant flaws in the analysis there. Picking one of his data points:

$ time ./slow-readdir.py -l 12 ~/src/firefox
ran 12 walks across 12 processes in 172.209s

real    2m52.470s
user    1m54.053s
sys    23m42.808s

Based on his overall article, I believe he thinks that test took ~23m when in fact it took ~3m. That is, the actual test time was:

...in 172.209s -> ~2m52s ->
real    2m52.470s

In general, that's what "real" means in the output of time: "how long did this test actually take."

In terms of the difference in system time, that appears to have been due to an accounting issue in the kernel. Per the investigation that was done on the bug, this particular test script ended up having multiple threads blocked in the kernel (waiting on work to complete) and the kernel included those times in sys time when it probably should not have.

The full details of that investigation aren't something I can really summarize, but the rough summary is that some of the difference was due to fundamental design differences between Linux and macOS (things like the sandbox and how much memory the I/O system is allowed to consume), and some of it was due to increased lock contention at higher core counts. Everything found was largely addressed in macOS 10.14 (Mojave).

Moving to your specific concerns:

A similar slowdown did not occur on ext4.

Unfortunately, comparisons between different systems are extremely tricky due to "fundamental design differences." One of macOS's fundamental design choices is that its goal is to be "responsive," NOT "fast." In other words, the system's ability to quickly react to the user's actions is MORE important than the total time it takes to complete any given operation. In the I/O system, that means restricting I/O memory commitments more than Linux does, both to reserve memory for other purposes and to reduce the chance that the system will have to stall waiting for memory to be flushed to disk.

Next here:

The test suite slowed down by a factor of ~5x on macOS after jj started using fsync.

A few questions:

  • Are you calling "fsync" or are you doing fcntl(F_FULLFSYNC)? The second call will be significantly slower, but if you need the data to actually reach the disk, then you need to use "F_FULLFSYNC" (see the fsync(2) and fcntl(2) man pages for more details; there's a short sketch after this list).

  • To some extent, either call will exhibit a certain degree of "lock-like" behavior, both to stabilize the state of the file system at a particular "point" and to prevent the data that's actively being "stabilized" from being modified.

  • If all of these tests are interacting with the same dataset, then that obviously exacerbates the issue.
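
To make that distinction concrete, here's a minimal sketch of issuing the full flush from Rust (this assumes the libc crate; full_fsync is just an illustrative name):

use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

// Flush `file` all the way to durable storage on macOS.
// fsync(2) only guarantees the data has been handed to the drive;
// F_FULLFSYNC additionally asks the drive to flush its hardware cache.
fn full_fsync(file: &File) -> io::Result<()> {
    let ret = unsafe { libc::fcntl(file.as_raw_fd(), libc::F_FULLFSYNC) };
    if ret == -1 {
        Err(io::Error::last_os_error())
    } else {
        Ok(())
    }
}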

So that I don't waste people's time and can point to a frame/smoking gun in the right system: how can I use Instruments to profile where the kernel and/or APFS are spending their time? Do I need to disable SIP?

Ed's suggestion to use System Trace is a good one; however, the most useful thing you can do is provide tests we can run ourselves. The reality is that even when a system trace clearly demonstrates "a problem," we generally need to reproduce the problem ourselves so that we can confirm any fix actually does what we think it will.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks y'all! Ed, I'll try running the system trace tomorrow morning (I'm in New York City, so it's the end of the day for me).

Kevin:

Everything found was largely addressed in macOS 10.14 (Mojave).

Ah! Good to know that my information is kinda out of date :)

Unfortunately, comparisons between different systems are extremely tricky due to "fundamental design differences." One of macOS's fundamental design choices is that its goal is to be "responsive," NOT "fast." In other words, the system's ability to quickly react to the user's actions is MORE important than the total time it takes to complete any given operation. In the I/O system, that means restricting I/O memory commitments more than Linux does, both to reserve memory for other purposes and to reduce the chance that the system will have to stall waiting for memory to be flushed to disk.

Understood! Makes a lot of sense given the common, respective usages of Linux vs. macOS!

Are you calling "fsync" or are you doing fcntl(F_FULLFSYNC)? The second call will be significantly slower, but if you need the data to actually reach the disk, then you need to use "F_FULLFSYNC" (see the fsync(2) and fcntl(2) man pages for more details).

I'm not sure if we're using fsync or fcntl(F_FULLFSYNC), but I'll follow up and give an answer tomorrow.

If all of these tests are interacting with the same dataset, then that obviously exacerbates the issue.

The tests shouldn't be interacting with the same logical data set; we're just using tmp as an on-disk location.

however, the most useful thing you can do is provide tests we can run ourselves.

It's not too bad:

  1. I'm guessing y'all have the clang/the Xcode command line tools installed, but if you don't, you need those!
  2. Install Rust. I recommend using https://rustup.rs/, which will install the Rust compiler and Cargo at ~/.rustup/ and ~/.cargo. It will also add those paths to your $PATH.
  3. Clone jj: git clone https://github.com/jj-vcs/jj.git
  4. Install the per-process test runner we use, cargo-nextest, by running cargo install cargo-nextest --locked. cargo-nextest should now be in your $PATH, installed at ~/.cargo/bin/cargo-nextest.
  5. Inside the cloned jj repo, run cargo nextest run --no-fail-fast.
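
For convenience, here are those steps condensed into a single shell session (assuming the command line tools and Rust are already installed):

git clone https://github.com/jj-vcs/jj.git
cargo install cargo-nextest --locked
cd jj
cargo nextest run --no-fail-fast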

The above should be sufficient to replicate the slowness we've observed, but let me know if you run into any issues.

I'm not sure if we're using fsync or fcntl(F_FULLFSYNC), but I'll follow up and give an answer tomorrow.

Based on the general description, I suspect it’s F_FULLFSYNC. F_FULLFSYNC asks the target device to flush its hardware cache, which both takes longer and has "broader" consequences. That doesn't make it "wrong" (the right choice is "the right choice"), but it does have performance consequences.

The above should be sufficient to replicate the slowness we've observed, but let me know if you run into any issues.

So, the next step here would be to file a bug that includes a system trace of the issue, a sysdiagnose taken immediately after you finish your test run, and the reproduction steps above. Please post the bug number back here once it's filed.

However, one note to set "expectations". Changing anything in the file system is very high risk, which means we tend to only use software updates to ship changes that:

  1. Address critical bugs.

  2. Are small changes that provide significant benefit, carry minimal risk, and are easy to validate.

Everything else we hold off shipping until the next major release to ensure that those changes have as much testing time as possible. That's the best way to protect our users’ data, but it does mean that it can often look like file system issues take a particularly long time to address.

In terms of addressing your immediate issue:

The tests shouldn't be interacting with the same logical data set; we're just using tmp as an on-disk location.

Have you looked at using separate volumes to further separate the data sets? Perhaps using a disk image with multiple APFS volumes as the target?
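
Something along these lines would create multiple volumes inside one disk image (a sketch; the size, names, and container identifier are illustrative):

hdiutil create -size 2g -fs APFS -volname Scratch1 scratch.dmg
hdiutil attach scratch.dmg
# note the APFS container identifier (e.g. disk5) printed by attach, then:
diskutil apfs addVolume disk5 APFS Scratch2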

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Based on the general description, I suspect it’s F_FULLFSYNC. F_FULLFSYNC asks the target device to flush its hardware cache, which both takes longer and has "broader" consequences. That doesn't make it "wrong" (the right choice is "the right choice"), but it does have performance consequences.

Yup, it is calling F_FULLFSYNC.

Tracing through the code, jj starts by calling std::fs::File::sync_data. That method calls sync_data on its private inner handle, which, in turn, calls libc::fcntl(fd, libc::F_FULLFSYNC), but only for target triples whose vendor is Apple. On many other Unixes, std instead calls fdatasync, which doesn't appear to be implemented on Darwin. Fair enough.
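
In other words, any plain sync_data on macOS is already paying the full-flush cost. A minimal demonstration (the path is illustrative):

use std::fs::File;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let mut f = File::create("/tmp/example")?;
    f.write_all(b"some data")?;
    // On Apple targets this ends up as fcntl(fd, F_FULLFSYNC);
    // on most other Unixes it would be fdatasync(fd).
    f.sync_data()?;
    Ok(())
}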

However, one note to set "expectations". [...] Everything else we hold off shipping until the next major release to ensure that those changes have as much testing time as possible. That's the best way to protect our users’ data, but it does mean that it can often look like file system issues take a particularly long time to address.

No worries at all! I don't expect this to be fixed any time soon, and I know I'm potentially asking for a very large change if y'all are able to reproduce this properly.

Have you looked at using separate volumes to further separate the data sets? Perhaps using a disk image with multiple APFS volumes as the target?

We haven't, but that's certainly an option to explore, especially since it doesn't require the end user to approve/run anything! So far, we've found success by having developers on macOS run:

sudo mkdir -p /Volumes/RAMDisk
sudo chmod a+wx /Volumes/RAMDisk
sudo mount_tmpfs -e /Volumes/RAMDisk

...and running tests as env TMPDIR=/Volumes/RAMDisk cargo nextest run, but that doesn't Just Work™ out of the box. On an M3 Ultra, this brings our test runtimes back down to around 42.9s when the test runner is given 12 cores (above 12 cores, test runtimes increase, as you can see in the lower, blue graph in my initial post). As most of us are on M{2, 3, 4} Pros rather than the Max or Ultra variants, this works out surprisingly nicely.

So, the next step here would be to file a bug that includes a system trace of the issue, a sysdiagnose taken immediately after you finish your test run, and the reproduction steps above. Please post the bug number back here once it's filed.

Can do! That'll probably be a tomorrow morning thing.
