Lock Contention in APFS/Kernel?

Hello! Some colleagues and I work on Jujutsu (jj), a version control system compatible with Git, and I think we've uncovered a potential lock contention bug in either APFS or the Darwin kernel. There are four factors that lead us to think this is related to APFS or the kernel:

  • jj's testsuite uses nextest, a test runner for Rust that spawns each individual test as a separate process.
  • The testsuite slowed down by a factor of ~5x on macOS after jj started using fsync. The slowdown increases as additional cores are allocated.
  • A similar slowdown did not occur on ext4.
  • Similar performance issues were reported in the past by a former Mercurial maintainer: https://gregoryszorc.com/blog/2018/10/29/global-kernel-locks-in-apfs/.

My friend and colleague André measured the test suite on an M3 Ultra, on both a ramdisk and a traditional SSD, and produced this graph:

(The most thorough writeup is the discussion on this pull request.)

I know I should file a feedback/bug report, but before I do, I'm struggling with profiling and finding kernel/APFS frames in my profiles so that I can properly attribute the cause of this apparent lock contention. Naively, I ran xctrace record --template 'Time Profiler' --output output.trace --launch /Users/dbarsky/.cargo/bin/cargo-nextest nextest run, and while that detected all of the processes spawned by nextest, it didn't record them as part of the same inspectable profile and didn't really show any frames from the kernel/APFS; I had to select individual processes instead. So that I don't waste people's time and can point to a frame/smoking gun in the right system, how can I use Instruments to profile where the kernel and/or APFS are spending their time? Do I need to disable SIP?

There's a System Trace template that is a better starting point than Time Profiler for your needs here.
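If it helps, the simplest path is probably to reuse the xctrace invocation you already have and just swap the template (everything else here is copied from your command, so adjust as needed):

xctrace record --template 'System Trace' --output output.trace --launch /Users/dbarsky/.cargo/bin/cargo-nextest nextest run

System Trace captures thread scheduling and system call activity rather than sampled user-space backtraces, which is generally where this kind of file-system contention shows up.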

— Ed Ford, DTS Engineer

Let me start with the comparison here:

Similar performance issues were reported in the past by a former Mercurial maintainer: https://gregoryszorc.com/blog/2018/10/29/global-kernel-locks-in-apfs/.

Unfortunately, there are some fairly significant flaws in the analysis there. Picking one of his data points:

$ time ./slow-readdir.py -l 12 ~/src/firefox
ran 12 walks across 12 processes in 172.209s

real    2m52.470s
user    1m54.053s
sys    23m42.808s

Based on his overall article, I believe he thinks that test took ~23m when in fact it took ~3m. That is, the actual test time was:

...in 172.209s -> ~2m52s ->
real    2m52.470s

In general, that's what "real" means in time's output -> "how long did this actually take".

In terms of the difference in system time, that appears to have been due to an accounting issue in the kernel. Per the investigation that was done on the bug, this particular test script ended up having multiple threads blocked in the kernel (waiting on work to complete), and the kernel included that blocked time in sys time when it probably should not have.

The full details of that investigation aren't something I can really go into here, but the rough summary is that some of the difference was due to fundamental design differences between Linux and macOS (things like the sandbox and how much memory the I/O system is allowed to consume) and some of it was due to increased lock contention at higher core counts. Everything that was found was largely addressed in macOS 10.14 (Mojave).

Moving to your specific concerns:

A similar slowdown did not occur on ext4.

Unfortunately, comparisons between different systems are extremely tricky due to "fundamental design differences." One of macOS's fundamental design choices is that its goal is to be "responsive," NOT "fast." In other words, the system's ability to quickly react to the user's actions is MORE important than the total time it takes to complete any given operation. In the I/O system, that means restricting I/O memory commitments more than Linux does, both to reserve memory for other purposes and to reduce the chance that the system will have to stall waiting for memory to be flushed to disk.

Next here:

The testsuite slowed down by a factor of ~5x on macOS after jj started using fsync.

A few questions:

  • Are you calling "fsync" or are you doing fcntl(F_FULLFSYNC)? The second call will be significantly slower, but if you need the data to actually reach the disk, then you need to use "F_FULLFSYNC" (see the fsync(2) and fcntl(2) man pages for more details, and the sketch after this list).

  • To some extent, either call will exhibit a certain degree of "lock-like" behavior, both to stabilize the state of the file system at a particular "point" and to prevent the data that's actively being "stabilized" from being modified.

  • If all of these tests are interacting with the same dataset, then that obviously exacerbates the issue.
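To make that first point concrete, here's a minimal sketch of the difference from Rust, written against the libc crate; it is not taken from jj's code, and the function name and structure are made up for illustration:

use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

// Hypothetical helper: flush `file` either with fsync(2) or with
// fcntl(F_FULLFSYNC) on macOS.
fn flush(file: &File, force_to_media: bool) -> io::Result<()> {
    let fd = file.as_raw_fd();
    let rc = if force_to_media {
        // Also asks the storage device to flush its own cache, so the
        // data actually reaches durable media. Significantly slower.
        unsafe { libc::fcntl(fd, libc::F_FULLFSYNC) }
    } else {
        // Hands the data to the device, but it may still sit in the
        // device's volatile cache.
        unsafe { libc::fsync(fd) }
    };
    if rc == -1 {
        Err(io::Error::last_os_error())
    } else {
        Ok(())
    }
}

If jj goes through the Rust standard library rather than libc, note that, if I recall correctly, recent versions of std issue F_FULLFSYNC from File::sync_all on Apple platforms, so you may be paying that cost without ever calling fcntl yourself; it's worth checking the std sources for the toolchain you're using.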

So that I don't waste people's time and can point to a frame/smoking gun in the right system, how can I use Instruments to profile where the kernel and/or APFS are spending their time? Do I need to disable SIP?

Ed's suggestion to use System Trace is a good one; however, the most useful thing you can do is provide tests we can run ourselves. The reality is that even when a system trace clearly demonstrates "a problem," we generally need to reproduce the problem ourselves so that we can confirm any fix actually does what we think it will.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks y'all! Ed, I'll try running the system trace tomorrow morning (I'm in New York City, so it's the end of the day for me).

Kevin:

Everything found was largely addressed in macOS 10.14 (Mojave).

Ah! Good to know that my information is kinda out of date :)

Unfortunately, comparisons between different systems are extremely tricky due to "fundamental design differences." One of macOS's fundamental design choices is that its goal is to be "responsive," NOT "fast." In other words, the system's ability to quickly react to the user's actions is MORE important than the total time it takes to complete any given operation. In the I/O system, that means restricting I/O memory commitments more than Linux does, both to reserve memory for other purposes and to reduce the chance that the system will have to stall waiting for memory to be flushed to disk.

Understood! Makes a lot of sense given the common, respective usages of Linux vs. macOS!

Are you calling "fsync" or are you doing fcntl(F_FULLFSYNC)? The second call will be significantly slower, but if you need the data to actually reach the disk, then you need to use "F_FULLFSYNC" (see the fsync(2) and fcntl(2) man pages for more details).

I'm not sure whether we're using fsync or fcntl(F_FULLFSYNC), but I'll follow up and give an answer tomorrow.

If all of these tests are interacting with the same dataset, then that obviously exacerbates the issue.

The tests shouldn't be interacting with the same logical data set; we're just using tmp as an on-disk location.

however, the most useful thing you can do is provide tests we can run ourselves.

It's not too bad:

  1. I'm guessing y'all have clang/the Xcode command line tools installed, but if you don't, you'll need those!
  2. Install Rust. I recommend using https://rustup.rs/, which will install the Rust compiler and Cargo under ~/.rustup/ and ~/.cargo/. It will also add ~/.cargo/bin to your $PATH.
  3. Clone jj: git clone https://github.com/jj-vcs/jj.git
  4. Install the per-process test runner we use, cargo-nextest, by running cargo install cargo-nextest --locked. cargo-nextest should now be in your $PATH, installed at ~/.cargo/bin/cargo-nextest.
  5. Inside the cloned jj repo, run cargo nextest run --no-fail-fast.

The above should be sufficient to replicate the slowness we've observed, but let me know if you run into any issues.
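One more thing that might help when reproducing the core-count scaling: nextest can cap the number of concurrent test processes, so (if I'm remembering the flag correctly; check cargo nextest run --help) you can compare runs like these:

cargo nextest run --no-fail-fast --test-threads 1
cargo nextest run --no-fail-fast --test-threads 8

On our machines, the per-test overhead grows as more of those test processes run in parallel, which matches the behavior in the graph above.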
