Hello! Some colleagues and work on Jujutsu, a version control system compatible with git, and I think we've uncovered a potential lock contention bug in either APFS or the Darwin kernel. There are four contributing factors to us thinking this is related to APFS or the Kernel:
- jj's testsuite uses nextest, a test runner for Rust that spawns each individual test as a separate process.
- The testsuite slowed down by a factor of ~5x on macOS after jj started using
fsync. The slowdown increases as additional cores are allocated. - A similar slowdown did not occur on ext4.
- Similar performance issues were reported in the past by a former Mercurial maintainer: https://gregoryszorc.com/blog/2018/10/29/global-kernel-locks-in-apfs/.
My friend and colleague André has measured the test suite on an M3 Ultra with both a ramdisk and a traditional SSD and produced this graph:
(The most thorough writeup is the discussion on this pull request.)
I know I should file a feedback/bug report, but before I do, I'm struggling with profiling and finding kernel/APFS frames in my profiles so that I can properly attribute the cause of this apparent lock contention. Naively, I ran xctrace record --template 'Time Profiler' --output output.trace --launch /Users/dbarsky/.cargo/bin/cargo-nextest nextest run, and while that detected all processes spawned by nextest, it didn't record all processes as part of the same inspectable profile and didn't really show any frames from the kernel/APFS—I had to select individual processes. So I don't waste people's time and so that I can point a frame/smoking gun in the right system, how can I can use instruments to profile where the kernel and/or APFS are spending its time? Do I need to disable SIP?