VFS deadlock (WebDAV on Apple Silicon only)

  • FB9108925
  • FB10408005

Since Apple Silicon we've seen a lot of WebDAV instability in macOS 11.x, 12.x and now 13.x that isn't found on x86 Macs. Some were fixed in earlier minor OS upgrades (e.g. webdavfs-387.100.1 that added a missing mutex init), but it's still highly unreliable. The purpose of this post is to put more focus on the bug, see if there is something else we can do to help solve this, as well as hear about potential workarounds from other people experiencing the same problems.

I've got a reproducible case described below that triggers a deadlock in VFS every time, requiring a hard reboot to fully recover. Before reboot I've captured this stack trace showing the WebDAV/VFS/UBC/VM layers getting tangled up (macOS 13.2 Build 22D49 running on Macmini9,1):

  Thread 0x16358    1001 samples (1-1001)    priority 46 (base 31)
  1001  thread_start + 8 (libsystem_pthread.dylib + 7724) [0x18d2e0e2c]
    1001  _pthread_start + 148 (libsystem_pthread.dylib + 28780) [0x18d2e606c]
      1001  ??? (diskarbitrationd + 99400) [0x100d5c448]
        1001  unmount + 8 (libsystem_kernel.dylib + 55056) [0x18d2b2710]
         *1001  ??? (kernel.release.t8103 + 30712) [0xfffffe00083437f8]
           *1001  ??? (kernel.release.t8103 + 1775524) [0xfffffe00084ed7a4]
             *1001  ??? (kernel.release.t8103 + 7081508) [0xfffffe00089fce24]
               *1001  ??? (kernel.release.t8103 + 2522264) [0xfffffe00085a3c98]
                 *1001  ??? (kernel.release.t8103 + 2523168) [0xfffffe00085a4020]
                   *1001  vnode_iterate + 728 (kernel.release.t8103 + 2410988) [0xfffffe00085889ec]
                     *1001  ??? (kernel.release.t8103 + 6095404) [0xfffffe000890c22c]
                       *1001  ??? (kernel.release.t8103 + 1097172) [0xfffffe0008447dd4]
                         *1001  ??? (kernel.release.t8103 + 1100852) [0xfffffe0008448c34]
                           *1001  ??? (kernel.release.t8103 + 1024804) [0xfffffe0008436324]
                             *1001  ??? (kernel.release.t8103 + 1025092) [0xfffffe0008436444]
                               *1001  ??? (kernel.release.t8103 + 6497736) [0xfffffe000896e5c8]
                                 *1001  ??? (kernel.release.t8103 + 2705840) [0xfffffe00085d09b0]
                                   *1001  webdav_vnop_pageout + 432 (com.apple.filesystems.webdav + 16920) [0xfffffe000b2db7b8]
                                     *1001  webdav_vnop_close + 64 (com.apple.filesystems.webdav + 9492) [0xfffffe000b2d9ab4]
                                       *1001  webdav_vnop_close_locked + 96 (com.apple.filesystems.webdav + 19708) [0xfffffe000b2dc29c]
                                         *1001  webdav_close_mnomap + 264 (com.apple.filesystems.webdav + 20004) [0xfffffe000b2dc3c4]
                                           *1001  webdav_fsync + 404 (com.apple.filesystems.webdav + 20484) [0xfffffe000b2dc5a4]
                                             *1001  ubc_msync + 184 (kernel.release.t8103 + 6096856) [0xfffffe000890c7d8]
                                               *1001  ??? (kernel.release.t8103 + 1097172) [0xfffffe0008447dd4]
                                                 *1001  ??? (kernel.release.t8103 + 1100728) [0xfffffe0008448bb8]
                                                   *1001  lck_rw_sleep + 136 (kernel.release.t8103 + 505804) [0xfffffe00083b77cc]
                                                     *1001  ??? (kernel.release.t8103 + 607656) [0xfffffe00083d05a8]
                                                       *1001  ??? (kernel.release.t8103 + 613952) [0xfffffe00083d1e40]

I've spent countless hours reading the xnu-8792.81.2 and webdavfs-392 sources trying to understand what happens. Symbols mapped back to the source code tell me it's trying to flush a dirty mmap'ed file back to the WebDAV host when the volume is about to get unmounted, but I suspect the pageout request is triggered recursively, perhaps because the mmap'ed file has shrunk and pages need to be released?

The test case:

  1. Use Finder to connect to a WebDAV volume which holds a fairly large image (200 MB Photoshop file in my case).
  2. Navigate to this file in column mode so Finder renders a preview (using a QuickLook process). I believe this mmap's the file, but that alone isn't sufficient, so I think the Finder tries to write an updated thumbnail back to the volume as well.
  3. Click the Eject icon in the Finder to unmount the volume, which now deadlocks that file system.

In the end something remains unreleased in the filesystem since the unmount request never completes, so whether that's a VNode lock or just open file refcount or something else I don't know.

Now, why this deadlock is only seen on Apple Silicon is a mystery. Is Finder/QuickLook executing different code paths for generating or storing the thumbnail? Or is there yet more cases of uninitialized mutexes/locks that happen to be accidentally functional on x86 but expose a problem on AS? I've been through a lot of kernel source code trying to find any but have come up short. But since the above is easily reproduced I'm hoping someone with filesystem/kernel debug capability can succeed in pinpointing the bug. It's at least positive that the overall architecture works on x86 so I'm hoping it is a simple fix in the end.

The reason I'm debugging this is we've got a lot of customers running WebDAV on M1/M2 and they find Finder file copying highly unreliable (i.e. writing many files to the WebDAV server, possibly overwriting existing files; some users have reported a need to reboot 20 times a day). I'm really looking for a bug that's common to all of these tasks, not just the mmap + unmount problem which is a minimal test case that I've cooked up in the lab. The few spindumps I've seen from end users have also included the combination of webdav_vnop_pageout + webdav_fsync + ubc_msync + lck_rw_sleep even if unmounting wasn't the initial op that forced the deadlock.

This problem has been reproduced with different WebDAV server vendors, and there is a test account on a server running Apache provided in FB10408005 (though please select the PSD file, not just the tiny JPG).

Thanks in advance!

Replies

Thanks to anyone that has read my post and have been debugging the deadlock! Just a few days ago the source packages for Ventura 13.3 were published, and in that I found webdavfs-395. The following diff is one that I'm very happy to see:

https://github.com/apple-oss-distributions/webdavfs/commit/b7756b02549929bb18062ebcd76f0bbb75a120cb

That change does in fact target exactly the use-case of calling ubc_msync() on mmap'ed files at clean-up, but unfortunately real-world testing of a fresh Ventura 13.5b4 still triggers the deadlock despite including webdavfs-395.

In other words, it seems there is still a code path that goes through webdav_vnop_pageout, determine that is_open == TRUE and end up calling webdav_vnop_close -> ... -> webdav_fsync -> ubc_msync meaning the newly introduced flag WEBDAV_PAGEOUT_CLOSE_IN_RECLAIM wasn't set in webdav_vnop_pageout. So is vnode_isrecycled not really the check that's relevant, or is the page-out function called multiple times (either recursively or in sequence, e.g. due to webdav_unmount flushing twice), or maybe something else?

My test case remains the same as above, and for completeness I'm including a fresh thread dump from the deadlocked unmount thread on Ventura 13.5b4:

Date/Time:        2023-07-03 10:35:06.796 +0200
End time:         2023-07-03 10:35:16.801 +0200
OS Version:       macOS 13.5 (Build 22G5059d)
Architecture:     arm64e
Report Version:   40

Data Source:      Stackshots
Shared Cache:     725CB32F-D723-38F2-8952-4D21C1FD290B slid base address 0x198a8c000, slide 0x18a8c000 (System Primary)
Shared Cache:     D6EB184C-4628-3C49-9D21-5D5A97D08FDC slid base address 0x1bd9e8000, slide 0x3d9e8000 (DriverKit)
Shared Cache:     4E3FAD7E-E5B0-35FD-BF81-F0E22E907F07 slid base address 0x7ff8084cc000, slide 0x84cc000 (Rosetta)

Duration:         10.00s
Steps:            1001 (10ms sampling interval)

Hardware model:   Macmini9,1
Active cpus:      8
HW page size:     16384
VM page size:     16384

[...]

Process:          diskarbitrationd [537]
UUID:             9DF766CA-6596-3311-8A02-20FC83FD3A24
Path:             /usr/libexec/diskarbitrationd
Codesigning ID:   com.apple.diskarbitrationd
Shared Cache:     725CB32F-D723-38F2-8952-4D21C1FD290B slid base address 0x198a8c000, slide 0x18a8c000 (System Primary)
Architecture:     arm64e
Parent:           launchd [1]
UID:              0
Sudden Term:      Tracked
Footprint:        2993 KB
Time Since Fork:  1033s
Num samples:      1001 (1-1001)
Note:             1 idle work queue thread omitted

  Thread 0xf69    1001 samples (1-1001)    priority 31 (base 31)
  1001  _dispatch_sig_thread + 60 (libdispatch.dylib + 97144) [0x198cefb78]
    1001  __sigsuspend_nocancel + 8 (libsystem_kernel.dylib + 34280) [0x198e535e8]
     *1001  ??? (kernel.release.t8103 + 5219752) [0xfffffe00088aa5a8]

  Thread 0x55ad    1001 samples (1-1001)    priority 46 (base 31)
  1001  thread_start + 8 (libsystem_pthread.dylib + 7584) [0x198e86da0]
    1001  _pthread_start + 148 (libsystem_pthread.dylib + 28584) [0x198e8bfa8]
      1001  ??? (diskarbitrationd + 99436) [0x102a2c46c]
        1001  unmount + 8 (libsystem_kernel.dylib + 54736) [0x198e585d0]
         *1001  ??? (kernel.release.t8103 + 30712) [0xfffffe00083b77f8]
           *1001  ??? (kernel.release.t8103 + 1599252) [0xfffffe0008536714]
             *1001  ??? (kernel.release.t8103 + 6321364) [0xfffffe00089b74d4]
               *1001  ??? (kernel.release.t8103 + 2282292) [0xfffffe00085dd334]
                 *1001  ??? (kernel.release.t8103 + 2283096) [0xfffffe00085dd658]
                   *1001  vnode_iterate + 528 (kernel.release.t8103 + 2192876) [0xfffffe00085c75ec]
                     *1001  ??? (kernel.release.t8103 + 5437556) [0xfffffe00088df874]
                       *1001  ??? (kernel.release.t8103 + 987212) [0xfffffe00084a104c]
                         *1001  ??? (kernel.release.t8103 + 990312) [0xfffffe00084a1c68]
                           *1001  ??? (kernel.release.t8103 + 930160) [0xfffffe0008493170]
                             *1001  ??? (kernel.release.t8103 + 930448) [0xfffffe0008493290]
                               *1001  ??? (kernel.release.t8103 + 5806728) [0xfffffe0008939a88]
                                 *1001  webdav_vnop_pageout + 452 (com.apple.filesystems.webdav + 17112) [0xfffffe000b37d708]
                                   *1001  webdav_vnop_close + 64 (com.apple.filesystems.webdav + 9628) [0xfffffe000b37b9cc]
                                     *1001  webdav_vnop_close_locked + 96 (com.apple.filesystems.webdav + 19916) [0xfffffe000b37e1fc]
                                       *1001  webdav_close_mnomap + 264 (com.apple.filesystems.webdav + 20212) [0xfffffe000b37e324]
                                         *1001  webdav_fsync + 416 (com.apple.filesystems.webdav + 20704) [0xfffffe000b37e510]
                                           *1001  ubc_msync + 184 (kernel.release.t8103 + 5438968) [0xfffffe00088dfdf8]
                                             *1001  ??? (kernel.release.t8103 + 987212) [0xfffffe00084a104c]
                                               *1001  ??? (kernel.release.t8103 + 989672) [0xfffffe00084a19e8]
                                                 *1001  lck_rw_sleep + 132 (kernel.release.t8103 + 461616) [0xfffffe0008420b30]
                                                   *1001  ??? (kernel.release.t8103 + 555992) [0xfffffe0008437bd8]
                                                     *1001  ??? (kernel.release.t8103 + 562548) [0xfffffe0008439574]

I'm incredibly thankful that there's someone actively working on this problem, and please let me know if I can help in any way.