Hi. I am facing a kernel panic in a distributed virtual filesystem of my own making. The panic arises when copying a large folder or writing a large file (both around 20 GB). An important note: the amount of data we try to copy is larger than the available space (for testing purposes, the virtual filesystem had a capacity of 18 GB).
- The panic arises somewhere 12-14 GB into the copy. At the moment of the panic, there are still several gigabytes of free space left.
- The problem reliably reproduces on the following macOS versions and architectures:
  Sonoma 14.7.1 (arm64e), Monterey 12.7.5 (arm64e), Ventura 13.7.1 (Intel)
- Part of the panic log from Ventura 13.7.1 (Intel), with symbolicated addresses:
panic(cpu 2 caller 0xffffff80191a191a): watchdog timeout: no checkins from watchdogd in 90 seconds (48 total checkins since monitoring last enabled)
Panicked task 0xffffff907c99f698: 191 threads: pid 0: kernel_task
Backtrace (CPU 2), panicked thread: 0xffffff86e359cb30, Frame : Return Address
0xffffffff001d7bb0 : 0xffffff8015e70c7d mach_kernel : _handle_debugger_trap + 0x4ad
0xffffffff001d7c00 : 0xffffff8015fc52e4 mach_kernel : _kdp_i386_trap + 0x114
0xffffffff001d7c40 : 0xffffff8015fb4df7 mach_kernel : _kernel_trap + 0x3b7
0xffffffff001d7c90 : 0xffffff8015e11971 mach_kernel : _return_from_trap + 0xc1
0xffffffff001d7cb0 : 0xffffff8015e70f5d mach_kernel : _DebuggerTrapWithState + 0x5d
0xffffffff001d7da0 : 0xffffff8015e70607 mach_kernel : _panic_trap_to_debugger + 0x1a7
0xffffffff001d7e00 : 0xffffff80165db9a3 mach_kernel : _panic_with_options + 0x89
0xffffffff001d7ef0 : 0xffffff80191a191a com.apple.driver.watchdog : IOWatchdog::userspacePanic(OSObject*, void*, IOExternalMethodArguments*) (.cold.1)
0xffffffff001d7f20 : 0xffffff80191a10a1 com.apple.driver.watchdog : IOWatchdog::checkWatchdog() + 0xd7
0xffffffff001d7f50 : 0xffffff80174f960b com.apple.driver.AppleSMC : SMCWatchDogTimer::watchdogThread() + 0xbb
0xffffffff001d7fa0 : 0xffffff8015e1119e mach_kernel : _call_continuation + 0x2e
Kernel Extensions in backtrace:
com.apple.driver.watchdog(1.0)[BD08CE2D-77F5-358C-8F0D-A570540A0BE7]@0xffffff801919f000->0xffffff80191a1fff
com.apple.driver.AppleSMC(3.1.9)[DD55DA6A-679A-3797-947C-0B50B7B5B659]@0xffffff80174e7000->0xffffff8017503fff
dependency: com.apple.driver.watchdog(1)[BD08CE2D-77F5-358C-8F0D-A570540A0BE7]@0xffffff801919f000->0xffffff80191a1fff
dependency: com.apple.iokit.IOACPIFamily(1.4)[D342E754-A422-3F44-BFFB-DEE93F6723BC]@0xffffff8018446000->0xffffff8018447fff
dependency: com.apple.iokit.IOPCIFamily(2.9)[481BF782-1F4B-3F54-A34A-CF12A822C40D]@0xffffff80188b6000->0xffffff80188e7fff
Process name corresponding to current thread (0xffffff86e359cb30): kernel_task
Boot args: keepsyms=1
Mac OS version: 22H221
Kernel version: Darwin Kernel Version 22.6.0: Thu Sep 5 20:48:48 PDT 2024; root:xnu-8796.141.3.708.1~1/RELEASE_X86_64
The origin of the problem is surely inside my filesystem. However, the panic happens not there, but somewhere in the watchdog. As far as I can tell, the source code for the watchdog is not publicly available.
I can't understand what causes the panic. Let's say we have run out of space: the data couldn't be written, the write received a proper error and aborted. That's what I expect to happen.
However, it is unclear why the panic arises.
The origin of the problem is surely inside my filesystem. However, the panic happens not there, but somewhere in the watchdog. As far as I can tell, the source code for the watchdog is not publicly available.
I actually just wrote about the watchdog here, but the short summary is that certain critical daemons are required to confirm that user space is still responsive by periodically checking in with the kernel. If they fail to do so, the user space is presumed to be hung and the watchdog panics the kernel to clear the hang (so the system can return to normal operation) and collect diagnostic data (so we can fix the user space hang).
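To make that mechanism concrete, here's a purely illustrative model of the check-in logic. None of these names correspond to the real (private) IOWatchdog or watchdogd interfaces; it just shows the "check in within the deadline or the kernel panics" idea that the panic message above reflects:

```c
// Illustrative model only; not the actual IOWatchdog/watchdogd API.
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

#define WATCHDOG_TIMEOUT_SECS 90

static _Atomic time_t   g_last_checkin;    // updated each time the daemon checks in
static _Atomic unsigned g_total_checkins;  // the "(48 total checkins...)" count

// Called by the monitored daemon on a timer while user space is healthy.
void watchdog_checkin(void) {
    atomic_store(&g_last_checkin, time(NULL));
    atomic_fetch_add(&g_total_checkins, 1);
}

// Called periodically by the kernel-side watchdog thread. Returns true if
// the daemon has missed its deadline, i.e. user space appears hung and the
// system would panic to clear the hang and collect diagnostic data.
bool watchdog_expired(void) {
    time_t last = atomic_load(&g_last_checkin);
    return (time(NULL) - last) > WATCHDOG_TIMEOUT_SECS;
}
```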
Note that this means the kernel stack trace is largely useless, as it basically shows the same path the watchdog takes every time it panics. The system-wide stack trace is sometimes useful; however, it's difficult to symbolicate manually, hard to interpret unless you're very familiar with the system, and much of the time doesn't have any clear "smoking gun". The panic message itself does tell you which daemon failed to check in, and that can be a useful hint about what's actually going wrong.
However, in this case the daemon was watchdogd itself:
watchdog timeout: no checkins from watchdogd in 90
...which isn't very much of a hint.
I can't understand what causes the panic. Let's say we have run out of space: the data couldn't be written, the write received a proper error and aborted. That's what I expect to happen.
The first question I have here is: are you sure the space issue itself is actually the trigger? For example, does everything work fine if you copy 25 GB to a 30 GB target? Similarly, can you trigger the panic with smaller copies if you reduce the capacity?
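One way to probe that boundary without involving the Finder or a 20 GB source folder is a trivial fill test that writes until your volume refuses more data. The path and block size below are placeholders for your test setup:

```c
// Minimal fill-the-volume test. The default path is an assumption about
// where your test volume is mounted; pass your own as argv[1].
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    const char *path = (argc > 1) ? argv[1] : "/Volumes/TestFS/fill.bin";
    const size_t kBlockSize = 1 << 20;          // 1 MiB per write
    char *block = calloc(1, kBlockSize);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    unsigned long long written = 0;
    for (;;) {
        ssize_t n = write(fd, block, kBlockSize);
        if (n < 0) {
            // ENOSPC here is the "clean" failure you expect; a write that
            // hangs instead of returning is the kind of stall that can end
            // in a watchdog panic.
            fprintf(stderr, "write failed after %llu MiB: %s\n",
                    written >> 20, strerror(errno));
            break;
        }
        written += (unsigned long long)n;
    }
    close(fd);
    free(block);
    return 0;
}
```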
One thing to keep in mind here is that the system itself has very little visibility into how much free space you actually have. You do export some metadata to it, but that's largely for display purposes, not something the system alters its own behavior based on. The issue here is that the system doesn't actually have any way to predict* how the data it feeds into you will actually alter your available space, so all it can really do is push the data "to" you and let you fail the copy when you're full.
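In other words, the only real backpressure you have is the error you return from your own write path. As a hedged sketch (all names here are assumptions about your code, not any real API), the expected shape is roughly:

```c
// Hypothetical server-side write path: fail promptly with ENOSPC rather
// than blocking or waiting for space to appear.
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>

struct volume {
    uint64_t capacity_bytes;   // advertised capacity (18 GB in the test)
    uint64_t used_bytes;       // space already consumed
};

// Returns bytes written, or a negative errno.
ssize_t my_fs_write(struct volume *vol, uint64_t offset,
                    const void *buf, size_t len) {
    (void)offset; (void)buf;

    // A real filesystem also has to account for metadata, block rounding,
    // and any replication overhead, not just the payload bytes.
    uint64_t needed = len;

    if (vol->used_bytes + needed > vol->capacity_bytes)
        return -ENOSPC;        // fail the copy; never stall waiting for space

    vol->used_bytes += needed;
    // ... hand the payload to the storage backend ...
    return (ssize_t)len;
}
```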
Where is your data actually being stored, and how does it "reach" that location? One thing the watchdog panic does indicate is that this isn't about a problem specifically in the kernel itself; it's actually caused by a disruption in user space. If you have a daemon that's handling network I/O for you, that might be worth focusing on.
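If that's the shape of your daemon, the structural rule of thumb is that the thread answering the kernel's filesystem requests must never block indefinitely on the network. A hedged sketch of that separation, using real libdispatch calls but hypothetical hooks for everything else:

```c
// All function and type names here are stand-ins for whatever your daemon
// actually uses; only the libdispatch calls are real API.
#include <dispatch/dispatch.h>
#include <errno.h>
#include <stdbool.h>

typedef struct fs_request fs_request_t;                 // opaque request handle
extern bool network_send(const fs_request_t *req);      // may stall for a long time
extern void complete_request(fs_request_t *req, int error);

static dispatch_queue_t net_queue;   // serial queue dedicated to network I/O

void daemon_init(void) {
    net_queue = dispatch_queue_create("com.example.myfs.network",
                                      DISPATCH_QUEUE_SERIAL);
}

void handle_write_request(fs_request_t *req) {
    // Answer the kernel quickly; do the slow network work elsewhere.
    dispatch_async(net_queue, ^{
        bool ok = network_send(req);
        complete_request(req, ok ? 0 : EIO);
    });
    // If completion must be synchronous in your design, bound the wait
    // (e.g. dispatch_semaphore_wait with a timeout) instead of waiting
    // forever; an unbounded wait here is exactly the kind of user space
    // stall that can end in a watchdog panic.
}
```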
Finally, one technique I often use here is to focus on trying to make the problem worse, NOT on fixing it. For example, I might try doubling the I/O my daemon actually generates, under the theory that it's easier to artificially generate network traffic than to push more data through the VFS system. Similarly, if I'm concerned that this might be caused by how VFS operations are handled, you can easily generate VASTLY more VFS activity by simply discarding all "real" data and avoiding the cost of any actual transfer.
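For that second idea, a hypothetical "discard" switch in your write path might look like this; again, the names are stand-ins for whatever your daemon actually calls these things:

```c
// "Make it worse" switch: acknowledge writes without transferring anything,
// so you can drive far more VFS traffic through the same code paths in much
// less time.
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

extern ssize_t backend_store(uint64_t offset, const void *buf, size_t len);

static bool g_discard_data = true;   // enable for stress testing only

ssize_t my_fs_store(uint64_t offset, const void *buf, size_t len) {
    if (g_discard_data) {
        // Keep all the bookkeeping (space accounting, metadata, locking)
        // but skip the actual transfer. If the panic still reproduces, the
        // problem is in the VFS/bookkeeping path; if it disappears, suspect
        // the transfer/network side.
        return (ssize_t)len;
    }
    return backend_store(offset, buf, len);
}
```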
*The Finder preflights copies because, in practice, the benefit of refusing to start a copy that will probably fail outweighs the downside of refusing a copy that would have succeeded. That's primarily because the characteristics of common file systems and the data users typically copy mean that the Finder's preflight is generally correct about whether or not a copy will succeed. However, you can absolutely create data sets that fail in either direction and, of course, other I/O can always invalidate the initial preflight.
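Roughly speaking, that kind of preflight is just an estimate along these lines (an illustration of the idea, not the Finder's actual implementation), which is exactly why it can be wrong in both directions:

```c
// Compare the source size with the destination's advertised free space.
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/param.h>
#include <sys/mount.h>   // statfs()

bool copy_looks_feasible(uint64_t bytes_to_copy, const char *dst_path) {
    struct statfs fs;
    if (statfs(dst_path, &fs) != 0)
        return true;   // can't tell; let the copy proceed and fail for real

    uint64_t free_bytes = (uint64_t)fs.f_bavail * (uint64_t)fs.f_bsize;

    // This ignores per-file metadata, block rounding, compression, and any
    // other I/O happening while the copy runs.
    return bytes_to_copy <= free_bytes;
}

int main(void) {
    // Example: would a ~20 GB copy fit on the volume at this (assumed) path?
    bool ok = copy_looks_feasible(20ULL << 30, "/Volumes/TestFS");
    printf("preflight says: %s\n", ok ? "should fit" : "will not fit");
    return 0;
}
```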
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware