Kernel panic related to Watchdog in custom virtual file system

Hi. I am facing a panic in a distributed virtual file system of my own making. The panic arises when copying a large folder or writing a large file (both around 20 GB). An important note: the amount of data we try to copy is larger than the available space (for testing purposes, the virtual file system had a capacity of 18 GB).

  1. The panic arises somewhere around 12-14 GB into the copy. At the moment of the panic, there are still several gigabytes of storage left.
  2. The problem is reproducible on at least the following architectures and macOS versions:

Sonoma 14.7.1 (arm64e)
Monterey 12.7.5 (arm64e)
Ventura 13.7.1 (Intel)

  3. Part of the panic log from Ventura 13.7.1 (Intel), with symbolicated addresses:

panic(cpu 2 caller 0xffffff80191a191a): watchdog timeout: no checkins from watchdogd in 90 seconds (48 total checkins since monitoring last enabled)
Panicked task 0xffffff907c99f698: 191 threads: pid 0: kernel_task
Backtrace (CPU 2), panicked thread: 0xffffff86e359cb30
Frame : Return Address
0xffffffff001d7bb0 : 0xffffff8015e70c7d mach_kernel : _handle_debugger_trap + 0x4ad
0xffffffff001d7c00 : 0xffffff8015fc52e4 mach_kernel : _kdp_i386_trap + 0x114
0xffffffff001d7c40 : 0xffffff8015fb4df7 mach_kernel : _kernel_trap + 0x3b7
0xffffffff001d7c90 : 0xffffff8015e11971 mach_kernel : _return_from_trap + 0xc1
0xffffffff001d7cb0 : 0xffffff8015e70f5d mach_kernel : _DebuggerTrapWithState + 0x5d
0xffffffff001d7da0 : 0xffffff8015e70607 mach_kernel : _panic_trap_to_debugger + 0x1a7
0xffffffff001d7e00 : 0xffffff80165db9a3 mach_kernel : _panic_with_options + 0x89
0xffffffff001d7ef0 : 0xffffff80191a191a com.apple.driver.watchdog : IOWatchdog::userspacePanic(OSObject*, void*, IOExternalMethodArguments*) (.cold.1)
0xffffffff001d7f20 : 0xffffff80191a10a1 com.apple.driver.watchdog : IOWatchdog::checkWatchdog() + 0xd7
0xffffffff001d7f50 : 0xffffff80174f960b com.apple.driver.AppleSMC : SMCWatchDogTimer::watchdogThread() + 0xbb
0xffffffff001d7fa0 : 0xffffff8015e1119e mach_kernel : _call_continuation + 0x2e
Kernel Extensions in backtrace:
com.apple.driver.watchdog(1.0)[BD08CE2D-77F5-358C-8F0D-A570540A0BE7]@0xffffff801919f000->0xffffff80191a1fff
com.apple.driver.AppleSMC(3.1.9)[DD55DA6A-679A-3797-947C-0B50B7B5B659]@0xffffff80174e7000->0xffffff8017503fff
dependency: com.apple.driver.watchdog(1)[BD08CE2D-77F5-358C-8F0D-A570540A0BE7]@0xffffff801919f000->0xffffff80191a1fff
dependency: com.apple.iokit.IOACPIFamily(1.4)[D342E754-A422-3F44-BFFB-DEE93F6723BC]@0xffffff8018446000->0xffffff8018447fff
dependency: com.apple.iokit.IOPCIFamily(2.9)[481BF782-1F4B-3F54-A34A-CF12A822C40D]@0xffffff80188b6000->0xffffff80188e7fff

Process name corresponding to current thread (0xffffff86e359cb30): kernel_task
Boot args: keepsyms=1

Mac OS version: 22H221

Kernel version: Darwin Kernel Version 22.6.0: Thu Sep 5 20:48:48 PDT 2024; root:xnu-8796.141.3.708.1~1/RELEASE_X86_64


The origin of the problem is surely inside my file system. However, the panic does not happen there but somewhere in the watchdog. As far as I can tell, the source code for the watchdog is not publicly available.

I can't understand what causes the panic. Let's say we have run out of space: the data couldn't be written, the write received a proper error and aborted. That's what I would expect.

However, it is unclear why the panic arises.

Answered by DTS Engineer in 825529022

The origin of the problem is surely inside my file system. However, the panic does not happen there but somewhere in the watchdog. As far as I can tell, the source code for the watchdog is not publicly available.

I actually just wrote about the watchdog here, but the short summary is that certain critical daemons are required to confirm that user space is still responsive by periodically checking in with the kernel. If they fail to do so, the user space is presumed to be hung and the watchdog panics the kernel to clear the hang (so the system can return to normal operation) and collect diagnostic data (so we can fix the user space hang).

Note that this means that the kernel stack trace is largely useless, as it basically just shows the way the watchdog panics every time it panics. The system-wide stack trace is sometimes useful; however, it's difficult to manually symbolicate, hard to interpret unless you're very familiar with the system, and much of the time doesn't really have any clear "smoking gun". The panic message itself does tell you which daemon failed to check in, and that can be a useful hint about what's actually going wrong.

However, in this case the daemon was watchdogd itself:

watchdog timeout: no checkins from watchdogd in 90 

...which isn't very much of a hint.

I can't understand what causes the panic. Let's say we have run out of space: the data couldn't be written, the write received a proper error and aborted. That's what I would expect.

The first question I have here is: are you sure that this space issue itself is critical? For example, does everything work fine if you copy 25 GB to a 30 GB target? Similarly, can you trigger the panic with smaller copies if you reduce the capacity?

One thing to keep in mind here is that the system itself has very little visibility into how much free space you actually have. You do export some metadata to it, but that's largely for display purposes, not something it actually alters its own behavior based on. The issue here is that the system doesn't actually have any way to predict* how the data it feeds into you will actually alter your available space, so all it can really do is push the data "to" you and let you fail the copy when you're full.
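
For context, the only "free space" the kernel ever sees from a VFS plug-in is whatever the plug-in reports through its vfs_getattr entry point, roughly like the sketch below (the myfs_mount type and its fields are hypothetical). statfs() and the Finder read these numbers, but nothing in the write path uses them to decide whether a given write will fit.

    #include <sys/mount.h>
    #include <sys/errno.h>

    struct myfs_mount {                  /* hypothetical per-mount bookkeeping */
        uint32_t block_size;
        uint64_t total_blocks;
        uint64_t free_blocks;
    };

    static int
    myfs_vfs_getattr(mount_t mp, struct vfs_attr *attr, __unused vfs_context_t ctx)
    {
        struct myfs_mount *mmp = vfs_fsprivate(mp);

        /* Advertised capacity and free space; informational only. */
        VFSATTR_RETURN(attr, f_bsize,  mmp->block_size);
        VFSATTR_RETURN(attr, f_blocks, mmp->total_blocks);
        VFSATTR_RETURN(attr, f_bfree,  mmp->free_blocks);
        VFSATTR_RETURN(attr, f_bavail, mmp->free_blocks);
        return 0;
    }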

Where is your data actually being stored and how does it "reach" that location? One thing the watchdog panic does indicate is that the panic isn't about a problem specifically in the kernel itself, but is actually caused by a disruption in user space. If you have a daemon that's handling network I/O for you, then that might be worth focusing on.

Finally, one technique I often use here is to focus on trying to make the problem worse, NOT fixing it. For example, I might try doubling the I/O my daemon actually generates, under the theory that it's easier to artificially generate network traffic than to push more data through the VFS system. Similarly, if I'm concerned that this might be caused by how VFS operations are handled, you can easily generate VASTLY more VFS activity by simply discarding all "real" data and avoiding the cost of any actual transfer.
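
As a rough illustration of that last idea, a debug switch in the strategy routine (the myfs names and the toggle are hypothetical) can throw the data away and complete the buffer immediately, which lets you drive far more VFS traffic through the kext without paying for any real transfer:

    #include <sys/buf.h>
    #include <sys/vnode.h>

    static int myfs_discard_io = 1;      /* hypothetical debug toggle, e.g. set via sysctl */

    static int
    myfs_vnop_strategy(struct vnop_strategy_args *ap)
    {
        buf_t bp = ap->a_bp;

        if (myfs_discard_io) {
            /*
             * Pretend the whole transfer succeeded without touching the network.
             * Reads will come back with stale/zeroed data, which is fine for a
             * stress test of the VFS plumbing itself.
             */
            buf_setresid(bp, 0);
            buf_biodone(bp);
            return 0;
        }
        /* ... normal path: hand the request to the transport ... */
        return 0;
    }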

*The Finder preflights copies because, in practice, the benefit of failing to start a copy that will probably fail outweighs the downside of failing a copy that would have succeeded. That's primarily because the characteristics of common file systems and the data users typically copy mean that the Finder's preflight is generally correct about whether or not a copy will succeed. However, you can absolutely create data sets that fail in either direction and, of course, other I/O can always invalidate the initial preflight.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hi. First of all, thank you. The panic was indeed caused by a storage overflow on my server side: an endless stream of rejects, and an endless stream of attempts from the client to push even more data. I guess somewhere in there the watchdog triggers.

What I didn't see at first is that the error does reach our client (a proper one, 28, ENOSPC) during the write operation. However, the I/O (a write in this particular case) is being handled by vnop_strategy. Had it been a normal vnop_write, the returned error would simply be passed back to the calling process, which would report it in a system-defined way.

However, when the same problem occurs in vnop_strategy, it seems like the return value from vnop_strategy doesn't play any role. The caller doesn't seem to care whether it was a success or an error.

I have already fixed more than 10 bugs that produced no "symptoms" - no errors, just hangs and stalled execution - all of which stayed hidden until I added proper debugging for vnop_strategy.

For this problem, the "best" solution (really more of a "hack") I have come up with is to simply terminate the parent process (proc_signal(pid, SIGKILL)). It does resolve the panic. However, once again, it will not print any error message, neither to the console nor to applications.

Is there any way to signal this to the file system in a better way? Or how can I make the system treat vnop_strategy errors like other vnop_* errors and just return them to the calling processes?

P.S. The return value of vnop_strategy is indeed ignored by the caller:

https://github.com/apple-oss-distributions/xnu/blob/main/bsd/vfs/vfs_cluster.c#L2313

So, let me start with the basic issue here:

P.S. The return value of vnop_strategy is indeed ignored by the caller.

So, the first thing to understand here is why "VNOP_WRITE" and "VNOP_STRATEGY" both exist. Start with VNOP_WRITE:

 *  @discussion VNOP_WRITE() is to write() as VNOP_READ() is to read().  The filesystem may use
 *  the buffer cache, the cluster layer, or an alternative method to write its data; uio routines will be used to see that data
 *  is copied to the correct virtual address in the correct address space and will update its uio argument
 *  to indicate how much data has been moved.

The critical sentence here is:

"The filesystem may use the buffer cache, the cluster layer, or an alternative method to write its data"

In other words, VNOP_WRITE does NOT have to actually write anything to disk. Most of the time, it will actually be moving the data into the cache to be written out to disk at some later point.
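
A typical write path looks roughly like the sketch below (the myfs_node type and its size field are hypothetical): the data goes into the unified buffer cache via cluster_write(), and success here says nothing about whether it has reached the backing store yet.

    #include <sys/vnode.h>
    #include <sys/uio.h>
    #include <sys/ubc.h>

    struct myfs_node {                   /* hypothetical per-vnode bookkeeping */
        off_t size;
    };

    static int
    myfs_vnop_write(struct vnop_write_args *ap)
    {
        vnode_t           vp  = ap->a_vp;
        struct uio       *uio = ap->a_uio;
        struct myfs_node *np  = vnode_fsnode(vp);

        off_t oldEOF = np->size;
        off_t endOff = uio_offset(uio) + uio_resid(uio);
        off_t newEOF = (endOff > oldEOF) ? endOff : oldEOF;

        /*
         * cluster_write() copies the data into the cache; the real I/O happens
         * later, when the cluster layer pushes dirty pages back down through
         * the strategy/pageout paths.  ioflag bits such as IO_SYNC pass
         * through as the cluster xflags here.
         */
        int error = cluster_write(vp, uio, oldEOF, newEOF, 0, 0, ap->a_ioflag);
        if (error == 0 && newEOF > oldEOF) {
            np->size = newEOF;
            ubc_setsize(vp, newEOF);     /* keep the UBC's idea of EOF in sync */
        }
        return error;
    }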

Moving to "VNOP_STRATEGY":

 *  @discussion A filesystem strategy routine takes a buffer, performs whatever manipulations are necessary for passing
 *  the I/O request down to the device layer, and calls the appropriate device's strategy routine.  Most filesystems should
 *  just call buf_strategy() with "bp" as the argument.
 

The key words there are "passing the I/O request down". It initiates a request, which means it won't know what happened to the I/O until some later point (when the request completes).
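
In practice that means a strategy routine reports failure when the request completes, by attaching the error to the buffer with buf_seterror() and calling buf_biodone(), rather than through its own return value. A minimal sketch (myfs_send_to_server is a hypothetical transport call):

    #include <sys/buf.h>
    #include <sys/vnode.h>

    /* hypothetical: pushes the buffer's data to the remote server */
    extern int myfs_send_to_server(buf_t bp);

    static int
    myfs_vnop_strategy(struct vnop_strategy_args *ap)
    {
        buf_t bp = ap->a_bp;
        int error = myfs_send_to_server(bp);

        if (error != 0) {
            /*
             * Mark the buffer as failed (ENOSPC, EIO, ...) and report that no
             * bytes were transferred, then complete it.  The layer that issued
             * the buffer picks the error up when the I/O "finishes"; the value
             * returned from this function is not what carries it.
             */
            buf_seterror(bp, error);
            buf_setresid(bp, buf_count(bp));
            buf_biodone(bp);
        }
        /* On success, buf_biodone() is called once the server acknowledges the I/O. */
        return 0;
    }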

That leads to here:

Or how can I make the system treat vnop_strategy errors like other vnop_* errors and just return them to the calling processes?

What calling process? I'd have to dig more to figure out where they're initiated, but VNOP_STRATEGY is part of the VFS system's overall "infrastructure", not the backend of a particular syscall. In real-world usage, I wouldn't assume the calling process is necessarily meaningful.

Is there any way to signal this to the file system in a better way?

So, I think there are two different issues at work here:

  1. Your VFS driver's job is to "protect" the kernel and the larger system, so it needs to fail operations in a way that doesn't disrupt the system further. Part of that also means that you may end up reporting systemic failures even if a particular operation could succeed. For example, if you "know" your connection to your server is dead, succeeding at particular reads because you HAPPEN to already have the data is probably worse than simply failing all reads. Having a consistent result is more useful than the "mixed" behavior (see the sketch after this list).

  2. I haven't looked closely at how exactly the system does this, but network file systems often require some kind of "side channel" mechanism for reporting errors back to the user. For example, the SMB connection dialogs you see when a server disconnects.
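
As a sketch of point 1 (the myfs_mount type and its is_dead flag are hypothetical), once the kext knows the backing store is gone it can refuse every operation up front instead of letting some requests limp through the cache:

    #include <sys/vnode.h>
    #include <sys/mount.h>
    #include <sys/errno.h>

    struct myfs_mount {
        volatile int is_dead;            /* set when the server connection is lost */
    };

    static int
    myfs_vnop_read(struct vnop_read_args *ap)
    {
        struct myfs_mount *mmp = vfs_fsprivate(vnode_mount(ap->a_vp));

        if (mmp->is_dead) {
            return EIO;                  /* consistent, immediate failure */
        }
        /* ... normal read path ... */
        return 0;
    }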

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
