Odd memory usage in user space application causing kernel panics

Hello,

We are developing a multimedia routing platform written in Rust that uses GStreamer 1.20. We are targeting Mac minis (older Intel and newer Apple silicon M1/M2/M3 and later, with 8 GB of RAM) running macOS 14.6.1.

I have profiled memory usage using Xcode Instruments with the Allocations tool; stack and heap memory is very stable once the pipelines are up and running.

There are between 50 and 100 incoming RTSP streams with multiple WebRTC connections, so a lot of network and memory bandwidth is being used.

However, we eventually see real memory usage increasing in Activity Monitor, along with rising memory pressure, while heap/stack usage stays constant in Instruments, so we do not understand this behavior. Page fragmentation is a possibility, but we have not been able to prove this with Instruments.
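
To correlate what Activity Monitor shows with compressor and swap activity, something like the following could run alongside Instruments during a test. This is only a rough sketch: it shells out to vm_stat once a minute and logs the compressor- and swap-related counters; the interval and the filtering are arbitrary choices, not part of our product.

```rust
use std::process::Command;
use std::{thread, time::Duration};

fn main() {
    loop {
        // vm_stat reports system-wide paging/compressor counters on macOS.
        let out = Command::new("vm_stat")
            .output()
            .expect("failed to run vm_stat");
        let text = String::from_utf8_lossy(&out.stdout);

        // Keep only the compressor- and swap-related lines; the exact field
        // names are whatever vm_stat prints on this OS version.
        for line in text.lines() {
            if line.contains("compressor")
                || line.contains("Compressions")
                || line.contains("Swapins")
                || line.contains("Swapouts")
            {
                println!("{line}");
            }
        }
        println!("---");
        thread::sleep(Duration::from_secs(60));
    }
}
```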

Please see the attached image. You can see that a 10-minute run had a total of approximately 4.3 GB of allocations, but only 50.17 MB persistent.

Eventually we see kernel panics, either "userspace watchdog timeout: no successful checkins from WindowServer (2 induced crashes) in 120 seconds" or "apcie[2:lan-1gb]::handleCompletionTimeoutInterrupt: completion timeout", which I believe are caused by high system load and the kernel becoming unresponsive while it is doing page compression. We tested running with jemalloc for a while, but the kernel panics still occurred.
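
For completeness, swapping jemalloc in as the Rust global allocator looks roughly like this (a sketch using the tikv-jemallocator crate; any jemalloc binding is wired in the same way). Note that this only redirects Rust-side allocations, which may be why it made no difference for us.

```rust
// Cargo.toml (assumed): tikv-jemallocator = "0.5"
use tikv_jemallocator::Jemalloc;

// Route all Rust heap allocations through jemalloc instead of the system
// malloc. This does NOT affect allocations made inside GStreamer's C code,
// which still go through the system allocator.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // ... existing pipeline setup ...
}
```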

We have multiple kernel panic recordings available, but they are too large to upload here. We are also having multiple kernel panics per day while running this application.

Any suggestions on how to prevent these kernel panics? If the system is out of memory, shouldn't our application crash with an out-of-memory error rather than the kernel panicking?

Thanks, Jeremy Prater

Answered by DTS Engineer in 825513022

We have multiple kernel panic recordings available, but they are too large to upload here. We are also having multiple kernel panics per day while running this application.

Please file a bug on this and then post the bug number back here. My initial thoughts are below, but I'd like to see the data you have and, by definition, all panics are "bugs"*.

So, let me start with some background and comments on this point:

If the system is out of memory, shouldn't our application crash with an out-of-memory error rather than the kernel panicking?

First off, in practice, the combination of virtual memory and fairly large storage capacity means that memory exhaustion isn't really a meaningful failure point anymore. That is, while it's possible to run out of storage and lose the ability to allocate memory (and, yes, that's a situation the system will deal with), the more common case is that "something else" renders the system unusable well before that point. Before SSDs became common, "swap death" was a common failure mode on most Unix machines: the demand for VM swapping consumed so much processing time that forward progress became impossible.

That leads to here:

Eventually we see kernel panics, either "userspace watchdog timeout: no successful checkins from WindowServer (2 induced crashes) in 120 seconds" or "apcie[2:lan-1gb]::handleCompletionTimeoutInterrupt: completion timeout"

Watchdog panics like this are actually the solution to a paradox that ongoing kernel improvements have created over time. In its most basic form, a kernel panic is caused by code in the kernel encountering a condition it has no good "solution" to at that particular execution point. Assuming that condition cannot be removed (which it generally can't), solving that panic means modifying the code path/architecture so that the condition is exported to user space, where it can then be addressed. In the simplest variant of this, you can imagine a syscall returning an error where it would previously have panicked, "solving" the kernel panic.

Unfortunately, the paradox here is: what happens if that process responds by retrying the same syscall without the condition being resolved? Practically, the result has been that a kernel panic (very bad) has (potentially) been replaced with a permanently hung system that doesn't produce any diagnostic data (even worse). From the user's perspective nothing has changed (their computer still doesn't work), so we've replaced one bad problem we knew about with a different problem we DON'T know about. Even worse, the scenario I outlined above is a wildly oversimplified version of how something like that plays out. Real-world issues will be far more complex, making good diagnostic data even more important.

That reality is what the panic above tries to address. What's actually going on here is that the kernel monitors system-critical processes by requiring them to regularly "check in" with the kernel. Failing to do so indicates that the system is no longer responsive, so the kernel triggers a panic as the simplest mechanism for collecting the diagnostic data necessary for us to resolve the hang.
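
To make the check-in idea concrete, here is a purely illustrative sketch of the pattern in user-space terms. None of these names correspond to the real kernel or WindowServer interfaces; it only shows the shape of the mechanism: a monitored task that must check in periodically, and a watchdog that gives up and collects diagnostics (the kernel's equivalent is a panic) when the check-ins stop.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

// Purely illustrative; the real kernel/WindowServer check-in mechanism
// is not a public API and does not look like this.
fn main() {
    let start = Instant::now();
    let last_checkin = Arc::new(AtomicU64::new(0));

    // The "system-critical process": checks in once a second while healthy.
    {
        let last_checkin = Arc::clone(&last_checkin);
        thread::spawn(move || loop {
            last_checkin.store(start.elapsed().as_secs(), Ordering::Relaxed);
            thread::sleep(Duration::from_secs(1));
        });
    }

    // The "watchdog": if no check-in has arrived for 120 seconds, stop and
    // capture diagnostics. The kernel's equivalent of this branch is a panic.
    loop {
        thread::sleep(Duration::from_secs(10));
        let age = start.elapsed().as_secs() - last_checkin.load(Ordering::Relaxed);
        if age > 120 {
            panic!("no successful check-ins for {age} seconds");
        }
    }
}
```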

Related to that point:

which I believe are caused by high system load and the kernel becoming unresponsive while it is doing page compression.

Keep in mind that there isn't any specific "cause" for a watchdog panic; it's triggered based on the kernel's interactions with user space. I certainly wouldn't assume that memory is involved unless there is some other very clear indicator.

Any suggestions on how to prevent these kernel panics?

One thing that did jump out at me is the combination of:

apcie[2:lan-1gb]::handleCompletionTimeoutInterrupt: completion timeout

Note, "lan-1gb", which means this is the ethernet controller, which you're obviously putting under heavy load:

There are between 50 and 100 incoming RTSP streams with multiple WebRTC connections, so a lot of network and memory bandwidth is being used.

One thing I would try here is using the Network Link Conditioner to artificially constrain network bandwidth. If this is caused by network load then, in theory, reducing activity would prevent the panic.
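
Another way to run essentially the same experiment from inside your own stack is to bring the streams up gradually and note the load level at which the system starts misbehaving. Here's a rough sketch, assuming the gstreamer crate (gstreamer-rs) and throwaway fakesink pipelines; the URLs, latency value, and per-step soak time are placeholders, not recommendations.

```rust
// Cargo.toml (assumed): gstreamer = "0.19"
use gstreamer as gst;
use gst::prelude::*;
use std::{env, thread, time::Duration};

// Bring up one RTSP source at a time so the load level at which the system
// starts misbehaving can be bracketed.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    gst::init()?;

    let urls: Vec<String> = env::args().skip(1).collect();
    let mut pipelines = Vec::new();

    for (i, url) in urls.iter().enumerate() {
        let pipeline = gst::parse_launch(&format!(
            "rtspsrc location={url} latency=200 ! fakesink"
        ))?;
        pipeline
            .set_state(gst::State::Playing)
            .expect("pipeline refused to start");
        pipelines.push(pipeline);

        println!("now running {} streams", i + 1);
        // Soak at this load level before adding the next stream.
        thread::sleep(Duration::from_secs(600));
    }
    Ok(())
}
```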

*That doesn't mean you should ignore the problem or assume we'll fix it. A fix may not be practical or feasible on our side and, more importantly, fixing our panic doesn't mean your app will work. As the most extreme example, the kernel is free to solve its panic by killing your process, which doesn't really help your product.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Kevin,

Please see feedback report: 16535059

https://feedbackassistant.apple.com/feedback/16535059

Thanks, Jeremy
