Hello guys,
We are receiving feedback from various users who are hitting kernel panics when using one of our products. Our analysis of the crash reports shows that all panic traces report the exact same panic cause:
Sleep transition timed out after 35 seconds while creating hibernation file or while calling rootDomain's clients about upcoming rootDomain's state changes.
Various versions of macOS are affected, including the latest ones.
It seems obvious, from the user feedback we have, that our product plays a role in those kernel panics. But we can see on the forums that the issue is not specific to our users.
Our product does use not-so-common APIs (notably, it uses the EndpointSecurity API in AUTH mode for some events), and it can generate fairly heavy disk I/O, with a memory footprint of several hundred MB.
My understanding of hibernation is that when it happens, the applications are frozen (i.e. with no access to the CPU), and thus that no endpoint security event would be generated during the hibernation process. As a consequence, we did not implement any specific behavior for hibernation. Do you think it is a valid assumption?
Our product does use not-so-common APIs (notably, it uses the EndpointSecurity API in AUTH mode for some events), and it can generate fairly heavy disk I/O, with a memory footprint of several hundred MB.
First off, I need to pass along the standard warning I pass to every developer using the EndpointSecurity API. This API is easily one of the most difficult and dangerous APIs on the system. The reasons for this include (but are not limited to...):
- The API scope is enormous, basically allowing a client to disrupt almost "everything" the system does.
- An ES client's interactions with the system can in turn generate additional auth events. Particularly if/when multiple ES clients are involved, this can create arbitrarily complex recursive loops, which can obviously be... problematic.
- The consequences of disrupting system performance are very difficult to predict, but include everything from performance disruptions up to and including kernel panics.
- All of these issues are generally invisible under simple tests or "basic" load. That makes it very easy to write an ES client that appears to "work" (in the sense that it doesn't cause any immediate failure) but is in fact quite badly designed and will inevitably cause problems later.
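To make the recursion point above a bit more concrete: one simple form of it is a client whose own disk I/O produces AUTH events that the same client then has to answer. The following is only a minimal sketch of a guard against that case, written in portable C. `auth_event_t`, `verdict_t`, and `classify_event` are hypothetical names invented for illustration and are not part of EndpointSecurity. In a real client, muting your own process with `es_mute_process` is likely the more direct way to get the same effect, but whether that fits depends on your design, and it does nothing for loops that involve other ES clients.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical stand-in for an AUTH event. A real client receives an
 * es_message_t from the framework; this struct only models the two
 * fields the sketch needs. */
typedef struct {
    pid_t source_pid;   /* process whose action triggered the event */
    const char *path;   /* file the action targets */
} auth_event_t;

/* Hypothetical verdicts for this sketch. */
typedef enum {
    VERDICT_ALLOW_FAST,        /* answer immediately, do no extra work */
    VERDICT_NEEDS_INSPECTION   /* hand off to the slower analysis path */
} verdict_t;

/* Decide whether an event needs the expensive inspection path.
 * Events caused by the client itself are allowed right away, so the
 * client's own file activity cannot feed new work back into its own
 * event queue. */
verdict_t classify_event(const auth_event_t *ev, pid_t own_pid) {
    if (ev->source_pid == own_pid) {
        return VERDICT_ALLOW_FAST;
    }
    return VERDICT_NEEDS_INSPECTION;
}
```

A caller would pass its own pid (for example from `getpid()`) as `own_pid` and only queue events classified as `VERDICT_NEEDS_INSPECTION`.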
All these factors together mean that ES clients rarely fail in "clean" ways (like simple crashes). That is, most ES client problems take the form of "My ES client works most of the time except for <specific, often weird failure> when <list of odd/random/unclear requirements>". The key thing to understand here is that the "weird failure" is rarely the real problem, but is almost always a symptom of some other structural problem in how the client processes events. If you only focus on the specific failure you can end up stuck fixing an endless stream of "random" failures as the system/apps find new and interesting ways to trip over your ES client.
My understanding of hibernation is that when it happens, the applications are frozen (i.e. with no access to the CPU), and thus that no endpoint security event would be generated during the hibernation process. As a consequence, we did not implement any specific behavior for hibernation. Do you think it is a valid assumption?
Yes... however, if you pick apart the error message:
"while creating hibernation file"
I haven't looked at the full implementation in detail, but it's almost certainly the case that the "creating" process covers more time than JUST the process of writing the file after the system has been frozen. It's very likely that there are ES "relevant" events earlier in the hibernation process.
However, more importantly:
"or while calling rootDomain's clients about upcoming rootDomain's state changes."
...you can't ignore the "or". There are lots of different power state changes besides hibernation, and your ES client would have been active for most of them.
As a broader comment, one thing to keep in mind when looking at our error messages is that most of them were added at a very specific point in time for specific reasons, which may not match the scenario you're actually looking at. Case in point here, that particular message comes from kIOPMTracePointSleepWillChangeInterests found here. That's a very general transition that isn't particularly specific to hibernation. I suspect the reason hibernation is called out there is that it was what we were trying to debug at the time, not that it was specific to this error.
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware