Built-in Spin Control?

On and off I've been trying to figure out how to do hang detection in-application (at least from the user's point of view). Qualitatively what I'd like to do is have a process which runs sample(1) on the application after it's been unresponsive for more than a second or so. Basically, an in-app replacement for Spin Control. The problem I've been stuck on is: how do I tell?

There used to be Core Graphics SPI (CGSRegisterNotifyProc with a value of kCGSEventNotificationAppIsUnresponsive) for doing this, but it doesn't work anymore (either due to sandboxing or system-wide security changes, I can't tell which but it doesn't matter).

One thought I had was to have an XPC service which would expect to receive a checkin once per second from the host (via a timer set up by the host). If it didn't, it would start sample(1). This seems pretty heavyweight to me, since it means that once per second, I'm going to be consuming cycles to check in with the service. But I haven't been able to come up with a scheme that doesn't include some kind of check-in by the target process.

Are there any APIs or strategies that I could use to accomplish this? Or is there some entitlement which would allow the application to request "application became unresponsive"/"application became responsive" notifications from the window server?

Answered by DTS Engineer in 807212022

This is an interesting problem, and Kevin and I sat down to chat about it yesterday. We have some suggestions for you but, before I go into the details, I have a three-point preface:

  • Have you looked at MetricKit for this? It seems like its MXHangDiagnostic payload would be really helpful. And it’s definitely a lot easier than anything I’m suggesting here.

  • Beyond that, there’s no API for this. If you’d like to see us add something to help in this space, you should file an enhancement request describing your requirements. If you do, please post your bug number, just for the record.

  • The approach I’m going to suggest is risky. It has many of the same challenges as building a crash reporter, which is something I cover in depth in Implementing Your Own Crash Reporter. If you do go down this path, read that post carefully.


In terms of doing this yourself, here’s how I’d approach it.

  1. Start a thread that preallocates all of the necessary resources and waits for events.

  2. Add something to your run loop that pings that thread.

  3. If the thread doesn’t get a ping from the run loop within your timeout, have your thread sublaunch a helper tool.

  4. Have that helper tool run sample on your app and wrangle the result.

IMPORTANT This thread is kinda like an async signal handler. It can run at any time, including when various global locks are held. In fact, that’s exactly the sort of time when you’d expect it to run! So it can’t use Objective-C or Swift, call malloc, and so on. It has to be C or C++, and it’d be best if you restricted it to using just system calls [1].

That’s why step 1 preallocates stuff. You don’t want to call NSBundle to locate the helper tool when your process is potentially hung, so do that stuff in advance [2]. When it determines that the main thread is hung, the process should run the helper tool with posix_spawn and that’s about it.
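A minimal sketch of that split between "prepare early" and "spawn late", in C. The names (struct hang_helper, hang_helper_prepare, hang_helper_launch) are hypothetical, and passing the app's pid as the helper's only argument is an assumption about how the helper would know what to sample. The path is expected to come from normal startup code, for example from NSBundle, before anything can hang.

```c
#include <assert.h>
#include <spawn.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* In an executable this symbol is available directly. */
extern char **environ;

/* Everything the monitor thread needs, built once at startup while it is
   still safe to format strings and call higher-level APIs. */
struct hang_helper {
    char path[1024];    /* absolute path to the helper tool (hypothetical) */
    char pid_text[16];  /* our own pid, preformatted as decimal text */
    char *argv[3];      /* points into the two buffers above */
};

static struct hang_helper g_helper;

/* Call early, from ordinary startup code. */
void hang_helper_prepare(const char *helper_path) {
    assert(helper_path != NULL);
    assert(strlen(helper_path) < sizeof g_helper.path);
    strcpy(g_helper.path, helper_path);
    snprintf(g_helper.pid_text, sizeof g_helper.pid_text, "%d", (int)getpid());
    g_helper.argv[0] = g_helper.path;
    g_helper.argv[1] = g_helper.pid_text;
    g_helper.argv[2] = NULL;
}

/* Call from the monitor thread once it decides the main thread is hung.
   It only reads the prebuilt buffers and calls posix_spawn.
   Returns the child's pid, or -1 if the spawn failed. */
pid_t hang_helper_launch(void) {
    pid_t child = -1;
    int err = posix_spawn(&child, g_helper.path, NULL, NULL,
                          g_helper.argv, environ);
    return err == 0 ? child : -1;
}
```

The idea would be to call hang_helper_prepare once during launch and hang_helper_launch only from the monitor thread. Whether the monitor reaps the child with waitpid or leaves that to some other part of the app is a design choice this sketch doesn't make.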

Regarding step 2, Kevin and I had different suggestions on that front. I’d experiment with using a run loop observer. The .afterWaiting activity is a good place to ‘start’ your timer, and the .beforeWaiting activity is a good place to ‘stop’ it.

IMPORTANT I’m not literally talking about a timer. Rather, you’d do something to deactivate the waiting thread while your main thread is waiting in the run loop.

OTOH, Kevin suggested using just a timer for this. Add a slow-running NSTimer that pings the waiting thread and that’s it. That has the disadvantage of preventing your app from suspending for long periods of time, but it’s a lot simpler and, if you choose a sufficiently long interval, it’s not going to have much impact. Kevin’s point is that simplicity trumps absolute efficiency in this space, and I can’t disagree with that (-:

In terms of how to communicate between threads, I’d probably use a Unix domain socket for that. The advantage of a socket is that you can send and receive messages and block with a timeout, all using system calls.
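Here is one way the socket idea could look in C, as a sketch rather than a vetted implementation. It assumes a datagram socketpair created at startup, with the main thread sending a one-byte ping and the monitor thread blocking in poll with a timeout. All of the function and variable names are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

static int g_ping_fd = -1;  /* main thread sends on this end */
static int g_wait_fd = -1;  /* monitor thread receives on this end */

/* Call once at startup, before the monitor thread exists.
   Returns 0 on success, -1 on failure. */
int watchdog_channel_open(void) {
    int fds[2];
    assert(g_ping_fd == -1 && g_wait_fd == -1);  /* open only once */
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) != 0) {
        return -1;
    }
    g_ping_fd = fds[0];
    g_wait_fd = fds[1];
    return 0;
}

/* Main thread: send one ping. MSG_DONTWAIT keeps this from blocking,
   so a full socket buffer can never stall the run loop; a dropped ping
   is simply ignored. */
void watchdog_ping(void) {
    char byte = 'p';
    (void)send(g_ping_fd, &byte, 1, MSG_DONTWAIT);
}

/* Monitor thread: wait up to timeout_ms for a ping.
   Returns 1 if at least one ping arrived, 0 if the timeout elapsed,
   and -1 on error. */
int watchdog_wait_for_ping(int timeout_ms) {
    struct pollfd pfd = { .fd = g_wait_fd, .events = POLLIN, .revents = 0 };
    int rc;
    do {
        rc = poll(&pfd, 1, timeout_ms);   /* retried if a signal interrupts */
    } while (rc < 0 && errno == EINTR);
    if (rc <= 0) {
        return rc;
    }
    /* Drain every queued ping so stale ones can't mask a later hang. */
    char buf[16];
    while (recv(g_wait_fd, buf, sizeof buf, MSG_DONTWAIT) > 0) {
    }
    return 1;
}
```

The monitor thread's loop would then be little more than calling watchdog_wait_for_ping with the chosen timeout and launching the helper when it returns 0. The socket, send, poll and recv calls here are all thin wrappers over system calls, which is the property the post is after.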

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

[1] Well, calls libSystem routines that are directly backed by a system call.

[2] Caching the path is a problem if the user moves your app while it’s running. If your app is otherwise resilient to such shenanigans, you could open the helper tool in advance and then get its path (using fcntl with F_GETPATH) immediately before spawning it. That’s a bit more complex, but it should be safe as long as you’re working with a preallocated buffer.

The approach I’m going to suggest is risky. It has many of the same challenges as building a crash reporter, which is something I cover in depth in Implementing Your Own Crash Reporter. If you do go down this path, read that post carefully.

The thing I'd really emphasize here is that what really makes this kind of thing risky is the "weird stuff". My all-time "favorite" investigation was on a kiosk app that would hang in the foreground after running continuously for 6-8 MONTHS. I eventually determined that this was caused by a very slow mach port leak, caused by code the developer had specifically added to track memory use... to try and improve the app.

This is what rules out fancy things like XPC or trying to track state from an external helper. You can build something really cool that way and it may work perfectly... but the worst thing this kind of code can do is introduce a weird/random/rare failure that you'll then spend months fighting.

Similarly, some of the intuitions about what's safe/unsafe here can be counterintuitive. Directly launching a helper process seems more complicated/dangerous than an XPC message. However, you can prep everything in advance (so you're not allocating anything) and posix_spawn itself is a straight syscall. There are a lot of things your app could inadvertently do that would interfere with XPC and VERY few that would break posix_spawn.

IMPORTANT I’m not literally talking about a timer. Rather, you’d do something to deactivate the waiting thread while your main thread is waiting in the run loop.

Double underline on this. IPC is not your friend. Honestly, my first thought was sleep() and a volatile integer. However, maybe a dispatch_semaphore* might actually be a decent approach?

Your main thread can use a sentinel value to tell the monitor thread that it needs to "sleep longer" and your monitor thread then increases its wait time based on that. When the main thread wakes up again, it uses dispatch_semaphore_signal to wake the monitor and the normal monitor cycle resumes. The key here is that you don't think of dispatch_semaphore as a communication API, but as a "fancier" version of sleep that lets you wake up the monitor thread. I'd need to look at the implementation to really be sure about how safe this actually is, but my intuition is that the timeout here isn't really any different than sleep.

*I believe this is in fact the first time I've ever recommended using dispatch_semaphore, so my feelings about this approach are very ambivalent.
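For comparison, here is a rough C sketch of the simpler "sleep() and an integer" variant mentioned above. It deliberately swaps in a C11 atomic for the volatile integer and leaves out dispatch_semaphore entirely, so it is not the semaphore design itself. The names are hypothetical, and the encoding (low bit means "busy", upper bits count busy periods) is just one assumed way to give the monitor a sentinel for "the main thread is idle".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* The monitor must never take a lock, so insist on lock-free atomics. */
static_assert(ATOMIC_INT_LOCK_FREE == 2, "need lock-free int atomics");

/* Low bit: 1 while the main thread is handling work, 0 while it is idle.
   Upper bits: a counter that changes at the start of every busy period. */
static _Atomic unsigned int g_heartbeat = 0;

/* Main thread only: call when the run loop wakes up to do work. */
void heartbeat_mark_busy(void) {
    unsigned int value = atomic_load(&g_heartbeat);
    atomic_store(&g_heartbeat, (value | 1u) + 2u);
}

/* Main thread only: call just before the run loop goes back to sleep. */
void heartbeat_mark_idle(void) {
    unsigned int value = atomic_load(&g_heartbeat);
    atomic_store(&g_heartbeat, value & ~1u);
}

/* Monitor thread: call once per sleep interval. Reports a hang when the
   main thread is busy and still in the same busy period that the
   previous check saw. *last_seen is the monitor's private state. */
bool heartbeat_check(unsigned int *last_seen) {
    unsigned int now = atomic_load(&g_heartbeat);
    bool hung = (now & 1u) != 0 && now == *last_seen;
    *last_seen = now;
    return hung;
}
```

The monitor thread would sleep for the chosen interval, call heartbeat_check, and spawn the helper when it returns true. With this scheme a hang is reported after the main thread has been stuck for somewhere between one and two intervals. The semaphore idea described above would replace the fixed sleep with a wait that the main thread can cut short.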

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
