Crashes after exceeding limit of 150 wakeups per second over 300 seconds

My macOS process crashes after my Mac has been asleep for about an hour. Under Crash Reports, the Console app shows two types of files:

  1. .diag type
  2. .ips type

The .ips file doesn't show the exact line of the crash, as you can see in the sample below.

 Thread 67 Crashed:
0   libsystem_kernel.dylib                   0x19a4aea60 __pthread_kill + 8
1   libsystem_pthread.dylib                  0x19a4e6c20 pthread_kill + 288
2   libsystem_c.dylib                        0x19a3f3a30 abort + 180
3   libsystem_malloc.dylib                   0x19a303dc4 malloc_vreport + 896
4   libsystem_malloc.dylib                   0x19a307430 malloc_report + 64
5   libsystem_malloc.dylib                   0x19a321494 find_zone_and_free + 528
6   Firewall.so                              0x103c8a744 TunnelSendQueue::ResumeSend() + 460

The resource.diag file shows a warning about exceeding the limit of 150 wakeups per second over 300 seconds; it's attached here.

Is this macOS stopping my app because of resource consumption?

Answered by DTS Engineer in 789922022

Splitting my answers up and starting with the crash here:

5 libsystem_malloc.dylib 0x19a321494 find_zone_and_free + 528

libmalloc is one of our open source projects, so you can often find exactly what failed or was printed by looking at the source itself. There is only one matching error message, so find_zone_and_free in malloc.c printed this:

		malloc_report(flags,
				"*** error for object %p: pointer being freed was not allocated\n", ptr);

Which means there is some kind of memory corruption going on.


-Kevin Elliott
DTS Engineer, CoreOS/Hardware

Accepted Answer


Next, on the "resource.diag" file, let me start with the immediate concern here:

Is this macOS stopping my app because of resource consumption?

macOS (or any of our other systems) did not stop your app. As far as the user is concerned, this log didn't have any effect on your app. It's easy to overlook, but what this line means:

Action taken:     none

is that the system generated the log you're looking at but took no other action. There are other situations that generate the same log type, but the log will always say "none" unless we did "something".

Moving to the details of the log itself, I'll start here:

 Wakeups:          45001 wakeups over the last 296 seconds (152 wakeups per second average), exceeding limit of 150 wakeups per second over 300 seconds

Quinn has a good write-up of what "wakeup" means, which is worth reviewing.

That leads to the spindump itself. Spindumps work by capturing a fixed number of thread stacks over a specific period of time. This is what's in the log here:

 Duration Sampled: 80.48s
 Steps:            18

That is, the kernel captured this thread's stack 18 times over an ~80s window. For display purposes, all of those samples are then "merged together" to create the nested stack you're looking at. You'll notice it starts with:

  18  thread_start + 8 (libsystem_pthread.dylib + 7740) [0x18da10e3c]
    18  _pthread_start + 136 (libsystem_pthread.dylib + 28724) [0x18da16034]

...meaning "all 18 thread stacks start with thread_start -> _pthread_start", because that's how a running pthread always starts. The nesting process continues from there, "breaking up" each of those 18 samples into the different code paths each sample followed. Looking at the next level, that shows:

      10  void* std::__1::__thread_proxy>, Mowgli::EventCore::ThreadPool::AddPermanentThread(Mowgli::EventCore::ThreadPool::PermanentThreadId&, std::__1::function const&, std::__1::function const&)::$_1>>(void*) + 84 (libmowglicore.dylib + 6842980) [0x105bbea64]

      6   void* std::__1::__thread_proxy>, Mowgli::Log::LogList::StartThreading()::$_1>>(void*) + 84 (libmowglicore.dylib + 4978248) [0x1059f7648]

      1   void* std::__1::__thread_proxy>, Mowgli::EventCore::ThreadPool::Start()::$_0>>(void*) + 84 (libmowglicore.dylib + 6824596) [0x105bba294]

      1   boost::(anonymous namespace)::thread_proxy(void*) + 180 (libboost_thread.dylib + 10552) [0x100eaa938]

In other words, you have the same 18 samples (10 + 6 + 1 + 1), but split across 4 different code paths instead of 1.

Taking that down one more level, following the "10" above, you have another "direct" nested chain:

      10  void* std::__1::__thread_proxy>, Mowgli::EventCore::ThreadPool::AddPermanentThread(Mowgli::EventCore::ThreadPool::PermanentThreadId&, std::__1::function const&, std::__1::function const&)::$_1>>(void*) + 84 (libmowglicore.dylib + 6842980) [0x105bbea64]
        10  void std::__1::__thread_execute>, Mowgli::EventCore::ThreadPool::AddPermanentThread(Mowgli::EventCore::ThreadPool::PermanentThreadId&, std::__1::function const&, std::__1::function const&)::$_1>(std::__1::tuple>, Mowgli::EventCore::ThreadPool::AddPermanentThread(Mowgli::EventCore::ThreadPool::PermanentThreadId&, std::__1::function const&, std::__1::function const&)::$_1>&, std::__1::__tuple_indices<>) + 32 (libmowglicore.dylib + 6844020) [0x105bbee74]
          10  decltype(static_cast const&, std::__1::function const&)::$_1>(fp)()) std::__1::__invoke const&, std::__1::function const&)::$_1>(Mowgli::EventCore::ThreadPool::AddPermanentThread(Mowgli::EventCore::ThreadPool::PermanentThreadId&, std::__1::function const&, std::__1::function const&)::$_1&&) + 24 (libmowglicore.dylib + 6844056) [0x105bbee98]
            10  Mowgli::EventCore::ThreadPool::AddPermanentThread(Mowgli::EventCore::ThreadPool::PermanentThreadId&, std::__1::function const&, std::__1::function const&)::$_1::operator()() const + 24 (libmowglicore.dylib + 6844152) [0x105bbeef8]
              10  std::__1::function::operator()() const + 24 (libmowglicore.dylib + 5535108) [0x105a7f584]
                10  std::__1::__function::__value_func::operator()() const + 60 (libmowglicore.dylib + 5558896) [0x105a85270]

Which then splits into:


                  7   std::__1::__function::__func const&)::$_2, std::__1::allocator const&)::$_2>, void ()>::operator()() + 28 (libmowglicore.dylib + 6585220) [0x105b7fb84]

                  3   std::__1::__function::__func const&)::$_4, std::__1::allocator const&)::$_4>, void ()>::operator()() + 28 (libmowglicore.dylib + 6599464) [0x105b83328]

...this time splitting up the 10 samples (7 + 3) into two code paths.

Whether or not this kind of activity represents a "problem" for your app depends a lot on what's actually going on. The total count (45001) does sound problematic, but if you divide out the rate (45001/296) you're only slightly over the limit (~152 vs. 150), and I've seen much worse cases. However, the real answer here is that it depends on what your app was actually "doing" that put it over this limit. Many of the apps that hit this do so ENTIRELY by accident. For example, I've seen a case where the app was messaging between threads, but the messaging was being done for every single transaction/transfer/modification instead of being "throttled" by wall time or any other constraint.

I've fallen into this trap myself. About twenty years ago I was working on an app that implemented its own copy engine when a coworker "hinted" that something seemed off about my overall copy performance. Once I shifted to only sending progress updates every 0.1s, copies were suddenly 5x faster.

Similarly, there may be space for optimization and tuning, even if the basic "job" you're trying to do will inherently involve frequent wakeups.


-Kevin Elliott
DTS Engineer, CoreOS/Hardware
