Symbolicating kernel backtraces on Apple Silicon

I am able to symbolicate kernel backtraces for addresses that belong to my kext.

Is it possible to symbolicate kernel backtraces for addresses that lie beyond my kext and reference kernel code?

Answered by DTS Engineer in 836958022

Let me step back to the beginning and explain this in a bit more detail:

  1. When you use "-l", what atos is actually "getting" there is the load address of the first __TEXT segment in the executable.

  2. On arm64, (very generally) things shifted so that code now loads out of __TEXT_EXEC, NOT __TEXT.

SO, to symbolicate correctly, you need to shift the load address to account for that. Starting with loadable KEXTs, this is fairly easy. Their __TEXT segment is at 0, so the math here is:

(__TEXT_EXEC.vmaddr) - (__TEXT.vmaddr) ->

KEXTs also tend to have a fixed opening __TEXT size, so the value of __TEXT_EXEC.vmaddr ends up being "0x4000":

0x4000 - 0 -> 0x4000

That is where my original command came from:

atos -arch arm64e -o <symbol file path> -l <KEXT load address - 0x4000> <address to symbolicate>

NOTE: My earlier post was wrong; you need to subtract, not add.
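As a quick sanity check, the subtraction can be sketched with shell arithmetic. The load address below is the example one from the panic log further down; the 0x4000 slide assumes the usual fixed-size kext __TEXT segment (verify yours with otool -l):

```shell
# Sketch of the kext load-address adjustment for atos -l.
# kext_load comes from the "Kernel Extensions in backtrace" line of the panic log.
kext_load=0xfffffe001203bfd0
slide=0x4000   # __TEXT_EXEC.vmaddr - __TEXT.vmaddr (which is 0 for a kext)
printf 'atos -l address: 0x%x\n' $(( kext_load - slide ))
# prints: atos -l address: 0xfffffe0012037fd0
```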

For a KEXT, that load address comes directly from the panic log. Here is what that looks like in the panic from the other thread on this issue:

          lr: 0xfffffe0013723894  fp: 0x0000000000000000
      Kernel Extensions in backtrace:
         com.company.product(1.4.21d119)[92BABD94-80A4-3F6D-857A-3240E4DA8009]@0xfffffe001203bfd0->0xfffffe00120533ab
            dependency: com.apple.iokit.IOPCIFamily(2.9)[6D6666E6-340F-3A5E-9464-DE05164C0658]@0xfffffe0015e65e90->0xfffffe0015e93b3f

Now, the math is actually exactly the same for the kernel, but the numbers are totally different. Running otool -l on kernel.release.t8112 gets you:

__TEXT.vmaddr -> 
...
Load command 0
      cmd LC_SEGMENT_64
  cmdsize 552
  segname __TEXT
   vmaddr 0xfffffe0007004000
   vmsize 0x0000000000100000
...

AND:

__TEXT_EXEC.vmaddr:
...
Load command 2
      cmd LC_SEGMENT_64
  cmdsize 312
  segname __TEXT_EXEC
   vmaddr 0xfffffe00072b8000
   vmsize 0x00000000008f0000
...

(You may notice there are multiple __TEXT_EXEC sections; in this case, all we care about is the first.)
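If you want to script the lookup, an awk filter like this pulls the first __TEXT_EXEC vmaddr out of otool -l output. The here-doc below is just the excerpt from above; in practice you would pipe in "otool -l kernel.release.t8112":

```shell
# Extract the first __TEXT_EXEC vmaddr from `otool -l` output.
awk '/segname __TEXT_EXEC/ { found = 1 }
     found && /vmaddr/     { print $2; exit }' <<'EOF'
  segname __TEXT
   vmaddr 0xfffffe0007004000
   vmsize 0x0000000000100000
  segname __TEXT_EXEC
   vmaddr 0xfffffe00072b8000
   vmsize 0x00000000008f0000
EOF
# prints: 0xfffffe00072b8000
```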

SO, plugging that into our math above:

(__TEXT_EXEC.vmaddr) - (__TEXT.vmaddr) ->
0xfffffe00072b8000 - 0xfffffe0007004000 -> 0x2B4000

That number is then subtracted from:

Kernel text exec base:  0xfffffe001cd8c000

0xfffffe001cd8c000 - 0x2B4000 -> 0xFFFFFE001CAD8000
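The same arithmetic, sketched as shell integer math (the values are the ones from this particular panic and KDK; yours will differ):

```shell
# Kernel load-address math for atos -l, using the values above.
text_vmaddr=0xfffffe0007004000           # __TEXT.vmaddr (from otool -l)
text_exec_vmaddr=0xfffffe00072b8000      # first __TEXT_EXEC.vmaddr (from otool -l)
kernel_text_exec_base=0xfffffe001cd8c000 # "Kernel text exec base" from the panic log

offset=$(( text_exec_vmaddr - text_vmaddr ))
printf 'offset:          0x%x\n' "$offset"
printf 'atos -l address: 0x%x\n' $(( kernel_text_exec_base - offset ))
# prints: offset:          0x2b4000
# prints: atos -l address: 0xfffffe001cad8000
```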

Plugging that value into your atos command:

atos -o kernel.release.t8112 -arch arm64e -l 0xFFFFFE001CAD8000 0xfffffe001cde8ab8 0xfffffe001cf44894 0xfffffe001cf42a90 0xfffffe001cd938c0 0xfffffe001cde83b0 0xfffffe001d66d670 0xfffffe001d86a708 0xfffffe001d86941c 0xfffffe001d866504 0xfffffe001e234bc8 0xfffffe001cf4611c 0xfffffe001cd93938 0xfffffe001cee0a80 0xfffffe001cee0a80 0xfffffe001d38e408 0xfffffe001d3e9304 0xfffffe001ce8c24c 0xfffffe001cedf514 0xfffffe001cd9cc04

Gives us:

handle_debugger_trap (in kernel.release.t8112) (debug.c:1839)
handle_uncategorized (in kernel.release.t8112) (sleh.c:1591)
sleh_synchronous (in kernel.release.t8112) (sleh.c:1477)
fleh_synchronous (in kernel.release.t8112) + 44
panic_trap_to_debugger (in kernel.release.t8112) (debug.c:1400)
panic_with_options_and_initiator (in kernel.release.t8112) (debug.c:1213)
0xfffffe001d86a708
0xfffffe001d86941c
0xfffffe001d866504
0xfffffe001e234bc8
sleh_irq (in kernel.release.t8112) (sleh.c:3840)
fleh_irq (in kernel.release.t8112) + 64
vm_object_upl_request (in kernel.release.t8112) (vm_pageout.c:6213)
vm_object_upl_request (in kernel.release.t8112) (vm_pageout.c:6213)
ubc_create_upl_kernel (in kernel.release.t8112) (ubc_subr.c:2643)
vnode_pageout (in kernel.release.t8112) (vnode_pager.c:407)
vnode_pager_data_return (in kernel.release.t8112) (bsd_vm.c:451)
vm_pageout_iothread_external_continue (in kernel.release.t8112) (vm_pageout.c:4237)
Call_continuation (in kernel.release.t8112) + 196

Note the references to "panic...", which are an easy way to validate that the symbolication is correct. By its nature, the kernel cannot "crash" in the same sense any other process can, as the kernel IS the core component that controls all of the mechanisms that cause OTHER processes to crash. Because of that, "all" kernel panics end in a call to "panic", because that's the function that generates the log data you're looking at and then stops normal execution.

Anticipating your question, no, I don't know why the other 4 frames aren't symbolicating, but I don't think it's an issue with atos. Apple obviously has our own internal symbolication tools, and they're getting exactly the same result. Are they from your KEXT, and did you just not include the KEXT address info?

Having said that, they also don't really matter for this particular panic. That's because of this message at the top of the log:

panic(cpu 4 caller 0xfffffe001d86a708): watchdog timeout: no checkins from watchdogd in 90 seconds (25 total checkins since monitoring last enabled)

This message basically means exactly what it sounds like. There's a system in place (managed by "watchdogd") that requires specific, critical user-space daemons to "check in" with the kernel. If they fail to do so, the system times out and the kernel panics. That dynamic also means the specific stack trace is largely irrelevant, as it simply shows what happened to be executing at the time the panic occurred, NOT what actually caused the panic.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Based on the symbolicated backtraces, there are no references to where the memory is being used after being freed, or where it's being freed.

There's just this reference to where it's being allocated:

['atos', '-o', '/Users/user/panic/Kernel_Debug_Kit_15.6_build_24G84.dmg_extracted/KDK.pkg/Payload/System/Library/Kernels/kernel.release.t8103.dSYM/Contents/Resources/DWARF/kernel.release.t8103', '-arch', 'arm64e', '-textExecAddress', '0xfffffe0025c50000', '0xfffffe0025cae7c0', '0xfffffe0026558328', '0xfffffe002655e730', '0xfffffe0025d30f08', '0xfffffe0025cc21c0', '0xfffffe002620afe4', '0xfffffe00243f717c', '0xfffffe00243f7110', '0xfffffe00243df12c', '0xfffffe00243df610', '0xfffffe0024408bc0', '0xfffffe0025ee60cc', '0xfffffe0025ecdf08', '0xfffffe0025eceee4', '0xfffffe002635e9b8', '0xfffffe0025e0fff0', '0xfffffe0025c57b88']
panic_trap_to_debugger (in kernel.release.t8103) (debug.c:1400)
kalloc_type_distribute_budget (in kernel.release.t8103) (kalloc.c:1343)
kmem_init (in kernel.release.t8103) (vm_kern.c:4584)
zone_early_scramble_rr (in kernel.release.t8103) (zalloc.c:3220)
kalloc_ext (in kernel.release.t8103) (kalloc.c:2520)
mprotect (in kernel.release.t8103) (kern_mman.c:1311)
0xfffffe00243f717c
0xfffffe00243f7110
0xfffffe00243df12c
0xfffffe00243df610
0xfffffe0024408bc0
vn_pathconf (in kernel.release.t8103) (vfs_vnops.c:1875)
openat_dprotected_internal (in kernel.release.t8103) (vfs_syscalls.c:5127)
mknod (in kernel.release.t8103) (vfs_syscalls.c:5469)
sendsig (in kernel.release.t8103) (unix_signal.c:626)
sleh_synchronous (in kernel.release.t8103) (sleh.c:1267)
fleh_synchronous (in kernel.release.t8103) + 24

The unsymbolicated memory addresses come from my KEXT where memory is allocated via OSMalloc(). The object being allocated ends up being stored in an RB tree.

I'd like to take you up on your offer to take a look at the full panic log for me if I may.

Please find the full panic log attached to this ticket https://feedbackassistant.apple.com/feedback/21231833.

Alright. So, here is a rundown of what I found:

(1) There are five different threads blocked in some variation of this:

"kernelFrames": [
	["thread_invoke", "kernel.release.t8103", "(sched_prim.c:3530)", "0xfffffe0008a6b0e8", 1180], 
	["thread_block_reason", "kernel.release.t8103", "(sched_prim.c:4147)", "0xfffffe0008a69b5c", 244], 
	["lck_mtx_sleep_deadline", "kernel.release.t8103", "(locks.c:1224)", "0xfffffe0008a526a0", 236], 
	["_sleep", "kernel.release.t8103", "(kern_synch.c:206)", "0xfffffe0008fb145c", 176], 
...
	[11,342632], 
	["Call_continuation", "kernel.release.t8103", null, "0xfffffe00089e4f0c", 268]
], 

The intermediate frames going through your code vary, but they all start at the entry point above and block in sleep. If you search for "[11,342632]" in the raw panic, you should be able to find all 5 and reconstruct them from there.

(2) All other activity came from Preview(2407). There were three threads tied to your KEXT:

  1. One thread is handling I/O from the VM system, triggered by a memmove, probably tied to memory-mapped file I/O:
"userFrames": [
	["_platform_memmove", "libsystem_platform.dylib", null, "0x1804ad320", 96], 
	["CGPDFSourceRefill", "CoreGraphics", "(CGPDFSource.c:464)", "0x1869dfff8", 164], 

"kernelFrames": [
	["thread_invoke", "kernel.release.t8103", "(sched_prim.c:3530)", "0xfffffe0008a6b0e8", 1180], 
	["thread_block_reason", "kernel.release.t8103", "(sched_prim.c:4147)", "0xfffffe0008a69b5c", 244], 
	["lck_rw_lock_shared_gen", "kernel.release.t8103", "(lock_rw.c:1375)", "0xfffffe0008a4f404", 892], 
	["IORWLockRead", "kernel.release.t8103", "(lock_rw.c:1524)", "0xfffffe0008a4f030", 144], 
	[11, 315180], 
	[11, 316328], 
	[11, 70956], 
	[11, 70608], 
	[11, 236500], 
	["cluster_io", "kernel.release.t8103", "(vfs_cluster.c:2336)", "0xfffffe0008c1fb74", 3512], 
	["cluster_pagein_ext", "kernel.release.t8103", "(vfs_cluster.c:2736)", "0xfffffe0008c219b8", 244], 
	["cluster_pagein", "kernel.release.t8103", "(vfs_cluster.c:2668)", "0xfffffe0008c218b8", 28], 
	[11, 237688], 
	["vnode_pagein", "kernel.release.t8103", "(vnode_pager.c:797)", "0xfffffe000904c67c", 1068], 
	["vnode_pager_cluster_read", "kernel.release.t8103", "(bsd_vm.c:860)", "0xfffffe0008ada1b4", 76], 
	["vm_fault_page", "kernel.release.t8103", "(vm_fault.c:2019)", "0xfffffe0008af3c88", 2348], 
	["vm_fault_internal", "kernel.release.t8103", "(vm_fault.c:6299)", "0xfffffe0008aef600", 6688], 
	["handle_user_abort", "kernel.release.t8103", "(sleh.c:2981)", "0xfffffe0008b94bdc", 420], 
	["sleh_synchronous", "kernel.release.t8103", "(sleh.c:1271)", "0xfffffe0008b94030", 944], 
	["fleh_synchronous", "kernel.release.t8103", null, "0xfffffe00089dbb88", 24], 
	[2, 0]
], 
  2. Here is the panic itself, which was triggered by an open() call. One detail I will share is that the open itself wasn't actually opening the file contents (it was going to read an xattr), so the flags might be considered unusual:
open(..., O_NOFOLLOW|O_RDONLY|O_EVTONLY);

"userFrames": [
	["__open", "libsystem_kernel.dylib", null, "0x180432678", 8], 


"kernelFrames": [
	["panic_trap_to_debugger", "kernel.release.t8103", "(debug.c:1400)", "0xfffffe0008a327c0", 780], 
	["panic", "kernel.release.t8103", "(debug.c:1161)", "0xfffffe00092dc328", 60], 
	["zalloc_uaf_panic", "kernel.release.t8103", "(zalloc.c:3173)", "0xfffffe00092e2730", 236], 
	["zalloc_item", "kernel.release.t8103", "(zalloc.c:6760)", "0xfffffe0008ab4f08", 1440], 
	["kalloc_ext", "kernel.release.t8103", "(kalloc.c:2520)", "0xfffffe0008a461c0", 296], 
	["OSMalloc", "kernel.release.t8103", "(kern_malloc.c:270)", "0xfffffe0008f8efe4", 168], 
	[11, 160124], 
	[11, 160016], 
	[11, 61740], 
	[11, 62992], 
	[11, 232384], 
	["vn_open_auth", "kernel.release.t8103", "(vfs_vnops.c:705)", "0xfffffe0008c6a0cc", 2404], 
	["open1", "kernel.release.t8103", "(vfs_syscalls.c:4812)", "0xfffffe0008c51f08", 488], 
	["open", "kernel.release.t8103", "(vfs_syscalls.c:5313)", "0xfffffe0008c52ee4", 288], 
	["unix_syscall", "kernel.release.t8103", "(systemcalls.c:181)", "0xfffffe00090e29b8", 772], 
	["sleh_synchronous", "kernel.release.t8103", "(sleh.c:1267)", "0xfffffe0008b93ff0", 880], 
	["fleh_synchronous", "kernel.release.t8103", null, "0xfffffe00089dbb88", 24], 
	[2, 0]
], 
  3. Another thread is also blocked in open. That open appears to be an O_RDONLY call tied to bookmark creation:
"kernelFrames": [
	["thread_invoke", "kernel.release.t8103", "(sched_prim.c:3530)", "0xfffffe0008a6b0e8", 1180], 
	["thread_block_reason", "kernel.release.t8103", "(sched_prim.c:4147)", "0xfffffe0008a69b5c", 244], 
	["lck_rw_sleep", "kernel.release.t8103", "(lock_rw.c:2683)", "0xfffffe0008a509c4", 132], 
	["vm_object_update", "kernel.release.t8103", "(memory_object.c:1024)", "0xfffffe0008aecc64", 2296], 
	["memory_object_lock_request", "kernel.release.t8103", "(memory_object.c:352)", "0xfffffe0008aec228", 188], 
	["ubc_msync", "kernel.release.t8103", "(ubc_subr.c:1818)", "0xfffffe0008fef174", 188], 
	[11, 232440], 
	["vn_open_auth", "kernel.release.t8103", "(vfs_vnops.c:705)", "0xfffffe0008c6a0cc", 2404], 
	["open1", "kernel.release.t8103", "(vfs_syscalls.c:4812)", "0xfffffe0008c51f08", 488], 
	["open", "kernel.release.t8103", "(vfs_syscalls.c:5313)", "0xfffffe0008c52ee4", 288], 
	["unix_syscall", "kernel.release.t8103", "(systemcalls.c:181)", "0xfffffe00090e29b8", 772], 
	["sleh_synchronous", "kernel.release.t8103", "(sleh.c:1267)", "0xfffffe0008b93ff0", 880], 
	["fleh_synchronous", "kernel.release.t8103", null, "0xfffffe00089dbb88", 24], 
	[2, 0]
], 

  4. I also noticed that Preview has another thread blocked in vn_open_auth, this time waiting on a response from EndpointSecurity:

"kernelFrames": [
	["thread_invoke", "kernel.release.t8103", "(sched_prim.c:3530)", "0xfffffe0008a6b0e8", 1180], 
	["thread_block_reason", "kernel.release.t8103", "(sched_prim.c:4147)", "0xfffffe0008a69b5c", 244], 
	["lck_mtx_sleep", "kernel.release.t8103", "(locks.c:1169)", "0xfffffe0008a524a0", 232], 
	...
	["EndpointSecurityEventManager::es_vnode_check_open(ucred*, vnode*, label*, int)", "EndpointSecurity", "(esk_event_manager.cpp:1709)", "0xfffffe000a679348", 68], 
	["mac_vnode_check_open", "kernel.release.t8103", "(mac_vfs.c:1637)", "0xfffffe00092b9368", 204], 
	["vn_authorize_open_existing", "kernel.release.t8103", "(vfs_subr.c:8349)", "0xfffffe0008c447d0", 140], 
	["vn_open_auth", "kernel.release.t8103", "(vfs_vnops.c:669)", "0xfffffe0008c69f3c", 2004], 
	["open1", "kernel.release.t8103", "(vfs_syscalls.c:4812)", "0xfffffe0008c51f08", 488], 
	["open", "kernel.release.t8103", "(vfs_syscalls.c:5313)", "0xfffffe0008c52ee4", 288], 
	["unix_syscall", "kernel.release.t8103", "(systemcalls.c:181)", "0xfffffe00090e29b8", 772], 
	["sleh_synchronous", "kernel.release.t8103", "(sleh.c:1267)", "0xfffffe0008b93ff0", 880], 
	["fleh_synchronous", "kernel.release.t8103", null, "0xfffffe00089dbb88", 24], 
	[2, 0]
], 

(3) Covering a few odds and ends:

  • I get concerned anytime I see EndpointSecurity. The API basically allows a user space client to "veto" a broad variety of critical syscalls which, as you can imagine, opens the door to all sorts of mischief. Critically, alongside the obvious risk of things failing unexpectedly, it also introduces timing changes which could have other, unforeseen consequences. It could be totally unrelated to your issue, but I'd at least make sure I knew what "else" was happening on the machine.

  • I don't really see very much "else" happening on the system. The OSMalloc call that panicked appears to be the only memory-related activity on the system. That surprised me enough that I searched for "alloc" (looking for malloc/vm_allocate/etc.) and basically found "nothing". Similarly, the threads above are the only open calls happening on the entire system.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks very much for looking into this for me.

I'll take a closer look at what I'm doing with the values stored in the RB tree in my filesystem.

I might need to resort to interprocess synchronisation to guard the critical sections in that code.
