For debugging a device driver or indeed any code that resides in the kernel, two computers are a necessity. Buggy kernel code has a nasty tendency to crash or hang a system, and so directly debugging that system is often impossible.
Two-machine debugging with gdb is the main pathway to finding bugs in driver code. This section takes you through the procedure for setting up two computers for debugging, offers a few tips on using gdb in kernel code, introduces you to the kernel debugging macros, and discusses techniques for finding bugs causing kernel panics and hangs.
Setting Up for Two-Machine Debugging
Using the Kernel Debugging Macros
Tips on Using gdb
Debugging Kernel Panics
Debugging System Hangs
Debugging Boot Drivers
This section summarizes the steps required to set up two computers for kernel debugging. It draws heavily on the following documents, which you should refer to for more detailed information:
The tutorial Hello Debugger: Debugging a Device Driver with GDB. (This tutorial is part of Kernel Extension Programming Topics.)
The “When Things Go Wrong” section of Building and Debugging Kernels in Kernel Programming Guide. (This document is available on-line in Darwin Documentation.)
In two-machine debugging, one system is called the target computer and the other the host (or development) computer. The host computer is the computer that actually runs gdb. It is typically the computer on which your driver is developed, which is why it is also called the development computer. The target computer is the system on which the driver to be debugged is run. Your host and target computers should be running the same version of the Darwin kernel, or as close as possible to the same version. (Of course, if you’re debugging a panic-prone version of the kernel, you’ll want the host computer to run the most recent stable version of Darwin.) For optimal source-level debugging, the host computer should have the source code of the driver, any kernel extensions related to your driver (such as its client or provider), and perhaps even the kernel itself (/xnu).
In order for two-machine debugging to be feasible, the following must be true:
For versions of Mac OS X before Mac OS X version 10.2, both computers must be on the same subnet.
You must have login access to both computers as an administrator (group admin), because you’ll need root privileges to load your KEXT (you can use the sudo(8) command).
You must be able to copy files between the computers using FTP, scp (SSH), rsync, AFP, or similar protocol or tool.
Note: The following steps include instructions on setting up a permanent network connection via ARP (step 3). This is unnecessary if you are running Mac OS X v. 10.2 or later on both machines and if you set the NVRAM debug variable to 0x144 (as in step 1). This configuration allows you to set up two-machine debugging on two computers that are not necessarily on the same subnet. If you are running an earlier version of Mac OS X, however, you do need to follow step 3 and both computers must be on the same subnet.
When all this is in place, complete the following steps:
Target Set the NVRAM debug variable to 0x144, which lets you drop into the debugger upon a non-maskable interrupt (NMI) and, if you’re running Mac OS X v. 10.2 or later, lets you debug two computers not on the same subnet. You can use setenv to set the flag within Open Firmware itself (on PowerPC-based Macintosh computers), or you can use the nvram utility. For the latter, enter the following as root at the command line:
nvram boot-args="debug=0x144"
It’s a good idea to enter nvram boot-args (no argument) first to get any current NVRAM variables in effect; then include these variables along with the debug flag when you give the nvram boot-args command a second time. Reboot the system.
Note: If your target machine contains a PMU (for example, a PowerBook G4 or an early G5 desktop computer), you may find that it shuts down when you exit kernel debugging mode. One reason for this is that if a breakpoint causes kernel debugger entry in the middle of a PMU transaction, the PMU watchdog may trigger on a timeout and cause the machine to shut down. If you experience this, you may find that forcing the PMU driver to operate in polled mode fixes the problem. To do this, set the NVRAM variable pmuflags to 1, as shown below:
nvram boot-args="pmuflags=1"
You can set the pmuflags variable separately, as shown above, or you can set it at the same time you set the debug variable, as shown below:
nvram boot-args="debug=0x144 pmuflags=1"
Host or Target Copy the driver (or any other kernel extension) to a working directory on the target computer.
Host Set up a permanent network connection to the target computer via ARP. The following example assumes that your test computer is target.goober.com:
$ ping -c 1 target.goober.com
ping results: ....
$ arp -an
target.goober.com (10.0.0.69): 00:a0:13:12:65:31
$ arp -s target.goober.com 00:a0:13:12:65:31
$ arp -an
target.goober.com (10.0.0.69) at 00:a0:13:12:65:31 permanent
This sequence of commands establishes a connection to the target computer (via ping), displays the information on recent connections ARP knows about (arp -an), makes the connection to the target computer permanent by specifying the Ethernet hardware address (arp -s), and issues the arp -an command a second time to verify this.
Target Create symbol files for the driver and any other kernel extensions it depends on. First create a directory to hold the symbols; then run the kextload command-line tool, specifying the directory as the argument of the -s option:
$ mkdir /tmp/symbols
$ kextload -l -s /tmp/symbols /tmp/MyDriver.kext
This command loads MyDriver.kext but, because of the -l option, doesn’t start the matching process yet (that happens in a later step). If you don’t want the driver to load just yet, specify the -n option along with the -s option. See “Using kextload, kextunload, and kextstat” for the kextload procedure for debugging a driver’s start-up code.
Target or Host Copy the symbol files to the host computer.
Host Optionally, if you want to debug your driver with access to all the symbols in the kernel, obtain or build a symboled kernel. For further information, contact Apple Developer Technical support. You can find the instructions for building the Darwin kernel from the Open Source code in the Building and Debugging Kernels in Kernel Programming Guide.
Host Run gdb on the kernel.
$ gdb /mach_kernel
If you have a symboled kernel, specify the path to it rather than /mach_kernel. It is important that you run gdb on a kernel of the same version and build as the one that runs on the target computer. If the versions are different, you should obtain a symboled copy of the target’s kernel and use that.
Host In gdb, add the symbol file of your driver.
(gdb) add-symbol-file /tmp/symbols/com.acme.driver.MyDriver.sym
Add the symbol files of the other kernel extensions in your driver’s dependency chain.
Host Tell gdb that you will be debugging remotely.
(gdb) target remote-kdp
Target Break into kernel debugging mode. Depending on the model of your target system, either issue the appropriate keyboard command or press the programmer’s button. On USB keyboards, hold down the Command key and the Power button; on ADB keyboards, hold down the Control key and the Power button. If you’re running Mac OS X version 10.4 or later, hold down the following five keys: Command, Option, Control, Shift, and Escape.
You may have to hold down the keys or buttons for several seconds until you see the “Waiting for remote debugger connection” message.
Host Attach to the target computer and set breakpoints.
(gdb) attach target.goober.com
(gdb) break 'MyDriverClass::WriteData(char *)'
(gdb) continue
Be sure you give the continue command; otherwise the target computer remains unresponsive.
Target Start the driver running.
$ kextload -m -t /tmp/MyDriver.kext
The -m option starts the matching process for the driver. The -t option, which tells kextload to conduct extensive validation checks, is really optional here; ideally, your driver should have passed these checks during an earlier stage of debugging (see “Using kextload, kextunload, and kextstat”). After starting the driver, perform the actions necessary to trigger the breakpoint.
Host When the breakpoint you set is triggered, you can begin debugging your driver using gdb commands. If you “source” the kernel debugging macros (see the following section, “Using the Kernel Debugging Macros”), you can use those as well.
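Step 1 above advises reading the current boot-args value first and including its settings when you add the debug flag. The following Python sketch illustrates that merge. The helper names (merge_boot_args, nvram_command) are hypothetical, and the "-v" value is only an illustrative existing setting; the code builds the command string and does not run nvram itself.

```python
def merge_boot_args(current, updates):
    """Return a boot-args string that keeps existing settings and applies updates.

    current -- the string reported by `nvram boot-args` (may be empty)
    updates -- dict of variable name to new value, e.g. {"debug": "0x144"}
    """
    remaining = dict(updates)
    merged = []
    for token in current.split():
        key = token.partition("=")[0]
        if key in remaining:
            # Replace the old value of a variable that is being updated.
            merged.append(f"{key}={remaining.pop(key)}")
        else:
            # Keep every other existing setting untouched.
            merged.append(token)
    # Append variables that were not present before.
    merged.extend(f"{key}={value}" for key, value in remaining.items())
    return " ".join(merged)


def nvram_command(boot_args):
    """Format the command line to enter as root on the target."""
    return f'nvram boot-args="{boot_args}"'


if __name__ == "__main__":
    existing = "-v pmuflags=1"  # illustrative value, not taken from a real system
    print(nvram_command(merge_boot_args(existing, {"debug": "0x144"})))
```

Run as a script, this prints a single nvram command that preserves the existing settings and adds debug=0x144.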
Apple includes a set of kernel debugging macros as part of Darwin. They have been written by engineers with an intimate knowledge of how the Darwin kernel works. Although it is possible to debug driver code without these macros, they will make the task much easier.
The kernel debugging macros probe the internal structures of a running Mac OS X system in considerable depth. With them you can get summary and detailed snapshots of tasks and their threads in the kernel, including such information as thread priority, executable names, and invoked functions. The kernel debugging macros also yield information on the kernel stacks for all or selected thread activations, on IPC spaces and port rights, on virtual-memory maps and map entries, and on allocation zones. See Table 7-2 for a summary of the kernel debugging macros.
Important: The kernel debugging macros described in this section will not work unless you have a symboled kernel. You can either build the Darwin kernel from the open source (see the Building and Debugging Kernels in Kernel Programming Guide for details) or you can refer to the Kernel Debug Kit, available at http://developer.apple.com/sdk, which includes a copy of the kernel debug macros.
You can obtain the kernel debugging macros from the Darwin Open Source repository. They are in the .gdbinit file in the /xnu/osfmk branch of the source tree. Because .gdbinit is the standard name of the initialization file for gdb, you might already have your own .gdbinit file to set up your debugging sessions. If this is the case, you can combine the contents of the files or have a “source” statement in one .gdbinit file that references the other file. To include the macros in a .gdbinit file for a debugging session, specify the following gdb command shortly after running gdb on mach_kernel:
(gdb) source /tmp/.gdbinit
(In this example, /tmp represents any directory that holds the copy of the .gdbinit file you obtained from the Open Source repository.) Because the kernel debugging macros can change between versions of the kernel, make sure that you use the macros that match as closely as possible the version of the kernel you’re debugging.
Table 7-2 Kernel debugging macros
| Macro | Description |
|---|---|
| showalltasks | Displays a summary listing of tasks |
| showallacts | Displays a summary listing of all activations |
| showallstacks | Displays the kernel stacks for all activations |
| showallvm | Displays a summary listing of all the VM maps |
| showallvme | Displays a summary listing of all the VM map entries |
| showallipc | Displays a summary listing of all the IPC spaces |
| showallrights | Displays a summary listing of all the IPC rights |
| showallkmods | Displays a summary listing of all the kernel extension binaries |
| showtask | Displays status of the specified task |
| showtaskacts | Displays the status of all activations in the task |
| showtaskstacks | Displays all kernel stacks for all activations in the task |
| showtaskvm | Displays status of the specified task's VM map |
| showtaskvme | Displays a summary list of the task's VM map entries |
| showtaskipc | Displays status of the specified task's IPC space |
| showtaskrights | Displays a summary list of the task's IPC space entries |
| showact | Displays status of the specified thread activation |
| showactstack | Displays the kernel stack for the specified activation |
| showmap | Displays the status of the specified VM map |
| showmapvme | Displays a summary list of the specified VM map's entries |
| showipc | Displays the status of the specified IPC space |
| showrights | Displays a summary list of all the rights in an IPC space |
| showpid | Displays the status of the process identified by PID |
| showproc | Displays the status of the process identified by a proc pointer |
| showkmod | Displays information about a kernel extension binary |
| showkmodaddr | Given an address, displays the kernel extension binary and offset |
| zprint | Displays zone information |
| paniclog | Displays the panic log information |
| switchtoact | Switches thread context |
| switchtoctx | Switches context |
| resetctx | Resets context |
A subset of the kernel debugging macros are particularly useful for driver writers: showallstacks, switchtoact, showkmodaddr, showallkmods, and switchtoctx. The output of showallstacks lists all tasks in the system and, for each task, the threads and the stacks associated with each thread. Listing 7-5 shows the information on a couple of tasks as emitted by showallstacks.
Listing 7-5 Example thread stacks shown by showallstacks
(gdb) showallstacks
...
task vm_map ipc_space #acts pid proc command
0x00c1e620 0x00a79a2c 0x00c10ce0 2 51 0x00d60760 kextd
activation thread pri state wait_queue wait_event
0x00c2a1f8 0x00ccab0c 31 W 0x00c9fee8 0x30a10c <ipc_mqueue_rcv>
continuation=0x1ef44 <ipc_mqueue_receive_continue>
activation thread pri state wait_queue wait_event
0x00c29a48 0x00cca194 31 W 0x00310570 0x30a3a0 <kmod_cmd_queue>
kernel_stack=0x04d48000
stacktop=0x04d4bbe0
0x04d4bbe0 0xccab0c
0x04d4bc40 0x342d8 <thread_invoke+1104>
0x04d4bca0 0x344b4 <thread_block_reason+212>
0x04d4bd00 0x334e0 <thread_sleep_fast_usimple_lock+56>
0x04d4bd50 0x81ee0 <kmod_control+248>
0x04d4bdb0 0x45f1c <_Xkmod_control+192>
0x04d4be00 0x2aa70 <ipc_kobject_server+276>
0x04d4be50 0x253e4 <mach_msg_overwrite_trap+2848>
0x04d4bf20 0x257e0 <mach_msg_trap+28>
0x04d4bf70 0x92078 <.L_syscall_return>
0x04d4bfc0 0x10000000
stackbottom=0x04d4bfc0
task vm_map ipc_space #acts pid proc command
0x00c1e4c0 0x00a79930 0x00c10c88 1 65 0x00d608c8 update
activation thread pri state wait_queue wait_event
0x00ddaa50 0x00ddbe34 31 W 0x00310780 0xd608c8 <rld_env+10471956>
continuation=0x1da528 <_sleep_continue>
The typical number of stacks revealed by showallstacks runs into the dozens. Most of the threads associated with these stacks are asleep, blocked on a continuation (as is the thread of the second task in the example above). You can usually ignore such stacks. The remaining stacks are significant because they reflect what the system was doing at a particular moment and in a particular context (for example, when an NMI or kernel panic occurs).
Thread activations and stacks in the kernel—including those of drivers—belong to the task named kernel_task (under the command column). When you’re debugging a driver, you look in the active stacks in kernel_task for any indication of your driver or its provider, client, or any other object it communicates with. If you add the symbol files for these driver objects before you begin the debugging session, the indication will be much clearer. Listing 7-6 shows an active driver-related thread in kernel_task in the context of adjacent threads.
Listing 7-6 Kernel thread stacks as shown by showallstacks
activation thread pri state wait_queue wait_event
0x0101ac38 0x010957e4 80 UW 0x00311510 0x10b371c <rld_env+13953096>
continuation=0x2227d0 <_ZN10IOWorkLoop22threadMainContinuationEv>
activation thread pri state wait_queue wait_event
0x0101aaf0 0x01095650 80 R
stack_privilege=0x07950000
kernel_stack=0x07950000
stacktop=0x07953b90
0x07953b90 0xdf239e4 <com.apple.driver.AppleUSBProKeyboard + 0x19e4>
0x07953be0 0xe546694 <com.apple.iokit.IOUSBFamily + 0x2694>
0x07953c40 0xe5a84b4 <com.apple.driver.AppleUSBOHCI + 0x34b4>
0x07953d00 0xe5a8640 <com.apple.driver.AppleUSBOHCI + 0x3640>
0x07953d60 0xe5a93bc <com.apple.driver.AppleUSBOHCI + 0x43bc>
0x07953df0 0x2239a8 <_ZN22IOInterruptEventSource12checkForWorkEv+18>
0x07953e40 0x222864 <_ZN10IOWorkLoop10threadMainEv+104>
0x07953e90 0x2227d0 <_ZN10IOWorkLoop22threadMainContinuationEv>
stackbottom=0x07953e90
activation thread pri state wait_queue wait_event
0x0101b530 0x0101c328 80 UW 0x00311500 0x10b605c <rld_env+13963656>
continuation=0x2227d0 <_ZN10IOWorkLoop22threadMainContinuationEv>
You can use showallstacks in debugging panics, hangs, and wedges. For instance, it might reveal a pair of threads that are deadlocked against each other or it might help to identify a thread that is not handling interrupts properly, thus causing a system hang.
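When the showallstacks output runs to dozens of stacks, it can help to save the output to a file and filter it offline. The Python sketch below scans text in the layout of Listing 7-6 and returns the activations whose stack frames mention a given string, such as a bundle identifier. The function name stacks_mentioning is hypothetical, and the parser assumes the line layout shown in the listings above; other kernel versions may print the output differently.

```python
SAMPLE = """\
activation thread pri state wait_queue wait_event
0x0101ac38 0x010957e4 80 UW 0x00311510 0x10b371c <rld_env+13953096>
continuation=0x2227d0 <_ZN10IOWorkLoop22threadMainContinuationEv>
activation thread pri state wait_queue wait_event
0x0101aaf0 0x01095650 80 R
stack_privilege=0x07950000
kernel_stack=0x07950000
stacktop=0x07953b90
0x07953b90 0xdf239e4 <com.apple.driver.AppleUSBProKeyboard + 0x19e4>
0x07953be0 0xe546694 <com.apple.iokit.IOUSBFamily + 0x2694>
0x07953c40 0xe5a84b4 <com.apple.driver.AppleUSBOHCI + 0x34b4>
stackbottom=0x07953e90
"""


def stacks_mentioning(output, needle):
    """Return activation addresses whose stack frames contain needle."""
    hits = []
    current = None   # activation whose stack is being read
    pending = None   # kind of header seen on the previous line
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("task", "activation"):
            pending = fields[0]
            current = None
            continue
        if pending is not None:
            # The row after a header holds the task or activation address.
            current = fields[0] if pending == "activation" else None
            pending = None
            continue
        # A stack frame line starts with two hexadecimal values.
        is_frame = (len(fields) >= 2
                    and fields[0].startswith("0x")
                    and fields[1].startswith("0x"))
        if current and is_frame and needle in line and current not in hits:
            hits.append(current)
    return hits


if __name__ == "__main__":
    print(stacks_mentioning(SAMPLE, "com.apple.iokit.IOUSBFamily"))
```

Continuation lines are deliberately ignored, so sleeping threads that merely name a symbol in their continuation do not show up as hits.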
Another common technique using the kernel debugging macros is to run the showallstacks macro and find the stack or stacks that are most of interest. Then run the switchtoact macro, giving it the address of a thread activation, to switch to the context of that thread and its stack. From there you can get a backtrace, inspect frames and variables, and so on. Listing 7-7 shows this technique.
Listing 7-7 Switching to thread activation and examining it
(gdb) switchtoact 0x00c29a48
(gdb) bt
#0 0x00090448 in cswnovect ()
#1 0x0008f84c in switch_context (old=0xcca194, continuation=0, new=0xccab0c) at
/SourceCache/xnu/xnu-327/osfmk/ppc/pcb.c:235
#2 0x000344b4 in thread_block_reason (continuation=0, reason=0) at
/SourceCache/xnu/xnu-327/osfmk/kern/sched_prim.c:1629
#3 0x000334e0 in thread_sleep_fast_usimple_lock (event=0xeec500, lock=0x30a3ac,
interruptible=213844) at /SourceCache/xnu/xnu-327/osfmk/kern/sched_prim.c:626
#4 0x00081ee0 in kmod_control (host_priv=0xeec500, id=4144, flavor=213844,
data=0xc1202c, dataCount=0xc12048) at /SourceCache/xnu/xnu-327/osfmk/kern/kmod.c:602
#5 0x00045f1c in _Xkmod_control (InHeadP=0xc12010, OutHeadP=0xc12110) at
mach/host_priv_server.c:958
#6 0x0002aa70 in ipc_kobject_server (request=0xc12000) at
/SourceCache/xnu/xnu-327/osfmk/kern/ipc_kobject.c:309
#7 0x000253e4 in mach_msg_overwrite_trap (msg=0xf0080dd0, option=3, send_size=60,
rcv_size=60, rcv_name=3843, timeout=12685100, notify=172953600, rcv_msg=0x0,
scatter_list_size=0) at /SourceCache/xnu/xnu-327/osfmk/ipc/mach_msg.c:1601
#8 0x000257e0 in mach_msg_trap (msg=0xeec500, option=13410708, send_size=213844,
rcv_size=4144, rcv_name=172953600, timeout=178377984, notify=256) at
/SourceCache/xnu/xnu-327/osfmk/ipc/mach_msg.c:1853
#9 0x00092078 in .L_syscall_return ()
#10 0x10000000 in ?? ()
Cannot access memory at address 0xf0080d10
(gdb) f 4
#4 0x00081ee0 in kmod_control (host_priv=0xeec500, id=4144, flavor=213844,
data=0xc1202c, dataCount=0xc12048) at /SourceCache/xnu/xnu-327/osfmk/kern/kmod.c:602
602 res = thread_sleep_simple_lock((event_t)&kmod_cmd_queue,
(gdb) l
597 simple_lock(&kmod_queue_lock);
598
599 if (queue_empty(&kmod_cmd_queue)) {
600 wait_result_t res;
601
602 res = thread_sleep_simple_lock((event_t)&kmod_cmd_queue,
603 &kmod_queue_lock,
604 THREAD_ABORTSAFE);
605 if (queue_empty(&kmod_cmd_queue)) {
606 // we must have been interrupted!
Remember that when you use the switchtoact macro, you actually change the value of the stack pointer, so you are in a different context than before. To return to the former context, use the resetctx macro.
The showallkmods and showkmodaddr macros are also useful in driver debugging. The former macro lists all loaded kernel extensions in a format similar to the kextstat command-line utility (Listing 7-8 shows a few lines of output). If you give the showkmodaddr macro the address of an “anonymous” frame in a stack, and if the frame belongs to a driver (or other kernel extension), the macro prints information about the kernel extension.
Listing 7-8 Sample output from the showallkmods macro
(gdb) showallkmods
kmod address size id refs version name
0x0ebc39f4 0x0eb7d000 0x00048000 71 0 3.2 com.apple.filesystems.afpfs
0x0ea09480 0x0ea03000 0x00007000 70 0 2.1 com.apple.nke.asp_atp
0x0e9e0c60 0x0e9d9000 0x00008000 69 0 3.0 com.apple.nke.asp_tcp
0x0e22b13c 0x0e226000 0x00006000 68 0 1.2 com.apple.nke.IPFirewall
0x0e225600 0x0e220000 0x00006000 67 0 1.2 com.apple.nke.SharedIP
0x0df5d868 0x0df37000 0x00028000 62 0 1.2 com.apple.ATIRage128
0x0de96454 0x0de79000 0x0001e000 55 3 1.3 com.apple.iokit.IOAudioFamily
...
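The address and size columns in the showallkmods output are enough to work out by hand which kernel extension an anonymous frame address belongs to. The Python sketch below illustrates that range check using two rows from Listing 7-8. It is not how the showkmodaddr macro is implemented; the function name find_kmod and the example address are hypothetical.

```python
# (load address, size, name) taken from two rows of Listing 7-8.
KMODS = [
    (0x0DF37000, 0x00028000, "com.apple.ATIRage128"),
    (0x0DE79000, 0x0001E000, "com.apple.iokit.IOAudioFamily"),
]


def find_kmod(address, kmods):
    """Return (name, offset) for the kernel extension containing address.

    Returns None when the address lies outside every listed extension,
    which usually means it belongs to the kernel itself or to an
    extension that is missing from the list.
    """
    for base, size, name in kmods:
        if base <= address < base + size:
            return name, address - base
    return None


if __name__ == "__main__":
    example = 0x0DF38ABC  # hypothetical frame address for illustration
    print(find_kmod(example, KMODS))
```

The offset is the distance from the extension's load address, which is the same form in which stack frames are annotated in Listing 7-6.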
If you hope to become proficient at I/O Kit driver debugging, you’ll have to become proficient in the use of gdb. There’s no getting around this requirement. But even if you are already familiar with gdb, you can always benefit from insights garnered by other driver writers from their experience.
If you don’t have symbols for a driver binary—and even if you do—you should try examining the computer instructions in memory to get a detailed view of what is going on in that binary. You use the gdb command x to examine memory in the current context; usually, x is followed by a slash (“/”) and one to three parameters, one of which is i. The examine-memory parameters are:
A repeat count
The display format: s (string), x (hexadecimal), or i (computer instruction)
The unit size: b (byte), h (halfword), w (word—four bytes), g (giant word—eight bytes)
For example, if you want to examine 10 instructions before and 10 instructions after the current context (as described in “Tips on Debugging Panics”), you could issue a command such as:
(gdb) x/20i $pc-40
This command says “show me 20 instructions, but starting 40 bytes” (4 bytes per instruction) “before the current address in the program counter” (the $pc variable). Of course, you could be less elaborate and give a simple command such as:
(gdb) x/10i 0x001c220c
which shows you 10 computer instructions starting at a specified address. Listing 7-9 shows you a typical block of instructions.
Listing 7-9 Typical output of the gdb “examine memory” command
(gdb) x/20i $pc-40
0x8257c <kmod_control+124>: addi r3,r27,-19540
0x82580 <kmod_control+128>: bl 0x8d980 <get_cpu_data>
0x82584 <kmod_control+132>: addi r0,r30,-19552
0x82588 <kmod_control+136>: lwz r31,-19552(r30)
0x8258c <kmod_control+140>: cmpw r31,r0
0x82590 <kmod_control+144>: bne+ 0x825c0 <kmod_control+192>
0x82594 <kmod_control+148>: mr r3,r31
0x82598 <kmod_control+152>: addi r4,r27,-19540
0x8259c <kmod_control+156>: li r5,2
0x825a0 <kmod_control+160>: bl 0x338a8 <thread_sleep_fast_usimple_lock>
0x825a4 <kmod_control+164>: lwz r0,-19552(r30)
0x825a8 <kmod_control+168>: cmpw r0,r31
0x825ac <kmod_control+172>: bne+ 0x825c0 <kmod_control+192>
0x825b0 <kmod_control+176>: addi r3,r27,-19540
0x825b4 <kmod_control+180>: bl 0x8da00 <fast_usimple_lock+32>
0x825b8 <kmod_control+184>: li r3,14
0x825bc <kmod_control+188>: b 0x82678 <kmod_control+376>
0x825c0 <kmod_control+192>: lis r26,49
0x825c4 <kmod_control+196>: li r30,0
0x825c8 <kmod_control+200>: lwz r0,-19552(r26)
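The arithmetic behind a command such as x/20i $pc-40 is simple but easy to get wrong under pressure. The Python sketch below works it through, assuming fixed 4-byte instructions as the text states for this architecture. The helper names are hypothetical, and the program-counter value is inferred from Listing 7-9 (its first address plus 40 bytes).

```python
INSTRUCTION_SIZE = 4  # bytes per instruction, as stated in the text


def examine_command(before, after, base="$pc"):
    """Build the gdb command showing `before` instructions ahead of base
    and `after` instructions starting at base."""
    count = before + after
    offset = before * INSTRUCTION_SIZE
    return f"x/{count}i {base}-{offset}"


def window_addresses(pc, before, after):
    """List the instruction addresses that the command would display."""
    start = pc - before * INSTRUCTION_SIZE
    return [start + i * INSTRUCTION_SIZE for i in range(before + after)]


if __name__ == "__main__":
    pc = 0x825A4  # inferred from Listing 7-9: 0x8257c + 40
    print(examine_command(10, 10))
    addresses = window_addresses(pc, 10, 10)
    print(hex(addresses[0]), hex(addresses[-1]))
```

Note that the 10 instructions "after" begin with the instruction at the program counter itself, so the last address shown is 9 instructions past it.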
Needless to say, you need to know some assembler in order to make sense of the output of the examine-memory command. You don’t need to be an expert in assembler, just knowledgeable enough to recognize patterns. For example, it would be beneficial to know how pointer indirection with an object looks in computer instructions. With an object, there are two indirections really, one to get the data (and that could be null) and one to an object’s virtual table (the first field inside the object). If that field doesn’t point to either your code or kernel code, then there’s something that might be causing a null-pointer exception. If your assembler knowledge is rusty or non-existent, you can examine the computer instructions for your driver’s code that you know to be sound. By knowing how “healthy” code looks in assembler, you’ll be better prepared to spot divergences from the pattern.
Using breakpoints to debug code inside the kernel can be a frustrating experience. Often kernel functions are called so frequently that, if you put a breakpoint on a function, it’s difficult to determine which particular case is the one with the problem. There are a few things you can do with breakpoints to ameliorate this.
Conditional breakpoints. A conditional breakpoint tells gdb to trigger a breakpoint only if a certain expression is true. The syntax is cond <breakpoint index> <expression>. An example is the following:
(gdb) cond 1 (num > 0)
However, conditional breakpoints are very slow in two-machine debugging. Unless you’re expecting the breakpoint expression to be evaluated only a couple dozen times or so, they are probably too tedious to rely on.
Cooperative breakpoints. To speed things up you can use two breakpoints that cooperate with each other. One breakpoint is a conditional breakpoint set at the critical but frequently invoked function. The other breakpoint, which has a command list attached to it, is set at a point in the code which is only arrived at after a series of events has occurred. You initially disable the conditional breakpoint and the second breakpoint enables it at some point later where the context is more pertinent to the problem you’re investigating (and so you don’t mind the slowness of expression evaluation). The following series of gdb commands sets up both breakpoints:
(gdb) cond 1 (num > 0)
(gdb) disable 1
(gdb) com 2
enable 1
continue
end
If this debugging is something you do frequently, you can put breakpoint-setup commands into a macro and put that macro in your .gdbinit file.
Breakpoints at dummy functions. You can use the previous two techniques on code that you do not own. However, if it’s your code that you’re debugging, the fastest way to trigger a breakpoint exactly when and where you want is to use dummy functions. Consider the following stripped-down code snippet:
void dummy() { |
; |
} |
void myFunction() { |
// ... |
if (num > 0) |
dummy(); |
// ... |
} |
The expression in myFunction is the same exact expression you would have in a conditional breakpoint. Just set a breakpoint on the dummy function. When the breakpoint is triggered, get a backtrace, switch frames, and you’re in the desired context.
Single-stepping through source code does not necessarily take you from one line to the next. You can bounce around in the source code quite a bit because an optimizing compiler reorders and interleaves the instructions it generates for adjacent source lines.
There are two things you can do to get around this. If it’s your code you’re stepping through, you can turn off optimizations. Or you can single-step through the computer instructions in assembler because one line of source code typically generates several consecutive lines of assembler. So, if you find it hard to figure things out by single-stepping through source, try single-stepping through assembler.
To single-step in gdb, use the stepi command (si for short). You can get a better view of your progress if you also use the display command, as in this example:
(gdb) display/4i $pc
After each step, this displays four instructions starting at the program counter: the instruction about to be executed and the three that follow it.
You might be familiar with kernel panics: those unexpected events that cripple a system, leaving it completely unresponsive. When a panic occurs on Mac OS X, the kernel prints information about the panic that you can analyze to find the cause of the panic. On pre-Jaguar systems, this information appears on the screen as a black and white text dump. Starting with the Jaguar release, a kernel panic causes the display of a message informing you that a problem occurred and requesting that you restart your computer. After rebooting, you can find the debug information on the panic in the file panic.log at /Library/Logs/.
If you’ve never seen it before, the information in panic.log might seem cryptic. Listing 7-10 shows a typical entry in the panic log.
Listing 7-10 Sample log entry for a kernel panic
Unresolved kernel trap(cpu 0): 0x300 - Data access DAR=0x00000058 PC=0x0b4255b4
Latest crash info for cpu 0:
Exception state (sv=0x0AD86A00)
PC=0x0B4255B4; MSR=0x00009030; DAR=0x00000058; DSISR=0x40000000; LR=0x0B4255A0;
R1=0x04DE3B50; XCP=0x0000000C (0x300 - Data access)
Backtrace:
0x0B4255A0 0x000BA9F8 0x001D41F8 0x001D411C 0x001D6B90 0x0003ACCC
0x0008EC84 0x0003D69C 0x0003D4FC 0x000276E0 0x0009108C 0xFFFFFFFF
Kernel loadable modules in backtrace (with dependencies):
com.acme.driver.MyDriver(1.6)@0xb409000
Proceeding back via exception chain:
Exception state (sv=0x0AD86A00)
previously dumped as "Latest" state. skipping...
Exception state (sv=0x0B2BBA00)
PC=0x90015BC8; MSR=0x0200F030; DAR=0x012DA94C; DSISR=0x40000000; LR=0x902498DC;
R1=0xBFFFE140; XCP=0x00000030 (0xC00 - System call)
Kernel version:
Darwin Kernel Version 6.0:
Wed May 1 01:04:14 PDT 2002; root:xnu/xnu-282.obj~4/RELEASE_PPC
This block of information has several different parts, each with its own significance for debugging the problem.
The first line. The single most important bit of information about a panic is the first line, which briefly describes the nature of the panic. In this case, the panic has something to do with a data access exception. The registers that appear on the same line as the message are the ones with the most pertinent data; in this case they are the DAR (Data Access Register) and the PC (Program Counter) registers. The registers shown on the first line vary according to the type of kernel trap. The hexadecimal code before the description, which is defined in /xnu/osfmk/ppc_init.c, indicates the exception type. Table 7-3 describes the possible types of exceptions.
Table 7-3 Types of kernel traps
| Trap Value | Type of Kernel Trap |
|---|---|
| 0x100 | System reset |
| 0x200 | Machine check |
| 0x300 | Data access |
| 0x400 | Instruction access |
| 0x500 | External interrupt |
| 0x600 | Alignment exception |
| 0x700 | Illegal instruction |
The registers. Under the first “Exception state” is a snapshot of the contents of the CPU registers when the panic occurred.
The backtrace. Each hexadecimal address in the backtrace indicates the state of program execution at a particular point leading up to the panic. Because each of these addresses is actually that of the function return pointer, you need to subtract four from this address to see the instruction that was executed.
The kernel extensions. Under “Kernel loadable modules in backtrace” are the bundle identifiers (CFBundleIdentifier property) of all kernel extensions referenced in the backtrace and all other kernel extensions on which these extensions have dependencies. These are the kernel extensions for which you’ll probably want to generate symbol files prior to debugging the panic.
The other exception states. Under “Proceeding back via exception chain:” are the previous exception states the kernel experienced, separated by snapshots of the contents of the CPU registers at the time the exceptions occurred. Most of the time, the first exception state (immediately following the first line of the panic log) gives you enough information to determine what caused the panic. Sometimes, however, the panic is the result of an earlier exception and you can examine the chain of exceptions for more information.
The kernel version. The version of the Darwin kernel and, more importantly, the build version of the xnu project (the core part of the Darwin kernel). If you are debugging with a symboled kernel (as is recommended), you need to get or build the symboled kernel from this version of xnu.
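The parts of a panic-log entry described above can also be pulled apart programmatically, which is handy when you have several logs to compare. The Python sketch below extracts the trap code, the registers on the first line, the backtrace addresses, and the kernel extensions from an abbreviated copy of Listing 7-10. The function name parse_panic_entry and the regular expressions are assumptions based only on the layout of that listing; panic logs from other kernel versions may differ.

```python
import re

HEX = r"0x[0-9A-Fa-f]+"

PANIC_LOG = """\
Unresolved kernel trap(cpu 0): 0x300 - Data access DAR=0x00000058 PC=0x0b4255b4
Latest crash info for cpu 0:
Backtrace:
0x0B4255A0 0x000BA9F8 0x001D41F8 0x001D411C 0x001D6B90 0x0003ACCC
0x0008EC84 0x0003D69C 0x0003D4FC 0x000276E0 0x0009108C 0xFFFFFFFF
Kernel loadable modules in backtrace (with dependencies):
com.acme.driver.MyDriver(1.6)@0xb409000
"""


def parse_panic_entry(text):
    """Split one panic-log entry into its main parts."""
    lines = text.splitlines()
    first = lines[0]

    # First line: trap code, description, then the most pertinent registers.
    trap = re.search(rf"({HEX}) - (.*?)\s+\w+={HEX}", first)
    registers = {name: int(value, 16)
                 for name, value in re.findall(rf"(\w+)=({HEX})", first)}

    # Backtrace: every line of bare hex addresses following "Backtrace:".
    backtrace = []
    in_backtrace = False
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "Backtrace:":
            in_backtrace = True
            continue
        if in_backtrace:
            tokens = stripped.split()
            if tokens and all(re.fullmatch(HEX, t) for t in tokens):
                backtrace.extend(int(t, 16) for t in tokens)
            else:
                in_backtrace = False

    # Kernel extensions: bundle-identifier(version)@load-address.
    kexts = [(name, version, int(address, 16))
             for name, version, address in re.findall(
                 rf"^\s*([\w.]+)\(([^)]*)\)@({HEX})", text, re.M)]

    return {
        "trap_code": int(trap.group(1), 16) if trap else None,
        "trap_name": trap.group(2) if trap else None,
        "registers": registers,
        "backtrace": backtrace,
        "kexts": kexts,
    }


if __name__ == "__main__":
    entry = parse_panic_entry(PANIC_LOG)
    print(hex(entry["trap_code"]), entry["trap_name"])
    print([hex(address) for address in entry["backtrace"]])
    print(entry["kexts"])
```

The load addresses in the kexts list are the values kextload asks for when you generate relocated symbol files.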
There are many possible ways to debug a kernel panic, but the following course of action has proven fruitful in practice.
Get as many binaries with debugging symbols as possible.
Make a note of all the kernel extensions listed under “Kernel loadable modules in backtrace”. If you don’t have debugging symbols for some of them, try to obtain a symboled version of them or get the source and build one with debugging symbols. This would include mach_kernel, the I/O Kit families, and other KEXTs that are part of the default install. You need to have the same version of the kernel and KEXT binaries that the panicked computer does, or the symbols won’t line up correctly.
Generate and add symbol files for each kernel extension in the backtrace.
Once you’ve got the kernel extension binaries with (or without) debugging symbols, generate relocated symbol files for each KEXT in the backtrace. Use kextload with the -s and -n options to do this; kextload prompts you for the load address of each kernel extension, which you can get from the backtrace. Alternatively, you can specify the -a option with -s when using kextload to specify KEXTs and their load addresses. Although you don’t need to relocate symbol files for all kernel extensions, you can only decode stack frames in the kernel or in KEXTs that you have done this for. After you run gdb on mach_kernel (preferably symboled), use gdb’s add-symbol-file command for each relocated symbol file you’ve generated; see “Setting Up for Two-Machine Debugging” for details.
Decode the addresses in the panic log.
Start with the PC register and possibly the LR (Link Register). (The contents of the LR should look like a valid text address, usually a little smaller than the PC-register address.) Then process each address in the backtrace, remembering to subtract four from each of the stack addresses to get the last instruction executed in that frame. One possible way to go about it is to use a pair of gdb commands for each address:
(gdb) x/i <address>-4
...
(gdb) info line *<address>-4
You need the asterisk in front of the address in the info command because you are passing a raw address rather than the symbol gdb expects. The x command, on the other hand, expects a raw address so no asterisk is necessary.
Listing 7-11 gives an example of a symbolic backtrace generated from x/i <address>-4. You’ll know you’ve succeeded when all the stack frames decode to some sort of branch instruction in assembler.
Interpret the results.
Interpreting the results of the previous step is the hardest phase of debugging panics because it isn’t mechanical in nature. See the following section, “Tips on Debugging Panics,” for some suggestions.
Listing 7-11 Example of symbolic backtrace
(gdb) x/i 0x001c2200-4
0x1c21fc <IOService::PMstop(void)+320>: bctrl
0xa538260 <IODisplay::stop(IOService *)+36>: bctrl
0x1bbc34 <IOService::actionStop(IOService *, IOService *)+160>: bctrl
0x1ccda4 <runAction__10IOWorkLoopPFP8OSObjectPvn3_iPB2Pvn3+92>: bctrl
0x1bc434 <IOService::terminateWorker(unsigned long)+1824>: bctrl
0x1bb1f0 <IOService::terminatePhase1(unsigned long)+928>: bl 0x1bb20c <IOService::scheduleTerminatePhase2(unsigned long)>
0x1edfcc <IOADBController::powerStateWillChangeTo(unsigned long, unsigned long, IOService *)+88>: bctrl
0x1c54e8 <IOService::inform(IOPMinformee *, bool)+204>: bctrl
0x1c5118 <IOService::notifyAll(bool)+84>: bl 0x1c541c <IOService::inform(IOPMinformee *, bool)>
0x1c58b8 <IOService::parent_down_05(void)+36>: bl 0x1c50c4 <IOService::notifyAll(bool)>
0x1c8364 <IOService::allowCancelCommon(void)+356>: bl 0x1c5894 <IOService::parent_down_05(void)>
0x1c80a0 <IOService::serializedAllowPowerChange2(unsigned long)+84>: bl 0x1c8200 <IOService::allowCancelCommon(void)>
0x1ce198 <IOCommandGate::runAction(int (*)(OSObject *, void *, void *, void *, void *), void *, void *, void *, void *)+184>: bctrl
0x1c802c <IOService::allowPowerChange(unsigned long)+72>: bctrl
0xa52e6c ????
0x3dfe0 <_call_thread_continue+440>: bctrl
0x333fc <thread_continue+144>: bctrl
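Typing the pair of gdb commands for every frame is tedious, so the decoding step lends itself to a small helper. The following is a minimal sketch in C, not part of the procedure as documented: the function names call_site and format_gdb_commands are hypothetical, and the code assumes 32-bit PowerPC addresses with 4-byte instructions, as in the panic logs discussed here. It subtracts four from a backtrace address and formats the matching x/i and info line commands.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Return the address of the last instruction executed in a frame.
 * A backtrace entry is a return address, which points just past the
 * 4-byte branch that made the call, so the call itself is 4 bytes back. */
uint32_t call_site(uint32_t return_address)
{
    assert(return_address >= 4);   /* guard against wrapping below zero */
    return return_address - 4;
}

/* Format the two gdb commands for one backtrace address into buf.
 * Returns what snprintf returns: the length of the formatted text,
 * not counting the terminating null character. */
int format_gdb_commands(char *buf, size_t size, uint32_t return_address)
{
    uint32_t site = call_site(return_address);

    return snprintf(buf, size,
                    "x/i 0x%08" PRIx32 "\n"
                    "info line *0x%08" PRIx32 "\n",
                    site, site);
}
```

Calling format_gdb_commands once for each address in the panic log and writing the results to a text file gives you a command file that gdb can read with its source command, once the kernel and the relocated symbol files are loaded. For the first address in Listing 7-11, 0x001c2200, the helper produces commands for 0x001c21fc, which matches the first decoded frame.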
The following tips might make debugging a panic easier for you.
As noted previously, always pay attention to the first line of the panic message and the list of kernel extensions involved in the panic. The panic message provides the major clue to the problem. Generate relocated symbol files for the listed kernel extensions so that you can use them in the debugging session.
Don’t assume that the panic is not your driver’s fault just because it doesn’t show up in the backtrace. Passing a null pointer to an I/O Kit family or any other body of kernel code will cause a panic in that code. Because the kernel doesn’t have the resources to protect itself from null pointers, drivers must be extremely vigilant against passing them in.
The showallstacks kernel debugging macro is very useful for debugging panics. See “Using the Kernel Debugging Macros” for more information.
Use gdb’s $pc variable when debugging panics with gdb. The $pc variable holds the value of the program counter, which identifies the place where the panic exception was taken. If you want to examine the context of the panic, you could issue a command such as:
(gdb) x/20i $pc-40
This displays 10 instructions in assembler before and after the point where the panic occurred. If you have the appropriate source code and symbols, you can enter:
(gdb) l *$pc
This shows the particular line of code that took the panic.
If you have a panic caused by an Instruction Access Exception (0x400) and the PC register is zero, it means that something in the kernel branched to zero. The top frame in the stack is usually a jump through some function pointer that wasn’t initialized (or that somehow got “stomped”).
Panics caused by a Data Access Exception (0x300) are quite common. These types of panics typically involve indirection through zero (in other words, a dereferenced null pointer). Any time a null pointer is dereferenced, a panic results. When you get a Data Access Exception, first check the DAR register; if the value is less than about 1000, the panic is probably the result of indirection through a null pointer. This is because most classes are no larger than about 1000 bytes, so when a member offset is added to a null pointer, the result is about 1000 or less. If the value is much larger, it’s probable that the location of a pointer has been trashed (as opposed to its contents) and the contents of an unknown location are being used as a pointer in your code.
A null pointer implies the possibility of a race condition with a shutdown or completion routine. When called, a completion routine starts to free resources, and if your driver references these resources after the routine is called, it will probably get null pointers back.
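The DAR rule of thumb can be written down as a tiny classifier. This is only a sketch of the heuristic described above, not a diagnostic tool: the names classify_dar, dar_guess_t, and NULL_OFFSET_LIMIT are hypothetical, and the cutoff of 1000 is simply the approximate figure given in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cutoff, taken from the "about 1000" rule of thumb. */
#define NULL_OFFSET_LIMIT 1000u

typedef enum {
    DAR_LIKELY_NULL_DEREF,   /* small member offset added to a null pointer */
    DAR_LIKELY_BAD_POINTER   /* the pointer value itself is probably garbage */
} dar_guess_t;

/* Guess the cause of a Data Access Exception from the faulting address. */
dar_guess_t classify_dar(uint32_t dar)
{
    return (dar < NULL_OFFSET_LIMIT) ? DAR_LIKELY_NULL_DEREF
                                     : DAR_LIKELY_BAD_POINTER;
}

/* Worked examples: 0x48 looks like a member access through a null object
 * pointer; 0xdeadbeef is far too large to be a member offset. */
void classify_dar_examples(void)
{
    assert(classify_dar(0x48u) == DAR_LIKELY_NULL_DEREF);
    assert(classify_dar(0xdeadbeefu) == DAR_LIKELY_BAD_POINTER);
}
```

The result is only a starting hypothesis. A small DAR points you toward finding which object pointer was null; a large one suggests looking for code that overwrote the pointer or read it from the wrong location.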
If you get a panic in kalloc, kmem, or IOMalloc, it suggests that you’re using a freed pointer. Even if your driver code doesn’t show up in the backtrace, your driver could be the culprit. Using a freed pointer is likely to break the kernel’s internal data structures for allocation.
Panics can also be caused by accidentally scribbling on someone else’s memory. These “land mines” can be notoriously hard to debug; a backtrace doesn’t show you much, except that something bad has happened. If the panic is reproducible, however, you have a chance to track down the offending code by using a “probe” macro. A probe macro helps to bracket exactly where in the code the scribbling happened. By definition, a scribble is a byte that does not have the expected contents (because they were altered by some other code). By knowing where the panic occurred and where the byte last held its expected value, you know where to look for the scribbling code.
Often it’s the case that your driver is the offending scribbler. To find your offending code (if any), define a probe macro that tests whether a memory address (A in the example below) has the expected value (N). (Note that A is the address in the DAR register of the original scribble panic.) If A doesn’t have the expected value, then cause a panic right then and there:
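UInt32 N = 123;     // expected value
UInt32 *A;          // address to watch
A = &N;
// ...
// Panic immediately if the watched location no longer holds N.
#define PROBE() do { \
    if (A && *A != N) \
        *(UInt32 *)0 = 0;   /* deliberate write to address 0 */ \
} while (0)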
By putting the probe macro in every function where A and N appear, you can narrow down the location of the scribble. If your driver is not the one doing the scribbling, it still might be indirectly responsible because it could be causing other code to scribble. For example, your driver might be asking some other code to write into your address space; if it has passed that code the wrong address, the result might be scribbled data inside your driver.
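To see how probes bracket a scribble, it can help to try the idea in user space first. The following sketch is an illustration only and makes several assumptions: instead of forcing a panic, the probe records a numeric site identifier; the functions good_step, buggy_step, and later_step are hypothetical stand-ins for driver code; and the scribble is planted deliberately so that there is something to find.

```c
#include <assert.h>
#include <stdint.h>

static const uint32_t N = 123;        /* expected contents */
static uint32_t watched_value = 123;  /* the location being scribbled on */
static uint32_t *A = &watched_value;  /* address to watch */
static int first_bad_site = 0;        /* 0 means no probe has fired yet */

/* Record the first probe site at which *A no longer holds N.
 * In a driver this is where you would force the panic instead. */
#define PROBE(site_id) do {                               \
        if (first_bad_site == 0 && A && *A != N)          \
            first_bad_site = (site_id);                   \
    } while (0)

void good_step(void)
{
    PROBE(1);
    /* work that leaves *A alone */
    PROBE(2);
}

void buggy_step(void)
{
    PROBE(3);
    *A = 0xdeadbeefu;                 /* the planted scribble */
    PROBE(4);
}

void later_step(void)
{
    PROBE(5);
    PROBE(6);
}

/* Run the steps in order and report which probe fired first. */
int find_scribble(void)
{
    watched_value = N;                /* start from the expected contents */
    first_bad_site = 0;
    good_step();
    buggy_step();
    later_step();
    assert(first_bad_site != 0);      /* this demo always scribbles */
    return first_bad_site;
}
```

Here find_scribble returns 4: probe 3 passed and probe 4 failed, so the scribble lies between them in buggy_step. Without probes, the damage would not surface until some later code used the value, which is why a plain backtrace is so unhelpful for this kind of bug.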
System hangs are, after kernel panics, the most serious condition caused by badly behaved kernel code. A hung system may not be completely unresponsive, but it is unusable because you cannot effectively click the mouse button or type a key. You can categorize system hangs, and their probable cause, by the behavior of the mouse cursor.
Cursor doesn’t spin and won’t move. This symptom indicates that a primary interrupt is not being delivered. The cursor doesn’t even spin because that behavior depends on a primary interrupt; the fact that it is not spinning indicates that the system is in a very tight loop. What has probably happened is that a driver object has disabled an interrupt, causing code somewhere in the driver stack to go into an infinite loop. In other words, a piece of hardware has raised an interrupt but the driver that should handle it is not handling it, and so the hardware keeps raising it. The driver has probably caused this “ghost” interrupt, but is unaware that it has and so is not clearing it.
Cursor spins but won’t move. This symptom indicates that a high-priority thread such as a timer is spinning.
Cursor spins and moves, but nothing else. This symptom suggests that the USB thread is still scheduled, thus indicating a deadlock in some driver object that is not related to USB or HI.
For system hangs with the first symptom—the cursor doesn’t spin and won’t move—your first aim should be to find out what caused the interrupt. Why is the hardware controlled by your driver raising the interrupt? If your driver is using a filter interrupt event source (IOFilterInterruptEventSource), you might want to investigate that, too. With a filter event source a driver can ignore interrupts that it thinks aren’t its responsibility.
With any system hang, you should launch gdb on the kernel, attach to the hung system and run the showallstacks macro. Scan the output for threads that are deadlocked against each other. Or, if it’s an unhandled primary interrupt that you suspect, find the running thread; if it is the one that took the interrupt, it is probably the thread that’s gone into an infinite loop. If the driver is in an infinite loop, you can set a breakpoint in a frame of the thread’s stack that is the possible culprit; when you continue and hit the breakpoint almost immediately, you know you’re in an infinite loop. You can single-step from there to find the problem.
The Mac OS X BootX booter copies drivers for hardware required in the boot process into memory for the kernel’s boot-time loading code to load. Because boot drivers are already loaded by the time the system comes up, you do not have as much control over them as you do over non-boot drivers. In addition, a badly behaving boot driver can cause your system to become unusable until you are able to unload it. For these reasons, debugging techniques for boot drivers vary somewhat from those for other drivers.
The most important step you can take is to treat your boot driver as a non-boot driver while you are in the development phase. Remove the OSBundleRequired property from your driver’s Info.plist file and use the techniques described in this chapter to make sure the driver is performing all its functions correctly before you declare it to be a boot driver.
After you’ve thoroughly tested your driver, add the OSBundleRequired property to its Info.plist (see the document Loading Kernel Extensions at Boot Time to determine which value your driver should declare). This will cause the BootX booter to load your driver into memory during the boot process.
If your boot driver does have bugs you were unable to find before, you cannot use gdb to debug it because it is not possible to attach to a computer while it is booting. Instead, you must rely on IOLog output to find out what is happening. IOLog is synchronous when you perform a verbose boot so you can use IOLog statements throughout your boot driver’s code to track down the bugs. See “Using IOLog” for more information on this function.
To perform a verbose boot, reboot holding down both the Command and V keys. To get even more detail from the I/O Kit, you can set a boot-args flag before rebooting. Assuming root privileges with the sudo command, type the following on the command line:
% sudo nvram boot-args="io=0xffff"
Password:
% sudo shutdown -r now
Although this technique produces voluminous output, it can be difficult to examine because it scrolls off the screen during the boot process. If your boot driver does not prevent the system from completing the boot process, you can view the information in its entirety in the system log at /var/log/system.log.
Last updated: 2007-03-06