Debugging Drivers

The hardest part of debugging code—especially kernel code—is not learning how to use the tools but how to analyze the data generated by the tools. The success you have with your analysis depends to a large degree on the skills and experience you bring to it, and these can always be enhanced by learning from others who have “earned their stripes” debugging kernel code.

This chapter offers a collection of driver-debugging tips and techniques from Apple engineers. It starts by listing a few general suggestions for driver debugging and itemizing the tools at your disposal. It then discusses these general areas of driver debugging:

Some Debugging Basics

Before delving into the nitty-gritty of driver debugging, it might be worth your while to consider a few general debugging tips and to survey the debugging tools available to you.

General Tips

To help you get started, here are a few debugging tips based on the experiences of Apple kernel engineers.

  • Develop in incremental steps. In other words, don’t write a big chunk of driver code before debugging it. And at each stage verify that the driver can unload as well as load; a driver that can’t unload suggests a memory leak (most likely caused by a missing release somewhere).

    As a suggestion, get driver matching and the basic driver life-cycle methods working first (especially start). Then begin talking with your hardware. Once you’re sure that’s working properly, then write the code that talks with the rest of the system.

  • Keep earlier versions of your code. If bugs start to appear in your code, you can “diff” the current version with the prior version to locate exactly which code is new (and thus the probable source of the problem).

  • Obtain the source code for all other parts of the kernel that your driver interacts with or depends on (directly or indirectly). Things go much easier when you have full source-level debugging.

  • Obtain symbols for each kernel extension that might affect, or be affected by, your driver. This would include your driver’s provider and its clients. For the full gamut of symbols, obtain a symboled kernel. In particular, a symboled kernel is required to use the kernel debugging macros (described in “Using the Kernel Debugging Macros”).

    See “Setting Up for Two-Machine Debugging” for instructions on generating symbol files for kernel extensions. To obtain a symboled kernel, you might have to build the Darwin kernel from the open source. See Building and Debugging Kernels in Kernel Programming Guide for details.

  • Learn gdb thoroughly. Start by reading Debugging with GDB, provided as part of Tools Documentation.

  • Become familiar with how computer instructions look in assembler (see “Examining Computer Instructions”).

  • Every now and then, single-step through every line of code to make sure your driver’s doing exactly what you want it to.

  • Don’t overly rely on calls to IOLog to provide you with debugging information (see “Using IOLog”).

  • Make sure your KEXT includes only header files from Kernel.framework, in addition to the header files you define. If you include other headers, although their definitions may be in scope at compile time, the functions and services they define will not be available in the kernel environment.

Issues With 64-Bit Architectures

In OS X version 10.3, Apple introduced some changes to support the new 64-bit architectures. Because the changes are implemented in the operating system itself, some drivers may be affected even if they aren’t running on a 64-bit computer. To better explain these changes, this section first provides a brief description of PCI address translation.

Hardware devices on the PCI bus can handle 32-bit addresses, which means that a PCI device has a 4-gigabyte window into main memory. When a PCI device performs a data transaction to or from main memory, the device driver prepares this memory for I/O. On systems where both the memory subsystem and the PCI device drivers use 32-bit addressing, there are no difficulties. On systems where the memory subsystem uses 64-bit addressing, however, a PCI device can see only four gigabytes of main memory at a time. To address this issue, Apple chose to implement address translation. In this scheme, blocks of memory are mapped into the 32-bit address space of PCI devices. The PCI device still has a 4-gigabyte window into main memory, but that window may contain noncontiguous blocks of main memory. The address translation is performed by a part of the memory controller called the DART (device address resolution table). The DART keeps a table of translations to use when mapping between the physical addresses the processor sees and the addresses the PCI device sees (called I/O addresses).

If your driver adheres to documented, Apple-provided APIs, this address-translation process is transparent. For example, when your driver calls IOMemoryDescriptor’s prepare method, a mapping is automatically placed in the DART. Conversely, when your driver calls IOMemoryDescriptor’s release method, the mapping is removed. Although this has always been the recommended procedure, failure to do so may not have resulted in undesirable behavior in previous versions of OS X. Beginning in OS X version 10.3, however, failure to follow this procedure may result in random data corruption or panics.

If your driver experiences difficulty on an OS X version 10.3 or later system, you should check that you are following these guidelines:

  • Always call IOMemoryDescriptor::prepare to prepare the physical memory for the I/O transfer (in OS X version 10.3 and later, this also places a mapping into the DART).

  • Balance each call to prepare with a call to complete to unwire the memory.

  • Always call IOMemoryDescriptor::release to remove the mapping from the DART.

  • On hardware that includes a DART, pay attention to the DMA direction for reads and writes. On a 64-bit system, a driver that attempts to write to a memory region whose DMA direction is set up for reading will cause a kernel panic.

A side effect of the changes in the memory subsystem is that OS X is much more likely to return physically contiguous page ranges in memory regions. In earlier versions of OS X, the system returned multi-page memory regions in reverse order, beginning with the last page and going on towards the first page. Because of this, a multi-page memory region seldom contained a physically contiguous range of pages.

The greatly increased likelihood of seeing physically contiguous page ranges in memory regions might expose latent bugs in drivers that previously did not have to handle physically contiguous pages. If your driver is behaving incorrectly or panicking, be sure to investigate this possibility.

Another result of the memory-subsystem changes concerns the physical addresses some drivers obtain directly from the pmap layer. Because there is not a one-to-one correspondence between physical addresses and I/O addresses, physical addresses obtained from the pmap layer have no purpose outside the virtual memory system itself. Drivers that use pmap calls to get such addresses (such as pmap_extract) will fail to function on systems with a DART. To prevent the use of these calls, OS X version 10.3 will refuse to load a kernel extension that uses them, even on systems without a DART.

Driver-Debugging Tools

Apple provides a number of tools that you can use to debug I/O Kit device drivers. Table 7-1 presents some of the more important tools.

Table 7-1  Debugging aids for kernel extensions

Tool or API

Description

gdb

The GNU debugger, used in two-machine kernel debugging. See “Tips on Using gdb” for some useful information on gdb.

Kernel debugging macros

Powerful macros designed for use in gdb while debugging the Darwin kernel. The macros are contained in the file .gdbinit in the Open Source directory location /xnu/osfmk/. See “Using the Kernel Debugging Macros” for more information.

kextload

A utility that does a variety of things with kernel extensions. It loads kernel extensions and, prior to loading, validates the extensions, providing diagnostic information related to file-permission and dependency errors. It also enables the debugging of kernel extensions (using gdb) during the matching and loading phases. See “Loading Your Driver” for more on kextload options.

kextstat

A utility that prints columns of information about the kernel extensions that are currently loaded in a system. The information most useful to driver debugging is the driver’s load address, references held by the driver to other kernel extensions, the number of references other extensions have to it, and the version of a kernel extension. See “Unloading Your Driver” for more information.

ioreg

Provides a snapshot of the I/O Registry, showing the hierarchical structure of client–provider relationships among current drivers and nubs in the system. With the necessary options, it shows the properties associated with each node in the registry. The I/O Registry Explorer shows the same information in a GUI interface. Both tool and application are particularly useful for debugging problems with matching. See “Debugging Matching Problems.”

ioalloccount ioclasscount

The former tool displays a summary of memory allocation by allocator type (instance, container, and IOMalloc). The latter tool shows the number of instances allocated for each specified class. Both tools are useful for tracking down memory leaks. See “Unloading Your Driver” for more information.

IOKitDebug.h

Defines a set of values for the IOKitDebug property. When this property is defined, the system writes the status of various driver events, such as attaching, matching, and probing, to the system log. See “Debugging Matching Problems” for further information.

Debugging Matching and Loading Problems

Before you begin debugging your driver’s hardware-specific functionality, you should first ensure that your KEXT is valid and that it loads, unloads, and matches properly. This section describes how to use OS X tools, such as kextload and kextstat, to authenticate and validate your KEXT, resolve its dependencies, and test its matching.

Driver Dependencies

Every driver declares its dependencies on other loadable kernel extensions (such as an I/O Kit family), kernel subcomponents (such as com.apple.kernel.iokit), or KPIs (such as com.apple.kpi.libkern) in its Info.plist file. At the top level of the Info.plist file, a driver lists each dependency in the form of a key-value pair in the OSBundleLibraries dictionary; the key is the KEXT or kernel subcomponent name and the value is the KEXT’s or subcomponent’s version number.

In OS X v10.2, declaring a dependency on the I/O Kit kernel subcomponent com.apple.kernel.iokit (version 6.x) implicitly allows your driver to access symbols in two other kernel subcomponents: com.apple.kernel.mach and com.apple.kernel.bsd.

In OS X v10.3, a driver that declares a dependency on com.apple.kernel.iokit version 6.x (the 10.2 version) will still enjoy automatic access to Mach and BSD symbols.

In OS X v10.4, Apple introduced sustainable kernel programming interfaces, or KPIs. In particular, the KPIs support the development of NKEs (network kernel extensions), file-system KEXTs, and other non-I/O Kit KEXTs. KPIs are also recommended for pure I/O Kit KEXTs (KEXTs that use only I/O Kit-provided APIs) that target OS X v10.4 and later. Be aware, however, that no KEXT can declare dependencies on a combination of both kernel subcomponents and KPIs. To learn more about KPIs, see Kernel Framework Reference; to find out how to declare dependencies on them, see Kernel Extension Dependencies.

Using kextload, kextunload, and kextstat

During the development process, it’s a good idea to load and unload your driver regularly to catch problems in these processes before they become very difficult to find. The kextload tool allows you to load your driver at the command line and performs a wide range of authentication and validation tests. kextload also allows you to debug your driver’s start and probe routines before matching begins. The kextunload tool terminates and unregisters the I/O Kit objects associated with your KEXT and unloads the code and personalities for that KEXT. For complete information on kextload, kextunload, and kextstat, view the man pages for these tools in a Terminal window.

You can use kextload to load and start your KEXT with or without starting I/O Kit matching, but kextload provides no information about the matching process itself. If your driver isn’t matching properly, see “Debugging Matching Problems” for more information.

Loading Your Driver

The kextload tool checks your driver’s Info.plist file thoroughly before it attempts to load your driver into the kernel. If required properties are missing or incorrectly identified, if your declared dependencies are missing or incompatible, or if your driver has the wrong permissions or ownership, kextload refuses to load your KEXT. To find out why your KEXT doesn’t load, run kextload with the -t or the -tn option. This tells kextload to perform all its tests on your KEXT and provide you with a dictionary of the errors it finds (the addition of the -n option prevents kextload from loading the KEXT even if it finds no errors). For example, Listing 7-1 shows the Info.plist of a KEXT named BadKEXT, that contains three errors.

Listing 7-1  Example Info.plist containing errors

<key>IOKitPersonalities</key>
<dict>
    <key>BadKEXT</key>
    <dict>
        <key>CFBundleIdentifier</key>
        <string>com.MySoftwareCompany.driver.BadKEXT</string>
 
        <key>IOClass></key>
        <string>com_MySoftwareCompany_driver_BadKEXT</string>
 
        <!-- No floating point numbers are allowed in the kernel. -->
        <key>IOKitDebug</key>
        <number>0.0</number>
 
        <key>IOMatchCategory</key>
        <string>com_MySoftwareCompany_driver_BadKEXT</string>
 
        <key>IOResourceMatch</key>
        <string>IOKit</string>
 
        <!-- The personality is missing its IOProviderClass key-value pair. -->
    </dict>
</dict>
<key>Libraries</key>
<dict>
    <key>com.apple.kernel.iokit</key>
    <string>1.1</string>
 
    <!-- com.apple.kernel.libkern is misspelled. -->
    <key>com.apple.kernel.libker</key>
    <string>1.1</string>
 
    <key>com.apple.kernel.mach</key>
    <string>1.1</string>
</dict>

In addition, BadKEXT has ownership and permissions problems: It is owned by the user user_name and the group staff and its folders have permissions rwxrwxr-x (775 octal). For reasons of security, KEXTs must be owned by the root user and the wheel group. Further, no component of a KEXT may be writable by any user other than root. (For more information on the correct permissions and ownership of KEXTs, see “Testing and Deploying Drivers”). Assuming root privileges by using the sudo command and running kextload -t on BadKEXT provides the following information, shown in Listing 7-2.

Listing 7-2  kextload -t outputs errors found in KEXT

[computer_name:] user_name% sudo kextload -t BadKEXT.kext
Password:
can't add kernel extension BadKEXT.kext (validation error) (run kextload
                            on this kext with -t for diagnostic output)
kernel extension BadKEXT.kext has problems:
Validation failures
{
    "Info dictionary missing required property/value" = {
        "IOKitPersonalities:BadKEXT:IOProviderClass"
    }
    "Info dictionary property value is of illegal type" = {
        "IOKitPersonalities:BadKEXT:IOKitDebug"
    }
}
Authentication failures
{
    "File owner/permissions are incorrect" = {
    "/Users/user_name/Projects/BadKEXT/build/BadKEXT.kext"
    "/Users/user_name/Projects/BadKext/build/BadKEXT.kext/Contents/Info.plist"
    "/Users/user_name/Projects/BadKext/build/BadKEXT.kext/Contents"
    "/Users/user_name/Projects/BadKext/build/BadKEXT.kext/Contents/MacOS/BadKEXT"
        "/Users/user_name/Projects/BadKext/build/BadKEXT.kext/Contents/MacOS"
    }
}
Missing dependencies
{
    "com.apple.kernel.libker" =
        "No valid version of this dependency can be found"
}

The checks kextload performs are listed below.

  • Info.plist dictionary must be present and must be of type dictionary.

  • CFBundleIdentifier property must be of type string and no longer than 63 characters.

  • CFBundleVersion property must be a proper “vers” style string.

  • If the optional OSBundleCompatibleVersion property is present, it must be a proper “vers” style string with a value less than or equal to the CFBundleVersion value.

  • CFBundleExecutable must be of type string.

  • If the IOKitPersonalities property is present (the IOKitPersonalities property is required for driver matching but isn’t required for the KEXT to be valid), all its values must be of type dictionary. kextload performs the following checks on the personalities.

    • IOClass property must be of type string.

    • IOProviderClass property must be of type string.

    • CFBundleIdentifier must be of type string.

    • IOKitDebug property, if present, must be of type integer (and not a floating-point number).

    • All property types must be valid for in-kernel use; for example, the date type is not allowed in the kernel.

  • OSBundleLibraries must be present and be of type dict. The OSBundleLibraries values must exist and have valid versions.

  • If all personalities have the IOKitDebug property set to a nonzero value, kextload marks the KEXT an ineligible for safe boot, even if the OSBundleRequired property is present and valid.

If the -t option doesn’t provide you with enough information, you can tell kextload to give you a step-by-step account of its actions by using the -v option followed by a number from 1 to 6 indicating how much information you want. In this way, you can see at what point kextload fails.

Unloading Your Driver

The kextunload tool simply tries to unload your driver’s code and personalities, terminating and unregistering all I/O Kit objects associated with your driver. Unfortunately, kextunload provides very little information about why your driver might fail to unload. It’s possible that the system log (which you can view at /var/log/system.log) will display some information about a failure to unload, but you’ll get the most useful information using the other methods and tools described in this section.

A driver that has a mismatched retain or release, one that is depended on by other kernel modules, or one that has a failing stop method will fail to unload. Whatever the cause, if your driver fails to unload, the debugging process becomes arduous because you are forced to restart the computer each time you need to load and unload your KEXT.

In general, you should make only incremental changes in your driver’s code, checking its ability to load and unload after every change, to narrow down the possible culprits if your driver suddenly refuses to unload. In this way, you can revert to a previous, unloadable version of your driver and examine the changes you’ve made to it.

You can determine if your driver cannot unload because another module is depending on it by using the kextstat tool to display information about all currently loaded KEXTs. Listing 7-3 shows a few lines of the output from kextstat.

Listing 7-3  Partial listing of kextstat output

Index Refs Address  Size    Wired Name          (Version) <Linked Against>
 1   1   0x0        0x0     0x0  com.apple.kernel (6.0)
 2   8   0x0        0x0     0x0  com.apple.kernel.bsd (6.0)
 3  33   0x0        0x0     0x0  com.apple.kernel.iokit (6.0)
 4  32   0x0        0x0     0x0  com.apple.kernel.libkern (6.0)
// Some lines not shown...
11   7   0xaadd000  0x9000  0x8000 com.apple.iokit.IOPCIFamily (1.2) <4 3>
// Some lines not shown...
26   6   0xb14c000  0x1a000 0x19000 com.apple.iokit.IOUSBFamily (1.2) <4 3 2>
// Some lines not shown...

The first column of the kextstat output is the load index of the KEXT. kextstat uses this index in the last column to show which KEXTs a particular KEXT is linked against. For example, the IOPCIFamily KEXT, index 11 in Listing 7-3, depends on both com.apple.kernel.iokit (index 3) and com.apple.kernel.libkern (index 4).

The second column is the sum of all the references to the selected KEXT by other currently loaded KEXTs. Check this column if your driver fails to unload: If its number of references is nonzero, kextunload cannot unload it.

If there are no other KEXTs depending on your driver yet it fails to unload, the problem is often a mismatched retain or release on an object instantiated from one of your classes. If you have multiple classes defined in your KEXT, you can use ioclasscount to find out which class has outstanding instances. In a Terminal window, type

ioclasscount MyClassName

The ioclasscount tool outputs the instance count of the specified object, offset by the number of direct subclasses that have at least one instance allocated. With this information, you then go through your driver’s code looking for a retain on your object that’s not balanced with a release. A mismatched retain can be difficult to find because often it is not explicitly in your code, but in a container class you’ve put your object into, in a method you’ve called, or in a still-running user application that holds an io_object_t reference on the object. For example, calling

addEventSource ( My_Event_Source_Object )

implicitly places a retain on My_Event_Source_Object that remains until you call

removeEventSource ( My_Event_Source_Object )

Debugging Your Driver’s start and probe Methods

When you’ve determined that your KEXT can load and unload successfully, you may want to debug your driver’s start and probe methods. To do this, you use kextload. This is because installing your KEXT in /System/Library/Extensions and allowing kextd to load it when needed does not present you with the opportunity to attach to the computer in a gdb session before your driver has already matched and started.

If you are developing an I/O Kit driver, you can use kextload with the -l option to load your KEXT binary on the target computer but to refrain from sending any personalities to the kernel. By separating loading from matching, kextload -l allows you to defer matching (and the calling of your driver’s start and probe methods) until after you set up a two-machine debugging session (for more information on how to set up such a session, see “Setting Up for Two-Machine Debugging”).

To avoid matching before you’re ready, you can invoke kextunload on your driver before using kextload -l to make sure none of your driver’s personalities remain in the kernel from an earlier debugging session. Be aware, however, that plugging in a new device after you’ve used kextload -l to load your KEXT binary may cause some other driver to match on the device. If this is not desirable, take care not to change your system’s configuration until you are ready.

After you’ve attached to the target computer and set breakpoints on the start and probe methods, you use kextload with the -m option to send the personalities to the kernel, which triggers matching. If your KEXT’s start and probe methods do not execute, a successful match did not occur and you should examine the driver personalities to find out why. See “Debugging Matching Problems” for advice on how to debug matching problems.

If you are developing a non–I/O Kit KEXT, such as an NKE (network kernel extension) or a file system KEXT, you implement the MODULE_START method instead of the start method an I/O Kit driver implements. To debug the MODULE_START method, you use kextload with the -i or -I options to interactively load the KEXT and its dependencies, pausing after each step for your permission to continue.

Using the -i option tells kextload to automatically load your KEXT’s dependencies but to ask for permission to continue at each step in the process of loading the KEXT itself. The -I option causes kextload to pause for input at each step of every stage of loading, including the loading of dependencies. If you want to debug your KEXT’s MODULE_START method, you use kextload -i or -I to load your KEXT binary and then set up a two-machine debugging session before you give permission to invoke the MODULE_START method. Listing 7-4 shows the loading of an example KEXT called MyKEXT using kextload -i.

Listing 7-4  Using kextload -i for interactive KEXT loading

[computer_name:/tmp] user_name% sudo kextload -i MyKEXT.kext
Password:
Load extension MyKEXT.kext and its dependencies [Y/n]? y
 
Load module /tmp/MyKEXT.kext/Contents/MacOS/MyKEXT [Y/n]? y
kextload: module com.MySoftwareCompany.driver.MyKEXT created as #70 at
                                        address 0xaf5c000, size 8192
kextload: You can now break to the debugger and set breakpoints for this
                                                            extension.
 
Start module /tmp/MyKEXT.kext/Contents/MacOS/MyKEXT (answering no will abort
                                                    the load) [Y/n]? y
kextload: started module /tmp/MyKEXT.kext/Contents/MacOS/MyKEXT
kextload: MyKEXT.kext loaded successfully
 
Send personality "MyKEXT" to the kernel [Y/n]? y
kextload: matching started for MyKEXT.kext
[computer_name:/tmp] user_name%

Debugging Matching Problems

Getting your driver to match correctly can be one of the more challenging aspects of I/O Kit driver development. If your driver loads and unloads properly but does not match, the most important thing you can do is become thoroughly familiar with your family’s matching language. Although the I/O Kit implements driver matching in a three-phase, subtractive process (described in detail inI/O Kit Fundamentals), each I/O Kit family is free to define an arbitrarily complex matching language of its own.

After the I/O Kit discards drivers of the wrong class in the class-matching phase, the I/O Kit examines the remaining drivers for family-specific matching properties in the passive-matching phase. If the family implements family-specific matching, it considers the candidate driver’s personality and raises or lowers the driver’s probe score based on the matches it finds.

Although some families define no family-specific matching, instead preferring drivers to probe the device, many others exercise some control over the passive-matching phase. The USB family, for example, requires its drivers to use specific combinations of properties defined by the USB Common Class Specification. In order for your USB driver to match, you must include all the elements of exactly one property combination. Another example is the PCI family which defines a number of match keys whose values are the contents of various PCI configuration-space registers. A candidate PCI driver can then match on a single register value, a list of values, or even a partial register value indicated by a supplied bit mask.

If your driver does not match, first make sure the property values in your driver’s personality match the corresponding properties your device publishes in the I/O Registry. You can use the I/O Registry Explorer application (available at /Developer/Applications) or the command-line tool ioreg to view the properties your device publishes. If you use I/O Registry Explorer, select Find to search for your device name, follow the path to the object representing your device, and view its properties in the lower window. Alternately, to view the I/O Registry in a Terminal window, type

ioreg -bl

The -b option displays the object names in bold and the -l option displays the object properties.

If there are no discrepancies between your device’s published properties and your driver’s matching properties, next make sure you know how your driver’s family implements matching. Families that define their own matching language do so by implementing the matchPropertyTable method. By viewing the code for this method, you can determine how your driver’s family uses the properties it finds in a driver’s personality dictionary to adjust the probe score.

In very rare cases, a driver might declare IOResources as the value of its IOProviderClass key. IOResouces is a special nub attached to the root of the I/O Registry that makes resources, such as the BSD kernel, available throughout the system. Traditionally, drivers of virtual devices match on IOResources because virtual devices do not publish nubs of their own. Another example of such a driver is the HelloIOKit KEXT (described in Hello I/O Kit: Creating a Device Driver With Xcode) which matches on IOResources because it does not control any hardware.

Finally, you can place the IOKitDebug property in your driver’s personality dictionary. When you give this property a non-zero value, it places the status of various events, such as attaching, matching, and probing, into the system log (available at /var/log/system.log). The value of the IOKitDebug property defines which events are logged; see IOKitDebug.h (available in /System/Library/Frameworks/Kernel.framework/Headers/IOKit) for the I/O Kit–defined values. Unless you are interested in only a single event or a particular set of events, set the value to 65535 to get information about all the events IOKitDebug covers. With this information, you can see which events occurred and whether they succeeded or failed, but not why they failed.

Although the IOKitDebug property can be useful during development, be sure to set its value to 0 (or remove the property altogether) before shipping your driver because a non-zero value will prevent your driver from loading during a safe-boot. For more information on safe booting, see the Loading Kernel Extensions at Boot Time section of Kernel Extension Programming Topics.

Two-Machine Debugging

For debugging a device driver or indeed any code that resides in the kernel, two computers are a necessity. Buggy kernel code has a nasty tendency to crash or hang a system, and so directly debugging that system is often impossible.

Two-machine debugging with gdb is the main pathway to finding bugs in driver code. This section takes you through the procedure for setting up two computers for debugging, offers a few tips on using gdb in kernel code, introduces you to the kernel debugging macros, and discusses techniques for finding bugs causing kernel panics and hangs.

Setting Up for Two-Machine Debugging

This section summarizes the steps required to set up two computers for kernel debugging. It draws heavily on the following documents, which you should refer to for more detailed information:

In two-machine debugging, one system is called the target computer and the other the host (or development) computer. The host computer is the computer that actually runs gdb. It is typically the computer on which your driver is developed—hence, it’s also referred to as the development computer. The target computer is the system on which the driver to be debugged is run. Your host and target computers should be running the same version of the Darwin kernel, or as close as possible to same version. (Of course, if you’re debugging a panic-prone version of the kernel, you’ll want the host computer to run the most recent stable version of Darwin.) For optimal source-level debugging, the host computer should have the source code of the driver, any kernel extensions related to your driver (such as its client or provider), and perhaps even the kernel itself (/xnu).

In order for two-machine debugging to be feasible, the following must be true:

  • For versions of OS X before OS X version 10.2, both computers must be on the same subnet.

  • You must have login access to both computers as an administrator (group admin), because you’ll need root privileges to load your KEXT (you can use the sudo command).

  • You must be able to copy files between the computers using FTP, scp (SSH), rsync, AFP, or similar protocol or tool.

When all this is in place, complete the following steps:

  1. Target Set the NVRAM debug variable to 0x144, which lets you drop into the debugger upon a non-maskable interrupt (NMI) and, if you’re running OS X v. 10.2 or later, lets you debug two computers not on the same subnet. You can use setenv to set the flag within Open Firmware itself (on PowerPC-based Macintosh computers), or you can use the nvram utility. For the latter, enter the following as root at the command line:

    1. nvram boot-args="debug=0x144"

    It’s a good idea to enter nvram boot-args (no argument) first to get any current NVRAM variables in effect; then include these variables along with the debug flag when you give the nvram boot-args command a second time. Reboot the system.

  2. Host or Target Copy the driver (or any other kernel extension) to a working directory on the target computer.

  3. Host Set up a permanent network connection to the target computer via ARP. The following example assumes that your test computer is target.goober.com:

    $ ping -c 1 target.goober.com
    ping results: ....
    $ arp -an
    target.goober.com (10.0.0.69): 00:a0:13:12:65:31
    $ arp -s target.goober.com 00:a0:13:12:65:31
    $ arp -an
    target.goober.com (10.0.0.69) at00:a0:13:12:65:31 permanent

    This sequence of commands establishes a connection to the target computer (via ping), displays the information on recent connections ARP knows about (arp -an), makes the connection to the target computer permanent by specifying the Ethernet hardware address (arp -s), and issues the arp -an command a second time to verify this.

  4. Target Create symbol files for the driver and any other kernel extensions it depends on. First create a directory to hold the symbols; then run the kextload command-line tool, specifying the directory as the argument of the -s option:

    $ kextload -l -s /tmp/symbols /tmp/MyDriver.kext

    This command loads MyDriver.kext but, because of the -l option, doesn’t start the matching process yet (that happens in a later step). If you don’t want the driver to load just yet, specify the -n option along with the -s option. See “Using kextload, kextunload, and kextstat” for the kextload procedure for debugging a driver’s start-up code.

  5. Target or Host Copy the symbol files to the host computer.

  6. Host Optionally, if you want to debug your driver with access to all the symbols in the kernel, obtain or build a symboled kernel. For further information, contact Apple Developer Technical support. You can find the instructions for building the Darwin kernel from the Open Source code in the Building and Debugging Kernels in Kernel Programming Guide.

  7. Host Run gdb on the kernel.

    $ gdb /mach_kernel

    If you have a symboled kernel, specify the path to it rather than /mach_kernel. It is important that you run gdb on a kernel of the same version and build as the one that runs on the target computer. If the versions are different, you should obtain a symboled copy of the target’s kernel and use that.

  8. Host In gdb, add the symbol file of your driver.

    (gdb) add-symbol-file /tmp/symbols/com.acme.driver.MyDriver.sym

    Add the symbol files of the other kernel extensions in your driver’s dependency chain.

  9. Host Tell gdb that you will be debugging remotely.

    (gdb) target remote-kdp
  10. Target Break into kernel debugging mode. Depending on the model of your target system, either issue the appropriate keyboard command or press the programmer’s button. On USB keyboards, hold down the Command key and the Power button; on ADB keyboards, hold down the Control key and the Power button. If you’re running OS X version 10.4 or later, hold down the following five keys: Command, Option, Control, Shift, and Escape.

    You may have to hold down the keys or buttons for several seconds until you see the “Waiting for remote debugger connection” message.

  11. Host Attach to the target computer and set breakpoints.

    (gdb) attach target.goober.com
    (gdb) break 'MyDriverClass::WriteData(* char)'
    (gdb) continue

    Be sure you give the continue command; otherwise the target computer is unresponsive.

  12. Target Start the driver running.

    $ kextload -m -t /tmp/MyDriver.kext

    The -m option starts the matching process for the driver. The -t option, which tells kextload to conduct extensive validation checks, is really optional here; ideally, your driver should have passed these checks during an earlier stage of debugging (see “Using kextload, kextunload, and kextstat”). After starting the driver, perform the actions necessary to trigger the breakpoint.

  13. Host When the breakpoint you set is triggered, you can begin debugging your driver using gdb commands. If you “source” the kernel debugging macros (see the following section, “Using the Kernel Debugging Macros”), you can use those as well.

Using the Kernel Debugging Macros

Apple includes a set of kernel debugging macros as part of Darwin. They have been written by engineers with an intimate knowledge of how the Darwin kernel works. Although it is possible to debug driver code without these macros, they will make the task much easier.

The kernel debugging macros probe the internal structures of a running OS X system in considerable depth. With them you can get summary and detailed snapshots of tasks and their threads in the kernel, including such information as thread priority, executable names, and invoked functions. The kernel debugging macros also yield information on the kernel stacks for all or selected thread activations, on IPC spaces and port rights, on virtual-memory maps and map entries, and on allocation zones. See Table 7-2 for a summary of the kernel debugging macros.

You can obtain the kernel debugging macros from the Darwin Open Source repository. They are in the .gdbinit file in the /xnu/osfmk branch of the source tree. Because .gdbinit is the standard name of the initialization file for gdb, you might already have your own .gdbinit file to set up your debugging sessions. If this is the case, you can combine the contents of the files or have a “source” statement in one .gdbinit file that references the other file. To include the macros in a .gdbinit file for a debugging session, specify the following gdb command shortly after running gdb on mach_kernel:

(gdb) source /tmp/.gdbinit

(In this example, /tmp represents any directory that holds the copy of the .gdbinit file you obtained from the Open Source repository.) Because the kernel debugging macros can change between versions of the kernel, make sure that you use the macros that match as closely as possible the version of the kernel you’re debugging.

Table 7-2  Kernel debugging macros

Macro

Description

showalltasks

Displays a summary listing of tasks

showallacts

Displays a summary listing of all activations

showallstacks

Displays the kernel stacks for all activations

showallvm

Displays a summary listing of all the VM maps

showallvme

Displays a summary listing of all the VM map entries

showallipc

Displays a summary listing of all the IPC spaces

showallrights

Displays a summary listing of all the IPC rights

showallkmods

Displays a summary listing of all the kernel extension binaries

showtask

Displays status of the specified task

showtaskacts

Displays the status of all activations in the task

showtaskstacks

Displays all kernel stacks for all activations in the task

showtaskvm

Displays status of the specified task's VM map

showtaskvme

Displays a summary list of the task's VM map entries

showtaskipc

Displays status of the specified task's IPC space

showtaskrights

Displays a summary list of the task's IPC space entries

showact

Displays status of the specified thread activation

showactstack

Displays the kernel stack for the specified activation

showmap

Displays the status of the specified VM map

showmapvme

Displays a summary list of the specified VM map's entries

showipc

Displays the status of the specified IPC space

showrights

Displays a summary list of all the rights in an IPC space

showpid

Displays the status of the process identified by PID

showproc

Displays the status of the process identified by a proc pointer

showkmod

Displays information about a kernel extension binary

showkmodaddr

Given an address, displays the kernel extension binary and offset

zprint

Displays zone information

paniclog

Displays the panic log information

switchtoact

Switch thread context

switchtoctx

Switch context

resetctx

Reset context

A subset of the kernel debugging macros are particularly useful for driver writers: showallstacks, switchtoact, showkmodaddr, showallkmods, and switchtoctx. The output of showallstacks lists all tasks in the system and, for each task, the threads and the stacks associated with each thread. Listing 7-5 shows the information on a couple tasks as emitted by showallstacks.

Listing 7-5  Example thread stacks shown by showallstacks

(gdb) showallstacks
...
task        vm_map      ipc_space  #acts   pid  proc        command
0x00c1e620  0x00a79a2c  0x00c10ce0    2     51  0x00d60760  kextd
            activation  thread      pri  state  wait_queue  wait_event
            0x00c2a1f8  0x00ccab0c   31  W      0x00c9fee8  0x30a10c <ipc_mqueue_rcv>
                        continuation=0x1ef44 <ipc_mqueue_receive_continue>
            activation  thread      pri  state  wait_queue  wait_event
            0x00c29a48  0x00cca194   31  W      0x00310570  0x30a3a0 <kmod_cmd_queue>
                kernel_stack=0x04d48000
                stacktop=0x04d4bbe0
                0x04d4bbe0  0xccab0c
                0x04d4bc40  0x342d8 <thread_invoke+1104>
                0x04d4bca0  0x344b4 <thread_block_reason+212>
                0x04d4bd00  0x334e0 <thread_sleep_fast_usimple_lock+56>
                0x04d4bd50  0x81ee0 <kmod_control+248>
                0x04d4bdb0  0x45f1c <_Xkmod_control+192>
                0x04d4be00  0x2aa70 <ipc_kobject_server+276>
                0x04d4be50  0x253e4 <mach_msg_overwrite_trap+2848>
                0x04d4bf20  0x257e0 <mach_msg_trap+28>
                0x04d4bf70  0x92078 <.L_syscall_return>
                0x04d4bfc0  0x10000000
                stackbottom=0x04d4bfc0
 
task        vm_map      ipc_space  #acts   pid  proc        command
0x00c1e4c0  0x00a79930  0x00c10c88    1     65  0x00d608c8  update
            activation  thread      pri  state  wait_queue  wait_event
            0x00ddaa50  0x00ddbe34   31  W      0x00310780  0xd608c8 <rld_env+10471956>
                        continuation=0x1da528 <_sleep_continue>

The typical number of stacks revealed by showallstacks runs into the dozens. Most of the threads associated with these stacks are asleep, blocked on continuation (as is that for the second task shown in the above example). Stacks such as these you can usually ignore. The remaining stacks are significant because they reflect the activity going on in the system at a particular moment and context (as happens when an NMI or kernel panic occurs).

Thread activations and stacks in the kernel—including those of drivers—belong to the task named kernel_task (under the command column). When you’re debugging a driver, you look in the active stacks in kernel_task for any indication of your driver or its provider, client, or any other object it communicates with. If you add the symbol files for these driver objects before you begin the debugging session, the indication will be much clearer. Listing 7-6 shows an active driver-related thread in kernel_task in the context of adjacent threads.

Listing 7-6  Kernel thread stacks as shown by showallstacks

       activation  thread      pri  state  wait_queue  wait_event
       0x0101ac38  0x010957e4   80  UW     0x00311510  0x10b371c <rld_env+13953096>
                 continuation=0x2227d0 <_ZN10IOWorkLoop22threadMainContinuationEv>
       activation  thread      pri  state  wait_queue  wait_event
       0x0101aaf0  0x01095650   80  R
             stack_privilege=0x07950000
             kernel_stack=0x07950000
             stacktop=0x07953b90
             0x07953b90  0xdf239e4 <com.apple.driver.AppleUSBProKeyboard + 0x19e4>
             0x07953be0  0xe546694 <com.apple.iokit.IOUSBFamily + 0x2694>
             0x07953c40  0xe5a84b4 <com.apple.driver.AppleUSBOHCI + 0x34b4>
             0x07953d00  0xe5a8640 <com.apple.driver.AppleUSBOHCI + 0x3640>
             0x07953d60  0xe5a93bc <com.apple.driver.AppleUSBOHCI + 0x43bc>
             0x07953df0  0x2239a8 <_ZN22IOInterruptEventSource12checkForWorkEv+18>
             0x07953e40  0x222864 <_ZN10IOWorkLoop10threadMainEv+104>
             0x07953e90  0x2227d0 <_ZN10IOWorkLoop22threadMainContinuationEv>
             stackbottom=0x07953e90
       activation  thread      pri  state  wait_queue  wait_event
       0x0101b530  0x0101c328   80  UW     0x00311500  0x10b605c <rld_env+13963656>
                 continuation=0x2227d0 <_ZN10IOWorkLoop22threadMainContinuationEv>

You can use showallstacks in debugging panics, hangs, and wedges. For instance, it might reveal a pair of threads that are deadlocked against each other or it might help to identify a thread that is not handling interrupts properly, thus causing a system hang.

Another common technique using the kernel debugging macros is to run the showallstacks macro and find the stack or stacks that are most of interest. Then run the switchtoact macro, giving it the address of a thread activation, to switch to the context of that thread and its stack. From there you can get a backtrace, inspect frames and variables, and so on. Listing 7-7 shows this technique.

Listing 7-7  Switching to thread activation and examining it

(gdb) switchtoact 0x00c29a48
(gdb) bt
#0  0x00090448 in cswnovect ()
#1  0x0008f84c in switch_context (old=0xcca194, continuation=0, new=0xccab0c) at
/SourceCache/xnu/xnu-327/osfmk/ppc/pcb.c:235
#2  0x000344b4 in thread_block_reason (continuation=0, reason=0) at
/SourceCache/xnu/xnu-327/osfmk/kern/sched_prim.c:1629
#3  0x000334e0 in thread_sleep_fast_usimple_lock (event=0xeec500, lock=0x30a3ac,
interruptible=213844) at /SourceCache/xnu/xnu-327/osfmk/kern/sched_prim.c:626
#4  0x00081ee0 in kmod_control (host_priv=0xeec500, id=4144, flavor=213844,
data=0xc1202c, dataCount=0xc12048) at /SourceCache/xnu/xnu-327/osfmk/kern/kmod.c:602
#5  0x00045f1c in _Xkmod_control (InHeadP=0xc12010, OutHeadP=0xc12110) at
mach/host_priv_server.c:958
#6  0x0002aa70 in ipc_kobject_server (request=0xc12000) at
/SourceCache/xnu/xnu-327/osfmk/kern/ipc_kobject.c:309
#7  0x000253e4 in mach_msg_overwrite_trap (msg=0xf0080dd0, option=3, send_size=60,
rcv_size=60, rcv_name=3843, timeout=12685100, notify=172953600, rcv_msg=0x0,
scatter_list_size=0) at /SourceCache/xnu/xnu-327/osfmk/ipc/mach_msg.c:1601
#8  0x000257e0 in mach_msg_trap (msg=0xeec500, option=13410708, send_size=213844,
rcv_size=4144, rcv_name=172953600, timeout=178377984, notify=256) at
/SourceCache/xnu/xnu-327/osfmk/ipc/mach_msg.c:1853
#9  0x00092078 in .L_syscall_return ()
#10 0x10000000 in ?? ()
Cannot access memory at address 0xf0080d10
(gdb) f 4
#4  0x00081ee0 in kmod_control (host_priv=0xeec500, id=4144, flavor=213844,
data=0xc1202c, dataCount=0xc12048) at /SourceCache/xnu/xnu-327/osfmk/kern/kmod.c:602
602                     res = thread_sleep_simple_lock((event_t)&kmod_cmd_queue,
(gdb) l
597                 simple_lock(&kmod_queue_lock);
598
599                 if (queue_empty(&kmod_cmd_queue)) {
600                     wait_result_t res;
601
602                     res = thread_sleep_simple_lock((event_t)&kmod_cmd_queue,
603                                        &kmod_queue_lock,
604                                        THREAD_ABORTSAFE);
605                     if (queue_empty(&kmod_cmd_queue)) {
606                         // we must have been interrupted!

Remember that when use the switchtoact that you’ve actually changed the value of the stack pointer. You are in a different context than before. If you want to return to the former context, use the resetctx macro.

The showallkmods and showkmodaddr macros are also useful in driver debugging. The former macro lists all loaded kernel extensions in a format similar to the kextstat command-line utility (Listing 7-8 shows a few lines of output). If you give the showkmodaddr macro the address of an “anonymous” frame in a stack, and if the frame belongs to a driver (or other kernel extension), the macro prints information about the kernel extension.

Listing 7-8  Sample output from the showallkmods macro

(gdb) showallkmods
kmod        address     size        id      refs    version  name
0x0ebc39f4  0x0eb7d000  0x00048000  71      0       3.2  com.apple.filesystems.afpfs
0x0ea09480  0x0ea03000  0x00007000  70      0       2.1  com.apple.nke.asp_atp
0x0e9e0c60  0x0e9d9000  0x00008000  69      0       3.0  com.apple.nke.asp_tcp
0x0e22b13c  0x0e226000  0x00006000  68      0       1.2  com.apple.nke.IPFirewall
0x0e225600  0x0e220000  0x00006000  67      0       1.2  com.apple.nke.SharedIP
0x0df5d868  0x0df37000  0x00028000  62      0       1.2  com.apple.ATIRage128
0x0de96454  0x0de79000  0x0001e000  55      3       1.3 com.apple.iokit.IOAudioFamily
...

Tips on Using gdb

If you hope to become proficient at I/O Kit driver debugging, you’ll have to become proficient in the use of gdb. There’s no getting around this requirement. But even if you are already familiar with gdb, you can always benefit from insights garnered by other driver writers from their experience.

Examining Computer Instructions

If you don’t have symbols for a driver binary—and even if you do—you should try examining the computer instructions in memory to get a detailed view of what is going on in that binary. You use the gdb command x to examine memory in the current context; usually, x is followed by a slash (“/”) and one to three parameters, one of which is i. The examine-memory parameters are:

  • A repeat count

  • The display format: s (string), x (hexadecimal), or i (computer instruction)

  • The unit size: b (byte), h (halfword), w (word—four bytes), g (giant word—eight bytes)

For example, if you want to examine 10 instructions before and 10 instructions after the current context (as described in “Tips on Debugging Panics”), you could issue a command such as:

(gdb) x/20i $pc -40

This command says “show me 20 instructions, but starting 40 bytes” (4 bytes per instruction) “before the current address in the program counter” (the $pc variable). Of course, you could be less elaborate and give a simple command such as:

(gdb) x/10i 0x001c220c

which shows you 10 computer instructions starting at a specified address. Listing 7-9 shows you a typical block of instructions.

Listing 7-9  Typical output of the gdb “examine memory” command

(gdb) x/20i $pc-40
0x8257c <kmod_control+124>:     addi    r3,r27,-19540
0x82580 <kmod_control+128>:     bl      0x8d980 <get_cpu_data>
0x82584 <kmod_control+132>:     addi    r0,r30,-19552
0x82588 <kmod_control+136>:     lwz     r31,-19552(r30)
0x8258c <kmod_control+140>:     cmpw    r31,r0
0x82590 <kmod_control+144>:     bne+    0x825c0 <kmod_control+192>
0x82594 <kmod_control+148>:     mr      r3,r31
0x82598 <kmod_control+152>:     addi    r4,r27,-19540
0x8259c <kmod_control+156>:     li      r5,2
0x825a0 <kmod_control+160>:     bl      0x338a8
                                     <thread_sleep_fast_usimple_lock>
0x825a4 <kmod_control+164>:     lwz     r0,-19552(r30)
0x825a8 <kmod_control+168>:     cmpw    r0,r31
0x825ac <kmod_control+172>:     bne+    0x825c0 <kmod_control+192>
0x825b0 <kmod_control+176>:     addi    r3,r27,-19540
0x825b4 <kmod_control+180>:     bl      0x8da00 <fast_usimple_lock+32>
0x825b8 <kmod_control+184>:     li      r3,14
0x825bc <kmod_control+188>:     b       0x82678 <kmod_control+376>
0x825c0 <kmod_control+192>:     lis     r26,49
0x825c4 <kmod_control+196>:     li      r30,0
0x825c8 <kmod_control+200>:     lwz     r0,-19552(r26)

Needless to say, you need to know some assembler in order to make sense of the output of the examine-memory command. You don’t need to be an expert in assembler, just knowledgeable enough to recognize patterns. For example, it would be beneficial to know how pointer indirection with an object looks in computer instructions. With an object, there are two indirections really, one to get the data (and that could be null) and one to an object’s virtual table (the first field inside the object). If that field doesn’t point to either your code or kernel code, then there’s something that might be causing a null-pointer exception. If your assembler knowledge is rusty or non-existent, you can examine the computer instructions for your driver’s code that you know to be sound. By knowing how “healthy” code looks in assembler, you’ll be better prepared to spot divergences from the pattern.

Breakpoints

Using breakpoints to debug code inside the kernel can be a frustrating experience. Often kernel functions are called so frequently that, if you put a breakpoint on a function, it’s difficult to determine which particular case is the one with the problem. There are a few things you can do with breakpoints to ameliorate this.

  • Conditional breakpoints. A conditional breakpoint tells gdb to trigger a breakpoint only if a certain expression is true. The syntax is cond<breakpoint index> <expression>. An example is the following:

    (gdb) cond 1 (num > 0)

    However, conditional breakpoints are very slow in two-machine debugging. Unless you’re expecting the breakpoint expression to be evaluated only a couple dozen times or so, they are probably too tedious to rely on.

  • Cooperative breakpoints. To speed things up you can use two breakpoints that cooperate with each other. One breakpoint is a conditional breakpoint set at the critical but frequently invoked function. The other breakpoint, which has a command list attached to it, is set at a point in the code which is only arrived at after a series of events has occurred. You initially disable the conditional breakpoint and the second breakpoint enables it at some point later where the context is more pertinent to the problem you’re investigating (and so you don’t mind the slowness of expression evaluation). The following series of gdb commands sets up both breakpoints:

    (gdb) cond 1 (num > 0)
    (gdb) disable 1
    (gdb) com 2
        enable 1
        continue
    end

    If this debugging is something you do frequently, you can put breakpoint-setup commands into a macro and put that macro in your .gdbinit file.

  • Breakpoints at dummy functions. You can use the previous two techniques on code that you do not own. However, if it’s your code that you’re debugging, the fastest way to trigger a breakpoint exactly when and where you want is to use dummy functions. Consider the follow stripped code snippet:

    void dummy() {
        ;
    }
     
    void myFunction() {
        // ...
        if (num > 0)
            dummy();
        // ...
    }

    The expression in myFunction is the same exact expression you would have in a conditional breakpoint. Just set a breakpoint on the dummy function. When the breakpoint is triggered, get a backtrace, switch frames, and you’re in the desired context.

Single-Stepping

Single-stepping through source code does not necessarily take you from one line to the next. You can bounce around in the source code quite a bit because the compiler does various things with the symbols to optimize them.

There are two things you can do to get around this. If it’s your code you’re stepping through, you can turn off optimizations. Or you can single-step through the computer instructions in assembler because one line of source code typically generates several consecutive lines of assembler. So, if you find it hard to figure things out by single-stepping through source, try single-stepping through assembler.

To single-step in gdb, use the stepi command (si for short). You can get a better view of your progress if you also use the display command, as in this example:

(gdb) display/4i $pc

This displays the program counter and the next three instructions as you step.

Debugging Kernel Panics

You might be familiar with kernel panics: those unexpected events that cripple a system, leaving it completely unresponsive. When a panic occurs on OS X, the kernel prints information about the panic that you can analyze to find the cause of the panic. On pre-Jaguar systems, this information appears on the screen as a black and white text dump. Starting with the Jaguar release, a kernel panic causes the display of a message informing you that a problem occurred and requesting that you restart your computer. After rebooting, you can find the debug information on the panic in the file panic.log at /Library/Logs/.

If you’ve never seen it before, the information in panic.log might seem cryptic. Listing 7-10 shows a typical entry in the panic log.

Listing 7-10  Sample log entry for a kernel panic

Unresolved kernel trap(cpu 0): 0x300 - Data access DAR=0x00000058 PC=0x0b4255b4
Latest crash info for cpu 0:
   Exception state (sv=0x0AD86A00)
      PC=0x0B4255B4; MSR=0x00009030; DAR=0x00000058; DSISR=0x40000000; LR=0x0B4255A0;
      R1=0x04DE3B50; XCP=0x0000000C (0x300 - Data access)
      Backtrace:
         0x0B4255A0 0x000BA9F8 0x001D41F8 0x001D411C 0x001D6B90 0x0003ACCC
         0x0008EC84 0x0003D69C 0x0003D4FC 0x000276E0 0x0009108C 0xFFFFFFFF
      Kernel loadable modules in backtrace (with dependencies):
         com.acme.driver.MyDriver(1.6)@0xb409000
Proceeding back via exception chain:
   Exception state (sv=0x0AD86A00)
      previously dumped as "Latest" state. skipping...
   Exception state (sv=0x0B2BBA00)
      PC=0x90015BC8; MSR=0x0200F030; DAR=0x012DA94C; DSISR=0x40000000; LR=0x902498DC;
      R1=0xBFFFE140; XCP=0x00000030 (0xC00 - System call)
 
Kernel version:
Darwin Kernel Version 6.0:
Wed May  1 01:04:14 PDT 2002; root:xnu/xnu-282.obj~4/RELEASE_PPC

This block of information has several different parts, each with its own significance for debugging the problem.

  • The first line. The single most important bit of information about a panic is the first line, which briefly describes the nature of the panic. In this case, the panic has something to do with a data access exception. The registers that appear on the same line as the message are the ones with the most pertinent data; in this case they are the DAR (Data Access Register) and the PC (Program Counter) registers. The registers shown on the first line vary according to the type of kernel trap. The hexadecimal code before the description, which is defined in /xnu/osfmk/ppc_init.c, indicates the exception type. Table 7-3 describes the possible types of exceptions.

Table 7-3  Types of kernel exceptions

Trap Value

Type of Kernel Trap

0x100

System reset

0x200

Computer check

0x300

Data access

0x400

Instruction access

0x500

External interrupt

0x600

Alignment exception

0x700

Illegal instruction

  • The registers. Under the first “Exception state” is a snapshot of the contents of the CPU registers when the panic occurred.

  • The backtrace. Each hexadecimal address in the backtrace indicates the state of program execution at a particular point leading up to the panic. Because each of these addresses is actually that of the function return pointer, you need to subtract four from this address to see the instruction that was executed.

  • The kernel extensions. Under “Kernel loadable modules in backtrace” are the bundle identifiers (CFBundleIdentifier property) of all kernel extensions referenced in the backtrace and all other kernel extensions on which these extensions have dependencies. These are the kernel extensions for which you’ll probably want to generate symbol files prior to debugging the panic.

  • The other exception states. Under “Proceeding back via exception chain:” are the previous exception states the kernel experienced, separated by snapshots of the contents of the CPU registers at the time the exceptions occurred. Most of the time, the first exception state (immediately following the first line of the panic log) gives you enough information to determine what caused the panic. Sometimes, however, the panic is the result of an earlier exception and you can examine the chain of exceptions for more information.

  • The kernel version. The version of the Darwin kernel and, more importantly, the build version of the xnu project (the core part of the Darwin kernel). If you are debugging with a symboled kernel (as is recommended), you need to get or build the symboled kernel from this version of xnu.

General Procedure

There are many possible ways to debug a kernel panic, but the following course of action has proven fruitful in practice.

  1. Get as many binaries with debugging symbols as possible.

    Make a note of all the kernel extensions listed under “Kernel loadable modules in backtrace”. If you don’t have debugging symbols for some of them, try to obtained a symboled version of them or get the source and build one with debugging symbols. This would include mach_kernel, the I/O Kit families, and other KEXTs that are part of the default install. You need to have the same version of the kernel and KEXT binaries that the panicked computer does, or the symbols won’t line up correctly.

  2. Generate and add symbol files for each kernel extension in the backtrace.

    Once you’ve got the kernel extension binaries with (or without) debugging symbols, generate relocated symbol files for each KEXT in the backtrace. Use kextload with the -s and -n options to do this; kextload prompts you for the load address of each kernel extension, which you can get from the backtrace. Alternatively, you can specify the -a option with -s when using kextload to specify KEXTs and their load addresses. Although you don’t need to relocate symbol files for all kernel extensions, you can only decode stack frames in the kernel or in KEXTs that you have done this for. After you run gdb on mach_kernel (preferably symboled), use gdb’s add-symbol-file command for each relocatable symbol files you’ve generated; see “Setting Up for Two-Machine Debugging” for details.

  3. Decode the addresses in the panic log.

    Start with the PC register and possibly the LR (Link Register). (The contents of the LR should look like a valid text address, usually a little smaller than the PC-register address.) Then process each address in the backtrace, remembering to subtract four from each of the stack addresses to get the last instruction executed in that frame. One possible way to go about it is to use a pair of gdb commands for each address:

    (gdb) x/i <address>-4
    ...
    (gdb) info line *<address>-4

    You need the asterisk in front of the address in the info command because you are passing a raw address rather than the symbol gdb expects. The x command, on the other hand, expects a raw address so no asterisk is necessary.

    Listing 7-11 gives an example of a symbolic backtrace generated from x/i <address>-4. You’ll know you’ve succeeded when all the stack frames decode to some sort of branch instruction in assembler.

  4. Interpret the results.

    Interpreting the results of the previous step is the hardest phase of debugging panics because it isn’t mechanical in nature. See the following section, “Tips on Debugging Panics,” for some suggestions.

Listing 7-11  Example of symbolic backtrace

(gdb) x/i 0x001c2200-4
0x1c21fc <IOService::PMstop(void)+320>: bctrl
0xa538260 <IODisplay::stop(IOService *)+36>:    bctrl
0x1bbc34 <IOService::actionStop(IOService *, IOService *)+160>: bctrl
0x1ccda4 <runAction__10IOWorkLoopPFP8OSObjectPvn3_iPB2Pvn3+92>: bctrl
0x1bc434 <IOService::terminateWorker(unsigned long)+1824>:      bctrl
0x1bb1f0 <IOService::terminatePhase1(unsigned long)+928>:       bl
0x1bb20c <IOService::scheduleTerminatePhase2(unsigned long)>
0x1edfcc <IOADBController::powerStateWillChangeTo(unsigned long, unsigned long, IOService *)+88>:       bctrl
0x1c54e8 <IOService::inform(IOPMinformee *, bool)+204>: bctrl
0x1c5118 <IOService::notifyAll(bool)+84>:       bl
0x1c541c <IOService::inform(IOPMinformee *, bool)>
0x1c58b8 <IOService::parent_down_05(void)+36>:  bl
0x1c50c4 <IOService::notifyAll(bool)>
0x1c8364 <IOService::allowCancelCommon(void)+356>:      bl
0x1c5894 <IOService::parent_down_05(void)>
0x1c80a0 <IOService::serializedAllowPowerChange2(unsigned long)+84>:    bl
0x1c8200 <IOService::allowCancelCommon(void)>
0x1ce198 <IOCommandGate::runAction(int (*)(OSObject *, void *, void *, void *, void *), void *, void *, void *, void *)+184>: bctrl
0x1c802c <IOService::allowPowerChange(unsigned long)+72>:       bctrl
0xa52e6c ????
0x3dfe0 <_call_thread_continue+440>:    bctrl
0x333fc <thread_continue+144>:  bctrl

Tips on Debugging Panics

The following tips might make debugging a panic easier for you.

  • As noted previously, always pay attention to the first line of the panic message and the list of kernel extensions involved in the panic. The panic message provides the major clue to the problem. With the kernel extensions, generate relocated symbol files for the debugging session.

  • Don’t assume that the panic is not your driver’s fault just because it doesn’t show up in the backtrace. Passing a null pointer to an I/O Kit family or any other body of kernel code will cause a panic in that code. Because the kernel doesn’t have the resources to protect itself from null pointers, drivers must be extremely vigilant against passing them in.

  • The showallstacks kernel debugging macro is very useful for debugging panics. See “Using the Kernel Debugging Macros” for more information.

  • Use gdb’s $pc variable when debugging panics with gdb. The $pc variable holds the value of the program counter, which identifies the place where the panic exception was taken. If you want to examine the context of the panic, you could issue a command such as:

    (gdb) x/20i $pc -40

    This displays 10 instructions in assembler before and after the point where the panic occurred. If you have the appropriate source code and symbols, you can enter:

    (gdb) l *$pc

    This shows the particular line of code that took the panic.

  • If you have a panic caused by an Instruction Access Exception (0x400) and the PC register is zero, it means that something in the kernel branched to zero. The top frame in the stack is usually a jump through some function pointer that wasn’t initialized (or that somehow got “stomped”).

  • Panics caused by a Data Access Exception 0x300 are quite common. These types of panics typically involve indirection through zero (in other words, a dereferenced null pointer). Any time a null pointer is dereferenced, a panic results. When you get a Data Access Exception, first check the DAR register; if the value is less than about 1000 the panic is probably the result of indirection through a null pointer. This is because most classes are no larger than about 1000 bytes and when an offset is added to a null pointer, the result is about 1000 or less. If the result is much larger, it’s probable that the location of a pointer has been trashed (as opposed to its contents) and the contents of an unknown location is being used as a pointer in your code.

    A null pointer implies the possibility of a race condition with a shutdown or completion value. When called, a completion routine starts to free resources and if your driver is referencing these resources after the routine is called, it will probably get null pointers back.

  • If you get a panic in kalloc, kmem, or IOMalloc, it suggests that you’re using a freed pointer. Even if your driver code doesn’t show up in the backtrace, your driver could be the culprit. Using a freed pointer is likely to break the kernel’s internal data structures for allocation.

  • Panics can also be caused by accidentally scribbling on someone else’s memory. These “land mines” can be notoriously hard to debug; a backtrace doesn’t show you much, except that something bad has happened. If the panic is reproducible, however, you have a chance to track down the offending code by using a “probe” macro. A probe macro helps to bracket exactly where in the code the scribbling happened. By definition, a scribble is a byte that does not have the expected contents (because they were altered by some other code). By knowing where the panic occurred and where the byte last held its expected value, you know where to look for the scribbling code.

    Often it’s the case that your driver is the offending scribbler. To find your offending code (if any), define a probe macro that tests whether a memory address (A in the example below) has the expected value (N). (Note that A is the address in the DAR register of the original scribble panic.) If A doesn’t have the expected value, then cause a panic right then and there:

    Uint32 N = 123;
    Uint32 *A;
    A = &N;
    // ...
    #define PROBE() do {
        if (A && *A != N)
            *(Uint32)0 = 0;
        } while (0)

    By putting the probe macro in every function where A and N appear, you can narrow down the location of the scribble. If your driver is not the one doing the scribbling, it still might be indirectly responsible because it could be causing other code to scribble. For example, your driver might be asking some other code to write in your address space; if it’s passed them the wrong address, it might result in data being scribbled on inside it.

Debugging System Hangs

System hangs are, after kernel panics, the most serious condition caused by badly behaved kernel code. A hung system may not be completely unresponsive, but it is unusable because you cannot effectively click the mouse button or type a key. You can categorize system hangs, and their probable cause, by the behavior of the mouse cursor.

  • Cursor doesn’t spin and won’t move. This symptom indicates that a primary interrupt is not being delivered. The mouse doesn’t even spin because that behavior is on a primary interrupt; its not spinning indicates that the system is in a very tight loop. What has probably happened is that a driver object has disabled an interrupt, causing code somewhere in the driver stack to go into an infinite loop. In other words, a piece of hardware has raised an interrupt but the driver that should handle it is not handling it and so the hardware keeps raising it. The driver has probably caused this “ghost” interrupt, but is unaware it has and so is not clearing it.

  • Cursor spins but won’t move. This symptom indicates that a high-priority thread such as a timer is spinning.

  • Cursor spins and moves, but nothing else. This symptom suggests that the USB thread is still scheduled, thus indicating a deadlock in some driver object that is not related to USB or HI.

For system hangs with the first symptom—the cursor doesn’t spin and won’t move—your first aim should be to find out what caused the interrupt. Why is the hardware controlled by your driver raising the interrupt? If your driver is using a filter interrupt event source (IOFilterInterruptEventSource), you might want to investigate that, too. With a filter event source a driver can ignore interrupts that it thinks aren’t its responsibility.

With any system hang, you should launch gdb on the kernel, attach to the hung system and run the showallstacks macro. Scan the output for threads that are deadlocked against each other. Or, if it’s an unhandled primary interrupt that you suspect, find the running thread; if it is the one that took the interrupt, it is probably the thread that’s gone into an infinite loop. If the driver is in an infinite loop, you can set a breakpoint in a frame of the thread’s stack that is the possible culprit; when you continue and hit the breakpoint almost immediately, you know you’re in an infinite loop. You can single-step from there to find the problem.

Debugging Boot Drivers

The OS X BootX booter copies drivers for hardware required in the boot process into memory for the kernel’s boot-time loading code to load. Because boot drivers are already loaded by the time the system comes up, you do not have as much control over them as you do over non-boot drivers. In addition, a badly behaving boot driver can cause your system to become unusable until you are able to unload it. For these reasons, debugging techniques for boot drivers vary somewhat from those for other drivers.

The most important step you can take is to treat your boot driver as a non-boot driver while you are in the development phase. Remove the OSBundleRequired property from your driver’s Info.plist file and use the techniques described in this chapter to make sure the driver is performing all its functions correctly before you declare it to be a boot driver.

After you’ve thoroughly tested your driver, add the OSBundleRequired property to its Info.plist (see the document Loading Kernel Extensions at Boot Time to determine which value your driver should declare). This will cause the BootX booter to load your driver into memory during the boot process.

If your boot driver does have bugs you were unable to find before, you cannot use gdb to debug it because it is not possible to attach to a computer while it is booting. Instead, you must rely on IOLog output to find out what is happening. IOLog is synchronous when you perform a verbose boot so you can use IOLog statements throughout your boot driver’s code to track down the bugs. See “Using IOLog” for more information on this function.

To perform a verbose boot, reboot holding down both the Command and V keys. To get even more detail from the I/O Kit, you can set a boot-args flag before rebooting. Assuming root privileges with the sudo command, type the following on the command line

%sudo nvram boot-args="io=0xffff"
Password:
%shutdown -r now

Although this technique produces voluminous output, it can be difficult to examine because it scrolls off the screen during the boot process. If your boot driver does not prevent the system from completing the boot process, you can view the information in its entirety in the system log at /var/log/system.log.

Logging

Logging, the capture and display of information at certain points in the code, is useful in some situations as a debugging tool because it provides a linear record of what happens during the execution of driver code. This section describes the I/O Kit’s primary logging function, IOLog, and discusses ways you can go about creating your own logging facility.

Using IOLog

It is natural to rely on printf statements to display what’s going on in your application code during the debugging phase. For in-kernel drivers, the I/O Kit equivalent to the printf statement is the IOLog function, defined in /System/Library/Frameworks/Kernel.framework/Headers/IOKit/IOLib.h. Unfortunately, IOLog is only marginally useful for debugging kernel-resident drivers.

Because it executes inside the kernel, IOLog is necessarily resource-frugal. The message buffer is not large and if you use IOLog to log I/O activity, for example, the older messages are likely to be overwritten by newer ones, erasing the information you need for debugging. In general, you cannot successfully use IOLog in any tight loop in your driver’s code, whether they are I/O loops or unintentional loops caused by errors in the code. The only way to ensure you get all the information you need in such a situation is to allocate your own buffer and create your own logging facility to write to it (for more information on this technique, see “Custom Event Logging ”).

If your driver is causing a panic, IOLog will not help much in determining the cause. Because the IOLog function is asynchronous, when the computer panics, the last IOLog message is still waiting to be scheduled for output so you will miss the message containing information about the cause of the panic.

You cannot call IOLog from an interrupt context. When you create an IOFilterInterruptEventSource object, you send it a filter function and an action function. You cannot use IOLog in your filter routine because it runs at primary (hardware) interrupt time when very little of the system is available. You can, however, call IOLog in your action routine because it runs on the work loop the event source is attached to. Note that when you create an IOInterruptEventSource object, you send it an action function that runs on the event source’s work loop, but no filter function. As with the action function for an IOFilterInterruptEventSource object, you can call IOLog in the action function for an IOInterruptEventSource object.

Although the IOLog function is not well-suited for use in the debugging process, it is useful for logging normal status and error messages. For example, the PhantomAudioDevice class (contained in the PhantomAudioDriver project at /Developer/Examples/Kernel/IOKit/Audio/PhantomAudioDriver) uses IOLog to log status in several of its methods. Listing 7-12 shows partial listings of PhantomAudioDevice’s createAudioEngines and volumeChanged methods.

Listing 7-12  Using IOLog in PhantomAudioDevice methods

bool PhantomAudioDevice::createAudioEngines()
{
    ...
    IOLog("PhantomAudioDevice[%p]::createAudioEngines()\n", this);
    audioEngineArray = OSDynamicCast(OSArray,
                                getProperty(AUDIO_ENGINES_KEY));
    if (audioEngineArray) {
        /* Create new audio engine with AudioEngine array in personality. */
        ...
    }
    else {
        IOLog("PhantomAudioDevice[%p]::createAudioEngines() - Error: no
                            AudioEngine array in personality.\n", this);
        goto Done;
    }
    ...
}
 
 
IOReturn PhantomAudioDevice::volumeChanged(IOAudioControl *volumeControl,
                                        SInt32 oldValue, SInt32 newValue)
{
    IOLog("PhantomAudioDevice[%p]::volumeChanged(%p, %ld, %ld)\n", this,
                                    volumeControl, oldValue, newValue);
...
}

Custom Event Logging

You can pepper your code with IOLog or printf calls as a debugging technique, but, as noted in the previous section (“Using IOLog”), this approach has its limitations. Some drivers are especially sensitive to any impact on performance, and require more reliable and finer-grained logging data. If your driver falls into this category, you can create your own logging facility.

There are various ways to implement a custom debug-logging solution that avoids the drawbacks of IOLog. You would need to allocate your own buffer for you log entries, implement functions and macros for writing those entries to the buffer, and devise some means for examining the contents of the buffer.

For purposes of illustration, this section looks at the logging feature of the AppleGMACEthernet driver. You can find the source code for this driver (including the logging tools) in the Darwin source code for OS X version 10.2.7 and above, available at http://www.opensource.apple.com/darwinsource/index.html. To find the source code, select a version of OS X equal to or greater than v10.2.7 and click Source (choose the source for the PPC version, if there's a choice). This displays a new page, which lists the open source projects available for the version of OS X you've chosen. Scroll down to AppleGMACEthernet and click it to view the source. Be prepared to supply your ADC member name and password.

The AppleGMACEthernet driver implements a kernel-resident logging infrastructure as well as a user-space tool to print the formatted contents of a log buffer. The tool used by this driver prints a report to the standard output that begins with header information followed by time-stamped log entries. You can redirect the output to a file and then add comments to the entries in the file. Listing 7-13 shows the first several seconds of the driver’s life (along with comments).

Listing 7-13  Sample output of a custom event log

 0  1376a020  13779fe0  1376a020    [ffffffe0]
  27   38e9000         0         1
 
 
  20:   39  291159         0         0  eNet
  30:   39  291266   19aec90   181c200  Strt
  40:   39  293857   19aec90   18a4000  KeyL
  50:   39  293899       100       100  =txq
  60:   39  293903       100       100  =txe
  70:   39  293907        40        40  =rxe
  80:   39  293908       100   1000040  parm
  90:   39  307103      1000  135f0000  =TxR
  a0:   39  307111       400   198b000  =RxR
  b0:   39  307119   17f4400   19c8f00  arys
  c0:   39  307120   19aec90         0  AttI
  d0:   39  307232   19aec90         0  RwPM
  e0:   39  307247   19aec90         2  mx4d
  f0:   39  307249   19aec90         2  ip4d
 100:   39  307316   19aec90         1  Pwr!
 110:   39  307436   19aec90   198e400  cfig
 120:   39  307465   19aec90         0  AttD
 130:   39  307598   19aec90         0  powr
 140:   39  307599   19aec90         0  RegS
 150:   39  310366   19aec90         0  Exit    end of start method
 160:   46  944895   191d400   198e400  NetE
 170:   46  944899   191d400         0  Wake    wake up ethernet cell
 180:   46  944938         0         0  +Clk
 190:   46  945093   199a5c0  1ee21000  Adrs
 1a0:   46  945096   19c8f00   198b000  IRxR
 1b0:   46  946111   191d400  135f0000  ITxR
 1c0:   46  946127         3      1010  wReg
 1d0:   46  946135         0        ff  ChpI    chip initialization
 1e0:   46  946136         4      9050

The report header consists of two lines. The first line identifies any flag in effect and gives the start, ending, and current addresses of the log buffer. The second line indicates the period (in seconds) spanned by the log entries, shows the physical address of the buffer, and tallies the numbers of alerts and buffer wrap-arounds since the driver started up.

Each log entry is identified by an offset (in hexadecimal) from the start of the buffer and a two-part time-stamp. The first part is the time in seconds since the driver started executing and the second is the time in microseconds (actually nanoseconds right-shifted by 10 to closely approximate microseconds). The final column contains four-character codes with special significance to the AppleGMACEthernet driver. The intermediate two columns show parameters whose value depends on the context indicated by the code.

The AppleGMACEthernet driver (in the UniEnet class) implements logging in a way that minimizes its effect on performance. The major steps it undertakes are the following:

  • It allocates a 64 kilobyte wrap-around buffer to hold the logging data and defines global pointers to that buffer.

  • It implements two low-level functions to log events to the buffer, one for general events and the other for alerts; alerts are supersets of events which first perform the log, then the IOLog, and then whatever you customize it to do.

  • For convenience, it also defines macros built on these functions.

  • It inserts macro calls at critical points in the code.

  • It implements a user-space tool to format and print the contents of the log buffer.

  • It implements a user client to send the buffer data to the tool.

Listing 7-14 shows the definitions of the event-log and alert functions and macros.

Listing 7-14  Definition of logging macros and functions

#if USE_ELG /* (( */
#define ELG(A,B,ASCI,STRING)    EvLog( (UInt32)(A), (UInt32)(B),
                                        (UInt32)(ASCI), STRING )
#define ALERT(A,B,ASCI,STRING)  Alert( (UInt32)(A), (UInt32)(B),
                                        (UInt32)(ASCI), STRING )
    void EvLog( UInt32 a, UInt32 b, UInt32 ascii, char* str );
    void Alert( UInt32 a, UInt32 b, UInt32 ascii, char* str );
#else /* ) not USE_ELG: (   */
#define ELG(A,B,ASCI,S)
#define ALERT(A,B,ASCI,STRING)  IOLog( "UniNEnet: %8x %8x " STRING "\n",
                                (unsigned int)(A), (unsigned int)(B) )
#endif /* USE_ELG )) */

If you’re curious, see the UniEnet class source for the implementations of EvLog and Alert.

The AppleGMACEthernet driver calls the ELG and ALERT macros at appropriate points in the source code. In the example in Listing 7-15, the ELG macro is called just after the configureInterface member function is invoked; the event code is ‘cfig’ and the parameters are the addresses of the current thread and the IONetworkInterface object passed in. (The final string parameter is printed in alerts and via IOLog if the custom debugging feature is disabled.)

Listing 7-15  Calling the logging macro

bool UniNEnet::configureInterface( IONetworkInterface *netif ) {
                                    IONetworkData *nd;
    ELG( IOThreadSelf(), netif, 'cfig', "configureInterface" );
 // ...

The report shown in Listing 7-13 includes a log entry from this macro call, illustrating how it might appear (it’s the sixteenth one).

Although it might be convenient to have a tool that nicely prints the contents of the log buffer, there is considerable time and effort involved in developing this tool and the user-client object that must feed it the logging data. Instead of a user-space tool, you can examine the contents of the log buffer in gdb. The contents of the buffer, however, will require more effort on your part to interpret.