Important: This document may not represent best practices for current development. Links to downloads and other resources may no longer be valid.
Shark is a tool for performance understanding and optimization. Why is it called “Shark”? Performance tuning requires a hunter’s mentality, and no animal is as pure in this quest as a shark. A shark is also an expert in its field, one that uses every available resource to achieve its goals. The name “Shark” embodies the spirit you should bring to tuning your code.
To help you analyze the performance of your code, Shark allows you to profile the entire system (kernel and drivers as well as applications). At the simplest level, Shark profiles the system while your code is running to see where time is being spent. It can also produce profiles of hardware and software performance events such as cache misses, virtual memory activity, memory allocations, function calls, or instruction dependency stalls. This information is an invaluable first step in your performance tuning effort so you can see which parts of your code or the system are the bottlenecks.
In addition to showing you where time is being spent, Shark can give you advice on how to improve your code. Shark is capable of identifying many common performance pitfalls and visually presents the costs of these problems to you.
The first and most important step when optimizing your code is to determine what to optimize. In a program of moderate complexity, there can be thousands of different code paths. Optimizing all of them is normally impractical due to deadlines and limited programmer resources. More subtle tradeoffs between optimization, portability, and maintainability further narrow the field of candidates.
Here are a few general guidelines for finding a good candidate for optimization:
It should be time-critical. This is generally any operation that is perceptibly slow; the user has to wait for the computer to finish doing something before continuing. Optimizing functionality that is already faster than the user can perceive is usually unnecessary.
It must be relevant. Optimizing functionality that is rarely used is usually counter-productive.
It shows up as a hot spot in a time profile. If there is no obvious hot spot in your code or you are spending a lot of time in system libraries, performance is more likely to improve through high-level improvements (architectural changes).
Low-level optimizations typically focus on a single segment of code and make it a better match to the hardware and software systems it is being run on. Examples of low-level optimizations include using vector or cache hint instructions. High-level optimizations include algorithmic or other architectural changes to your program. Examples of high-level optimizations include data structure choice (for example, switching from a linked list to a hash-table) or replacing calls to computationally expensive functions with a cache or lookup table.
Remember, it is critical to profile before investing your time and effort in optimization. Sadly, many programmers invest prodigious amounts of effort optimizing what their intuition tells them is the performance-critical section of code only to realize no performance improvement. Profiling quickly reveals that bottlenecks often lie far from where programmers might assume they are. Using Shark, you can focus your optimization efforts on both algorithmic changes and tuning performance-critical code. Often, even small changes to a critical piece of code can yield large overall performance improvements.
By default, Shark creates a profile of execution behavior by periodically interrupting each processor in the system and sampling the currently running process, thread, and instruction address, as well as the function callstack. Along with this contextual information, Shark can record the values of hardware and software performance counters. Each counter is capable of counting a wide variety of performance events. In the case of processor and memory controller counters, these include detailed, low-level information that is otherwise impossible to obtain without a simulator.

The overhead for sampling with Shark is extremely low because all sample collection takes place in the kernel and is based on hardware interrupts. A typical sample incurs an overhead on the order of 20μs. This overhead can be significantly larger if callstack recording is enabled and a virtual memory fault is incurred while saving the callstack. Time profiles generated by Shark are statistical in nature; they give a representative view of what was running on the system during a sampling session. Samples can include all of the processes running on the system from both user and supervisor code, or samples can be limited to a specific process or execution state.

Shark’s sampling period can be an arbitrary time interval (timer sampling). Shark can also use a performance event as the sampling trigger (event sampling). Using event sampling, it is possible to associate performance events such as cache misses or instruction stalls with the code that caused them. Additionally, Shark can generate exact profiles for specific function calls or memory allocations.
Organization of This Document
This manual is organized into four major sections, each consisting of two or three chapters, plus several appendices. Here is a brief “roadmap” to help you orient yourself:
Getting Started with Shark— This introduction and “Getting Started with Shark” are designed to give you an overall introduction to Shark. After covering some basic philosophy here, “Getting Started with Shark” describes basic ways to use Shark to sample your applications, features of the Session windows that open after you sample your applications, and the use of Shark’s global preferences.
Profiling Configurations— Three chapters discuss Shark’s default Configurations — its methods of collecting samples from your system or applications — and the presentation of the sampled results in Session windows. These chapters are probably the most important ones. “Time Profiling” discusses Time Profiling, the most frequently used configuration, which gives a statistical profile of processor utilization. System Tracing, discussed in “System Tracing,” provides an exact trace of user-kernel transitions, and is useful both for debugging interactions between your program and the underlying system and as a “microscope” for examining multithreaded programming issues in detail. After the complete chapters devoted to these two configurations, the remaining configurations are covered in “Other Profiling and Tracing Techniques.” Time Profile (All Thread States) is a variant of Time Profile that also samples blocked threads, and as a result is a good way to get an overview of locking behavior in multithreaded applications. Malloc Trace allows you to examine memory allocation and deallocation activity in detail. Shark can apply Static Analysis to your application in order to quickly examine rarely-traversed code paths. Equivalents of Time Profile, Malloc Trace, and an exact Call Trace, all customized for use with Java applications, are also available. Finally, the chapter gives an overview of Shark’s extensive performance counter recording and analysis capabilities.
Advanced Techniques— Shark’s basic techniques for sampling and analysis are sufficient for most purposes, but with complex applications you may need more sophisticated techniques. “Advanced Profiling Control” covers ways to start and stop Shark’s sampling very precisely, allowing you to carefully control what is sampled, in advance. You can also learn how to control Shark remotely from other machines or even to control Shark running on iOS devices attached to your machine in this chapter. “Advanced Session Management and Data Mining” looks at Shark’s symbol management and data mining techniques, which are ways to very precisely select subsets of your samples for examination after they are taken.
Custom Configurations— Shark is not limited to its default configurations. If you want to save your own custom settings for a configuration or create a new one from scratch, you will want to read the “Custom Configurations” and “Hardware Counter Configuration” chapters. The first describes how to adjust the existing configurations, while the second covers the many options relating to the use of hardware performance counters. Because there are so many possible combinations of performance counters, only a limited number of them are covered by the default configurations. Hence, this is likely to be the main area where typical Shark users will need custom configurations.
Appendices— The first appendix, “Command Reference,” provides a brief reference to Shark’s menu commands. The second, “Miscellaneous Topics,” describes several minor, miscellaneous options that do not really fit in anywhere else or are of interest only to a small minority of Shark users. The remainder of the appendices (“Intel Core Performance Counter Event List,” “Intel Core 2 Performance Counter Event List,” “PPC 750 (G3) Performance Counter Event List,” “PPC 7400 (G4) Performance Counter Event List,” “PPC 7450 (G4+) Performance Counter Event List,” “PPC 970 (G5) Performance Counter Event List,” “UniNorth-2 (U1.5/2) Performance Counter Event List,” “UniNorth-3 (U3) Performance Counter Event List,” and “Kodiak (U4) Performance Counter Event List”) provide a reference for the performance counters that you can measure with Shark.