Using Metal System Trace in Instruments to Profile Your App

Smooth out your frame rate by checking for issues in your app's CPU and GPU utilization.

Overview

A low-performing frame rate can cause an app to feel sluggish or disruptive to its users, so it's important to remove temporary interruptions, or stutters, to optimize your app's user experience. To get information about the cause of slowness in your app's frame rate, you use Xcode's Game Performance instrument, which combines threading and system call information with the Metal System Trace instrument. By presenting important app states and rendering activities, Game Performance helps you infer the changes that are necessary to achieve consistent, smooth rendering.

Open the Template

Start the performance analysis from your Xcode project by choosing Product > Profile, or by pressing Command+I (⌘I). In the template selection window, select Game Performance.

Figure 1

Choosing the Game Performance instrument

Screenshot of the Instruments template window open with the Game Performance template highlighted.

Capture Results

To collect the data that's necessary to analyze your app's frame rate, begin by clicking the record button called out in Figure 2.

Figure 2

The instrument record button

Screenshot showing the Game Performance Instrument in its initial presentation of the run results.

Within your app, perform the actions that reproduce a slow frame rate, and then click the record button again to stop recording. You'll find the captured results in Xcode's center pane.

Identify Performance Anomalies

To expedite your review of the capture results, narrow your focus to the time when the frame rate was slow. Sometimes a frame rate anomaly is caused by infrequently skipped frames, and other times it's caused by a consistently poor frame rate. In either case, you identify frame rate anomalies by finding unexpected delays in your app's display times.

For example, callout 1 in Figure 3 highlights a display instance that took 250 milliseconds (ms) to complete. Callout 2 shows you how many vertical synchronization (vsync) events were skipped over that time.

Figure 3

Multiple vsyncs spanning a display instance

Screenshot of the display instrument with the mouse hovering over an instance that skipped many frames.

Because the 250 ms display instance is significantly longer than the display instances before it, the delay in delivering its frame will be interpreted by the user as a stutter.
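The comparison described above, one display instance lasting much longer than those before it, can be sketched as a small check over exported display durations. This is a minimal illustration, not something Instruments provides: the function name first_stutter_index and the factor threshold are hypothetical, and the durations are assumed to be in milliseconds.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first display duration that is at least
 * `factor` times the mean of all durations before it, or -1 if none is.
 * Hypothetical helper: the threshold is an assumption, not a rule from
 * Instruments. */
long first_stutter_index(const double *durations_ms, size_t count, double factor) {
    assert(durations_ms != NULL || count == 0);
    assert(factor > 1.0);

    double sum_before = 0.0;
    for (size_t i = 0; i < count; i++) {
        if (i > 0) {
            double mean_before = sum_before / (double)i;
            if (durations_ms[i] >= factor * mean_before) {
                return (long)i;
            }
        }
        sum_before += durations_ms[i];
    }
    return -1;
}
```

With durations of 16.67, 16.67, 16.67, 250, and 16.67 ms and a factor of 2, the function reports index 3, the 250 ms instance.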

In contrast, Figure 4 shows an app that maintained a consistent frame rate. In the display results, hover the mouse over a frame to check its duration.

Figure 4

Consistent 60 fps frame rate

Screenshot of the Display instrument with the mouse hovering over a display instance to check its duration.

A duration of 16.67 ms is one 60 fps frame, and because all other frames in Figure 4 consistently achieve this frame duration, there's no performance anomaly to observe.

Not all displays use a ~16 ms frame interval; for example, vertical synchronization happens every ~8 ms on ProMotion displays, which refresh at up to 120 Hz. Display instances also don't have to align with vsync for a healthy frame rate: an app that uses a 20 ms frame interval is considered to have a healthy frame rate as long as it consistently achieves 50 fps. However, the 250 ms delay shown in callout 1 of Figure 3 is much too long for smooth animations.
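The frame-interval arithmetic used in this section can be sketched as follows. The function names are hypothetical and exist only for illustration; the calculation is simply 1000 ms divided by the rate.

```c
#include <assert.h>

/* Frame interval in milliseconds for a given rate in frames per second. */
double frame_interval_ms(double frames_per_second) {
    assert(frames_per_second > 0.0);
    return 1000.0 / frames_per_second;
}

/* Number of whole vsync intervals that fit inside one display instance.
 * Truncation is intended: only complete intervals are counted. */
int vsync_intervals_spanned(double display_ms, double refresh_hz) {
    assert(display_ms >= 0.0);
    assert(refresh_hz > 0.0);
    return (int)(display_ms * refresh_hz / 1000.0);
}
```

For example, frame_interval_ms(60.0) is about 16.67 ms, frame_interval_ms(120.0) is about 8.33 ms, and frame_interval_ms(50.0) is 20 ms. A 250 ms display instance on a 60 Hz display covers 15 vsync intervals.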

Check Shader Core Utilization

After finding a performance anomaly, look to the GPU activities occurring around that time for the cause. The GPU hardware track shows your shader pipeline stages, which are collectively referred to as shader core. Any long-running stages or inconsistent durations in the track timeline can indicate a utilization issue. For example, Figure 5 shows a case where display spanned two frame intervals, which means the app unintentionally skipped a frame. To begin investigating shader core utilization as a potential cause of poor frame rate:

  1. Observe the performance anomaly; in this case, the display spanned two frame intervals.

  2. Note that the vertex shader is healthy because it completes in a small percentage of the frame interval.

  3. Hover the mouse over the fragment shader to see its duration; in this case, it ran for 36 ms, which is too long.

Figure 5

Over-utilization of the fragment shader

Screenshot of the GPU hardware results indicating that a long-running fragment shader is responsible for frameskipping.

Because the combined duration of the vertex and fragment shader is more than the duration of a 60 fps frame interval (16.67 ms), the app skipped a frame. The vertex shader ran quickly in this case, which means the app's frame rate issues are caused solely by fragment shader over-utilization.
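The budget comparison above can be expressed as a short check. This is a sketch under simplifying assumptions: it treats the vertex and fragment stages as running back to back with no overlap, and exceeds_frame_budget is a hypothetical name rather than a Metal or Instruments API.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when the combined shader stage time is longer than one
 * frame interval at the target frame rate. Assumes the two stages do
 * not overlap, which is a simplification. */
bool exceeds_frame_budget(double vertex_ms, double fragment_ms, double target_fps) {
    assert(vertex_ms >= 0.0 && fragment_ms >= 0.0);
    assert(target_fps > 0.0);

    double budget_ms = 1000.0 / target_fps;
    return vertex_ms + fragment_ms > budget_ms;
}
```

With an illustrative vertex stage of 1 ms and the 36 ms fragment stage from Figure 5, the function returns true for a 60 fps target.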

The following are additional reasons the shader core may be over-utilized:

Too many render passes

Indicated by the renderer depriving the GPU of downtime. Check the number of render passes that occur at the time of the poor frame rate by using the dependency viewer. For more information, refer to Viewing Your Frame Graph.

High resolution

Indicated by significantly more fragment shader activity for the same number of submitted vertices, as compared to when your viewport is set to a smaller size. To rule out your app's viewport size as the cause of the slowdown, temporarily reduce the viewport size to see if performance improves.

Large textures

Indicated by high synchronization time when profiling your fragment shader. The profiler shows a high percentage of time in "wait memory", as seen in Table 1 in Optimizing Performance with the Shader Profiler.

Large meshes

Indicated by a high number of vertices submitted by your app. Check the affected frame(s) using the geometry viewer. For more information, see Viewing Your Meshes with the Geometry Viewer.

Unoptimized shader code

Indicated by general shader sluggishness. If you're able to modify your app's shaders, profile them to identify hot spots, like those covered in Table 1 in Optimizing Performance with the Shader Profiler. For example, you can optimize your shaders by downsizing data types, or minimizing the use of control structures. Note that you may not know ahead of time whether your shaders can benefit from optimization until you try it out.

Check CPU Utilization

While checking your shader core utilization, look out for signs that indicate problems with your app's CPU utilization. Figure 6 shows a case where frame skipping appears to be caused by something other than the shader core.

  1. In the GPU hardware tracks, look at the moment when the shader pipeline stages are complete. This is when the final stage in the pipeline, the fragment shader, is done.

  2. Observe a ~1.5 frame-interval gap between display and when the shader core is run again.

  3. Observe a ~13 frame-interval gap between display and when the shader core is run again.

Figure 6

GPU idling for 225 ms

Screenshot of the GPU hardware results showing the GPU idles waiting for the CPU to display.

When display spans multiple frame intervals and there are gaps in the shader core timeline as shown in callouts 2 and 3 in Figure 6, it indicates that your host app's code is running long. Next, inspect your app's CPU utilization to consider whether it's responsible for poor frame rate.

Check for Long-Running Host App Code

To check your app's CPU utilization, identify your rendering thread(s) in the thread state tracks. In the case of healthy CPU utilization, your app's rendering thread(s) should show a significant amount of blocked time. Figure 7 shows an app's rendering thread selected and highlights its blocked time over a ~16 ms frame interval.

Figure 7

Checking CPU utilization

Screenshot showing that a significant amount of Blocked time signals healthy host app code.

Blocked time indicates that your renderer finished submitting its draw calls with some time to spare in the frame interval. Because the amount of blocked time shown in Figure 7 encompasses about two-thirds of its frame interval, the host app has left enough time for the shader core to start and finish its work within the same frame interval.

By contrast, if your rendering thread(s) don't show much blocked time, it's likely that your app is over-utilizing the CPU. To identify whether your app's CPU is over-utilized and to get more information about why, follow these steps, called out in Figure 8:

  1. Observe a stutter as identified by a longer than 16.67 ms display duration.

  2. Ensure that shader code isn't the cause of the low frame rate. See Check Shader Core Utilization for more information.

  3. Check the thread results for "Running" that are colored in blue.

  4. Click the thread's track to select it.

  5. Choose Profile from the view selection menu.

  6. Disclose the results list items and look for the highest weight to find the method(s) that are spending the most time in your host app code.

Figure 8

Long-running host app code

The thread state tracks are shown with an app's rendering thread selected. CPU time profiler information is shown for the selected thread in the lower pane, where a long-running host app method is selected.

A thread's time spent running is represented by the collection of blue and orange areas in the track (see callout 3 in Figure 8). If a frame interval has little blocked time, it indicates CPU over-utilization. To resolve the issue, focus your optimization efforts on improving slow-running code, like tuning the method marked by callout 6. Because the long-running methods are in your host app, you should have an idea of whether––and how––you can optimize them to run faster.
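To make the blocked-time heuristic concrete, the sketch below computes the fraction of one frame interval that a thread spent not running. It is illustrative only: RunningSpan and blocked_fraction are hypothetical names, the spans are assumed not to overlap one another, and all time outside a running span is counted as blocked, which ignores other thread states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one "Running" span on a thread track. */
typedef struct {
    double start_ms;
    double end_ms;
} RunningSpan;

/* Fraction (0.0 to 1.0) of the frame interval during which the thread
 * was not running. Spans are clipped to the frame's boundaries. */
double blocked_fraction(const RunningSpan *spans, size_t count,
                        double frame_start_ms, double interval_ms) {
    assert(spans != NULL || count == 0);
    assert(interval_ms > 0.0);

    double frame_end_ms = frame_start_ms + interval_ms;
    double running_ms = 0.0;
    for (size_t i = 0; i < count; i++) {
        double start = spans[i].start_ms > frame_start_ms ? spans[i].start_ms : frame_start_ms;
        double end = spans[i].end_ms < frame_end_ms ? spans[i].end_ms : frame_end_ms;
        if (end > start) {
            running_ms += end - start;
        }
    }
    return 1.0 - running_ms / interval_ms;
}
```

A result near zero for a frame interval corresponds to the case described above, where little blocked time suggests CPU over-utilization.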

Check CPU-GPU Pipelining

In addition to shader core and CPU utilization, more subtle causes of low frame rate involve CPU-GPU pipelining. In this context, pipelining refers to how well your app coordinates the efforts of the CPU and GPU, while maintaining a consistent frame rate. The following sections cover issues that can result from poor CPU-GPU pipelining.

Check CPU-GPU Overlap

By minimizing the amount of time the CPU and GPU wait on each other, you maximize the amount of work each chip does in parallel. That's what's meant by CPU-GPU overlap. For example, Metal provides indirect command buffers (ICBs) to increase overlap; by generating rendering commands on the GPU using ICBs, you avoid CPU waiting that otherwise results when you render with a compute kernel. For more information, see Encoding Indirect Command Buffers on the GPU.

Check Thread Prioritization

Your app can get preempted by other processes if you misconfigure thread priority. To consider these kinds of thread-related pipelining issues, check the User Interactive Load track.

Figure 9

Thread-related issues marked in orange.

Screenshot of the User Interactive Load instrument reflecting a significant number of orange areas. The existence and size of the orange areas indicate there are not enough CPU cores to process the number of runnable threads.

The orange spikes in Figure 9 indicate that runnable threads outnumbered the CPU cores available to process them. The green areas indicate the opposite: the healthy situation where enough CPU cores were available for processing. To deal with the problematic orange situations, you can use fewer threads or increase the priority of your app's threads.
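One way to use fewer threads is to cap a worker pool at the number of online CPU cores. The sketch below assumes the _SC_NPROCESSORS_ONLN query for sysconf is available, which is the case on Apple platforms and Linux but isn't guaranteed by POSIX; capped_worker_count is a hypothetical helper name.

```c
#include <assert.h>
#include <unistd.h>

/* Returns the requested thread count, limited to the number of CPU
 * cores currently online. Falls back to one core if the query fails. */
int capped_worker_count(int requested_threads) {
    assert(requested_threads > 0);

    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (cores < 1) {
        cores = 1;
    }
    return requested_threads < cores ? requested_threads : (int)cores;
}
```

A cap like this is only a starting point, because other processes also compete for the same cores.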

Check Low Thread Priority

To confirm whether low thread priority is affecting your app's frame rate, follow these steps and see the corresponding callouts in Figure 10:

  1. Observe long-running display instances.

  2. Visually confirm there are a number of skipped frames; for an app that uses a 60 fps frame interval, you'll see that vsyncs don't align with display.

  3. Select the User Interactive Load instrument.

  4. In the center pane, click and drag to select the area containing the performance anomaly observed in callout 1.

  5. Select your app in the bottom pane.

  6. Observe your app's thread state.

  7. Observe your app's thread priority.

Figure 10

Stuttering accompanied by a relatively low thread priority

Screenshot of the User Interactive Load instrument with your app selected at bottom. A state of Preempted is highlighted to indicate that a low thread priority is responsible for stuttering.

The Preempted thread state indicates that other Runnable and Running threads starved your app's thread of processing time. Low thread priority is an example of how misconfigured host app code relates to low frame rate. A priority of 45 is recommended for game rendering threads in iOS. To set your thread's priority, call pthread_attr_setschedparam(_:_:) before creating your thread with pthread_create(_:_:_:_:).
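The call order described above can be sketched in C, where the same functions are named pthread_attr_setschedparam and pthread_create. This is a minimal sketch rather than a definitive implementation: render_loop and start_render_thread are hypothetical names, and clamping the requested priority to the range the scheduling policy allows is an assumption added so that the call doesn't fail on systems with a narrower range.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical render loop; a real app would encode and commit frames here. */
static void *render_loop(void *context) {
    (void)context;
    return NULL;
}

/* Creates a thread whose attributes request `desired_priority`.
 * Returns 0 on success, or the error code of the pthread call that failed. */
int start_render_thread(pthread_t *thread, int desired_priority) {
    assert(thread != NULL);

    pthread_attr_t attr;
    int err = pthread_attr_init(&attr);
    if (err != 0) {
        return err;
    }

    int policy = SCHED_OTHER;
    struct sched_param param;
    err = pthread_attr_getschedpolicy(&attr, &policy);
    if (err == 0) {
        err = pthread_attr_getschedparam(&attr, &param);
    }
    if (err == 0) {
        /* Assumption: clamp to the range the current policy allows. */
        int lowest = sched_get_priority_min(policy);
        int highest = sched_get_priority_max(policy);
        int priority = desired_priority;
        if (priority < lowest) priority = lowest;
        if (priority > highest) priority = highest;
        param.sched_priority = priority;

        /* Set the priority on the attributes before creating the thread. */
        err = pthread_attr_setschedparam(&attr, &param);
    }
    if (err == 0) {
        err = pthread_create(thread, &attr, render_loop, NULL);
    }
    pthread_attr_destroy(&attr);
    return err;
}
```

A caller would pass 45 as the desired priority, for example start_render_thread(&thread, 45), and later join or detach the thread.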

See Also

Tools

Developing Metal Apps that Run in Simulator

Prototype and test your Metal apps in Simulator.

Supporting Simulator in a Metal App

Modify Metal apps to run in Simulator.

Frame Capture Debugging Tools

Analyze and optimize your app performance at runtime.

Optimizing Performance with the GPU Counters Instrument

Examine your app's use of GPU resources in Instruments, and tune your app as needed.