Track down even the trickiest GPU-side programming errors with enhanced reporting in Xcode 12. While Metal's API validation layer can catch most problems in a project, GPU errors can cause a host of difficult-to-debug issues.
Get an introduction to GPU-side errors and learn how to find and eliminate problems like visual corruption, infinite loop timeouts, out of bounds memory accesses, nil resource access, or invalid resource residency with Xcode 12. Discover how to enable enhanced command buffer error reporting and shader validation, use them effectively as part of your debugging strategy, and automate them in your production pipeline.
Hi, I'm Michael Harris and I'm a GPU Software Engineer at Apple. Today, I'd like to talk about the improvements we've made to Metal's debugging tools, specifically for errors in GPU-side Metal shader code. So what are a few examples of errors that we can make in Metal shader code? You could have an out-of-bounds access in global or shared memory. You could attempt to access a nil texture resource. Or you might have forgotten to call useResource when using argument buffers, resulting in invalid resource residency.
You may have a timeout, which can be caused by long-running or infinite loops. This isn't an exhaustive list, but these are some of the more common errors we Metal developers experience. These errors can often cause one another. An infinite loop may be caused by an out-of-bounds access of the loop iteration count.
The result of the GPU side error is a message like this one. In comparison, here's what we get from an API usage error on the CPU.
Let's compare and contrast these two errors, because there's a pretty large gap in useful information. For a GPU error, all we get is a message that says something went wrong, but not much about what or where. But when there's an API usage error, Metal provides a lot of useful information. It shows the API entry point the error occurred on. It shows what type of error it was. In this case, we set an offset larger than the buffer's length. There's also a call stack of exactly where the error occurred, including line and file information from your code base. Wouldn't it be nice if the GPU errors looked a bit more like the API errors?
Today, we'll show you some new tools to help improve the debugging experience of GPU errors. To help illustrate where our new tools fit in, we'll use a debugging workflow of detect, locate, classify, and fix. Metal has always had API Validation to help you catch issues early. Finding them early means that they're caught before they can cause problems further down the line. Using it, you can detect when there's API misuse, locate the function causing the issue, and classify the error message. We leave fixing the error up to you.
But what about errors in your Metal Shading Language usage on the GPU? You've seen how these errors appeared on iOS 13 and macOS Catalina. Metal provides a basic error message. It's enough to tell you that something bad happened during the execution of that command buffer, but not much else. So today, Metal's introducing two new diagnostic tools that will help improve the debug workflow: Enhanced Command Buffer Errors and the Shader Validation layer. First, let's talk about Enhanced Command Buffer Errors.
What is it? Well, it enhances your command buffer's errors. To be more specific, it improves the existing command buffer error mechanism by helping you detect and locate execution errors at the encoder level. Here's that GPU error again. There isn't a lot of actionable information here. When you're debugging a command buffer that might have hundreds of encoders, making progress is a lot of work. Here's that same error, but this time, we've turned on Enhanced Command Buffer Errors. It's an obvious improvement over what we had before. You have information about each encoder within this command buffer and that helps you narrow down the failure. Most of our encoders completed their work, but there are a few suspect encoders that had been marked as affected or faulted. That narrows our search down significantly.
Enabling Enhanced Command Buffer Errors is simple. All you have to do is create your command buffer with the new descriptor-based API and set the error options to encoder execution status. That's it. If an error occurs while the feature is enabled, you can get encoder level information about that error. Here, in our code example, we're using the encoder info error key to access the user info data of this error. This is where we'll find the array of our encoder info objects to iterate over. As you can see here, each encoder info object has the label and debugSignposts that you're already using to uniquely identify each command encoder. If you're not already using labels and signposts, now's a great time to start. The error state tells you the status of the command encoder at the time of the fault. Or alternatively, if you don't want to format it yourself, you can just log the whole error. That will print all the information related to the error.
The error state property has a few possible values: completed, pending, faulted, affected, and unknown.
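The code shown on screen is not part of this transcript. The following is a minimal sketch of the steps just described, assuming you already have an MTLCommandQueue; the function names describe and makeDiagnosticCommandBuffer are hypothetical and chosen only for illustration.

```swift
import Metal

/// Human-readable name for an encoder error state (hypothetical helper).
func describe(_ state: MTLCommandEncoderErrorState) -> String {
    switch state {
    case .completed: return "completed"
    case .pending:   return "pending"
    case .faulted:   return "faulted"
    case .affected:  return "affected"
    case .unknown:   return "unknown"
    @unknown default: return "unrecognized"
    }
}

/// Creates a command buffer that records per-encoder execution status
/// and prints that status if the command buffer ends with an error.
func makeDiagnosticCommandBuffer(on queue: MTLCommandQueue) -> MTLCommandBuffer? {
    let descriptor = MTLCommandBufferDescriptor()
    descriptor.errorOptions = .encoderExecutionStatus
    guard let commandBuffer = queue.makeCommandBuffer(descriptor: descriptor) else {
        return nil
    }

    // Handlers must be added before the command buffer is committed.
    commandBuffer.addCompletedHandler { buffer in
        guard let error = buffer.error as NSError? else { return }
        guard let infos = error.userInfo[MTLCommandBufferEncoderInfoErrorKey]
                as? [MTLCommandBufferEncoderInfo] else {
            print(error) // No encoder info available: log the whole error.
            return
        }
        for info in infos {
            let signposts = info.debugSignposts.joined(separator: ", ")
            print("\(info.label) [\(signposts)]: \(describe(info.errorState))")
        }
    }
    return commandBuffer
}
```

Encode and commit work on the returned command buffer as usual; the handler only prints when the command buffer reports an error.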
Faulted is the most important error state because it means that this encoder was directly responsible for the command buffer fault.
Affected could still indicate the faulty encoder but, unlike the faulted state, we're not 100 percent sure. A fault on one encoder could have affected multiple encoders that are running in parallel, including encoders from different processes. In the rare event where we can't tell the state of an encoder, we'll report the unknown state.
There is synergy with the existing GPU tools as well. Since the encoder info objects are in recorded order and use your labels and debug signposts, you can easily associate them with the same encoder in Metal Debugger, Metal System Trace, and other tools built into Xcode. For example, with this information you can jump right to the relevant encoder in Metal Debugger.
So when should you turn it on? First off, you should enable it on every command buffer during development and QA. That will enhance all of your internal error reports and give you quick feedback on any errors. Since Enhanced Command Buffer Errors are built right into Metal, it doesn't require any auxiliary layers. It's designed so that the API can leverage hardware functionality in its implementation. That makes it a low-overhead feature. Because it is so low overhead, you can even ship your application with the feature enabled. Since it's command buffer specific, you can target which command buffers to enable it on. As you get telemetry and bug reports, you can tune the set of command buffers to home in on the problem. That said, test your performance before enabling it on user devices. The performance impact varies across devices and workloads, so you'll want to check whether the overhead is acceptable to you.
The challenge with debugging Metal shader code is that the code base can be large and can contain a lot of places for errors to occur. The first step is knowing where to look, and Enhanced Command Buffer Errors helps with that. It can get us to the encoder level, but to go deeper Metal provides another tool.
That's where Shader Validation comes in, to detect, locate, and classify the error at the draw call level to help you debug and fix it. So let's talk about the Shader Validation layer and what it can do for you. It's a layer similar to the API Validation layer, but running on the GPU. It instruments your Metal shaders to detect logical issues as well as locate and classify them. When it detects that an operation would have caused undefined behavior, that operation is prevented and a log is created that can be used to locate the draw call, the Metal function, and possibly even the line in the shader causing the error. This tool can help you debug issues that cause command buffer errors, and it can help you detect ones that don't. This is important because there are many types of errors that don't actually cause a command buffer to fail but are still undefined behavior. Let's walk through one of those cases now.
We'll start by allocating two buffers: A and B. We want to read from buffer A but have a logic issue that causes us to read out of bounds. What happens next is undefined and depends on Metal's allocation behavior. You could get lucky and there's unallocated memory in between the two buffers. If you go out of bounds in this case, you can get a command buffer fault. Since there is a fault, it's obvious feedback that something bad has happened, and Enhanced Command Buffer Errors will narrow it down to the encoder.
But if you're unlucky, the out-of-bounds access won't cause a command buffer fault. Metal may place the buffers one after another in virtual memory with no unallocated space between them. Here, our logic error won't cause a command buffer fault. We still go out of bounds but end up landing in another allocation and either read the wrong data or corrupt another buffer. Such issues can be hard to detect and frustrating to debug, as they may appear intermittent.
The most important thing you should take from that example is that you should always test with API Validation and Shader Validation before shipping. Just because you're not seeing a command buffer fault does not rule out undefined behavior. Undefined behavior isn't always obvious and it can appear intermittent. But the good news is that Shader Validation is meant to detect these cases, including the ones that aren't obvious.
Let's go over what Shader Validation can and cannot detect. It can detect out-of-bounds device and constant memory access, out-of-bounds threadgroup memory access, and attempts to use texturing functions on a null texture object. This doesn't cover all of the common issues mentioned, but for everything else Enhanced Command Buffer Errors can help.
You won't get draw information but it will narrow down to the encoder.
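To make the earlier two-buffer example concrete, here is a CPU-side analogy only, not GPU code. It assumes a single flat array standing in for virtual memory, with buffers A and B placed back to back; all names are hypothetical. The checked read imitates the kind of bounds test an instrumented shader might perform.

```swift
// One flat array stands in for GPU virtual memory.
// Buffer A occupies slots 0..<4, buffer B occupies slots 4..<8, with no gap.
let memory: [Float] = [1, 2, 3, 4,      // buffer A
                       90, 91, 92, 93]  // buffer B
let bufferABase = 0
let bufferALength = 4

/// Unchecked read relative to buffer A's base, like a raw pointer access.
func uncheckedRead(_ index: Int) -> Float {
    return memory[bufferABase + index]
}

/// Checked read: refuses the access when the index is outside buffer A.
func checkedRead(_ index: Int) -> Float? {
    guard index >= 0 && index < bufferALength else { return nil }
    return memory[bufferABase + index]
}

let silentlyWrong = uncheckedRead(5)  // lands in buffer B: no fault, wrong data
let caught = checkedRead(5)           // nil: the bad access is detected instead
print(silentlyWrong, caught as Any)
```

The unchecked read succeeds and returns a value from the neighboring buffer, which is exactly why this class of bug can go unnoticed without validation.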
The most powerful way to use this feature when debugging is from within Xcode and enabling it is easy. First, bring up the scheme settings in your project. In the diagnostics tab, we have a new section for diagnostics specific to Metal.
Checking the box next to Shader Validation will enable the layer and Enhanced Command Buffer Errors for all command buffers. Once the layer's enabled, you still need to enable the Metal diagnostics breakpoint. The Metal diagnostics breakpoint tells Xcode to stop the execution of the program when a Shader Validation error occurs and to show the recorded GPU and CPU backtrace for that error.
Clicking the arrow to the right of Shader Validation will add the breakpoint.
Once the breakpoint has been added, you can find it in the debug navigator on the breakpoints tab. You can view the settings of this breakpoint by clicking on the blue arrow. That will bring up this interface where you can customize the breakpoint. To configure the breakpoint for Shader Validation, first, make sure the breakpoint is enabled. Then, set the type to System Frameworks and enter Metal Diagnostics into the category field. At this point, you're ready to use the feature within Xcode. Now let's jump into a demo showing it in action.
We're using the Metal Performance Shaders ray tracing sample code. We've introduced an easy-to-make GPU error into the sample for this demo. During the demo, we'll go through using Shader Validation to detect and debug this issue. First, I'll start by launching the app without Shader Validation.
That doesn't look quite right. There are missing shadows and a bunch of lines on the screen. Why this isn't rendering correctly isn't obvious, though. We're not getting any command buffer errors, so we don't know which encoder or Metal function has the bug. Before we start trying to debug this line by line, I'll use Metal's new debugging workflow by enabling Shader Validation. First, I'll bring up the scheme settings in my project, then I'll go into the diagnostics tab, and down at the bottom there are options for API Validation and Shader Validation. API Validation has been moved from a different tab to this one. Now I'll enable API Validation and Shader Validation.
Since I want to have Xcode break on the first validation error, I'll click this arrow to add the Metal diagnostics breakpoint. Now I'm set up to use Shader Validation and I'll relaunch the application. So we have some logs being printed in the console from Shader Validation indicating that it detected an error. Since I have the breakpoint enabled, Xcode has stopped my application and brought up the Metal shader where the error occurred. Xcode is also showing a shader annotation on the line that Shader Validation found had an error. I can click the shader annotation and it'll show some more details about this error.
Based on the annotation, I'm hitting an out-of-bounds memory read. Looking at this expression, there's only one memory access going on. We're reading the max distance field from the shadow ray argument, which is a pointer in device memory. There are two possibilities here: either shadow ray is null or shadow ray points to invalid memory. Since we enabled API Validation, that would have caught a null buffer binding, so we can rule that one out.
Just looking at the function here, it's not clear how the address of shadow ray is being calculated. So we'll use the GPU backtrace view in the bottom left-hand side of Xcode. This view shows the GPU backtrace of the error, which is the call stack recorded at the time the error occurred. We can traverse this call stack just like you would any other recorded call stack. I'll click on the stack frame below our function, which will jump me to the call site of the function shadow ray intersection. It looks like the variable shadow ray is what's being passed in, which is computed by taking the shadow rays argument and indexing it using the ray index variable. Since we suspect an invalid offset, we need to investigate ray index. Looking at the comment above the computation of ray index, this code intends to convert a 2D grid coordinate into a 1D array coordinate.
That's typically accomplished by multiplying the grid Y with the grid width and then adding the grid X. However, looking closely at this expression we see that instead of multiplying the grid Y and the width, we're multiplying the X and the width. That's definitely a typo. So let's correct that and then rerun the application.
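The sample's shader source is not reproduced in this transcript, so the following is a plain Swift illustration of the index arithmetic only. The function names and the 4-by-2 grid are assumptions made for the example.

```swift
/// Converts a 2D grid coordinate to a 1D array index in row-major order.
func rayIndex(x: Int, y: Int, width: Int) -> Int {
    return y * width + x
}

/// The kind of typo described in the demo: X is multiplied by the width instead of Y.
func buggyRayIndex(x: Int, y: Int, width: Int) -> Int {
    return x * width + x
}

let width = 4
let height = 2
let rayCount = width * height  // 8 rays, valid indices are 0..<8

let good = rayIndex(x: 3, y: 1, width: width)      // 1 * 4 + 3 = 7, last valid ray
let bad = buggyRayIndex(x: 3, y: 1, width: width)  // 3 * 4 + 3 = 15, past the end
print(good < rayCount, bad < rayCount)
```

For the last cell of the grid the correct formula yields the last valid index, while the typo yields an index well beyond the ray buffer, which matches the out-of-bounds read reported by Shader Validation.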
Now our app is fixed. With the help of Shader Validation and API Validation, we were able to quickly locate and classify this issue.
We realize you're not always able to run everything under Xcode. So with some additional setup, you can use Shader Validation without Xcode. That lets you use Shader Validation for use cases like automated testing. Similar to API Validation, Shader Validation can be enabled using two new environment variables we've added to the new macOS and iOS 14. These variables must be set before any Metal device is created for that process. Once a device is created, we latch their values, so any changes to them after that point will not have an effect. To enable API Validation, set MTL_DEBUG_LAYER to any non-zero value. And to enable Shader Validation, set MTL_SHADER_VALIDATION to any non-zero value. Both of these can be set at once or used independently.
The command buffer now has a new logs property, which allows you to retrieve the details of any validation errors that occurred. The first thing to note is that the logs property is only valid after a command buffer finishes. For that reason, we're doing all of our work inside the completion handler. We'll walk through this code sample showing how to use the new API and what information it provides.
Each command buffer can have multiple Shader Validation errors, so we're going to iterate through all of them. Every log object contains information about the Shader Validation error, such as the label of the encoder that had the error. That gives you the label, but there can be more information. If your Metal library was compiled from source or was compiled with debug symbols, each log may also have a debug location property. This property is the GPU stack frame containing the error, and it holds the file URL and line of the faulting expression. Alternatively, you could just use the description property. This contains all the same information formatted in an easy-to-read string.
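The code sample being walked through is not included in this transcript. A minimal sketch of the same idea might look like the following; reportValidationLogs and formatLocation are hypothetical names, and the explicit NSFastEnumerationIterator is a conservative assumption about how to walk the log container from Swift.

```swift
import Metal

/// Formats a source location for printing (hypothetical helper).
func formatLocation(function: String?, file: String?, line: Int) -> String {
    return "\(function ?? "<unknown function>") at \(file ?? "<unknown file>"):\(line)"
}

/// Prints every Shader Validation log once the command buffer has finished.
/// Call this before committing the command buffer.
func reportValidationLogs(for commandBuffer: MTLCommandBuffer) {
    commandBuffer.addCompletedHandler { buffer in
        // The logs property is only valid after the command buffer finishes.
        var iterator = NSFastEnumerationIterator(buffer.logs)
        while let entry = iterator.next() {
            guard let log = entry as? MTLFunctionLog else { continue }
            print("Encoder: \(log.encoderLabel ?? "<unlabeled>")")

            // Only present when the library carries debug information.
            if let location = log.debugLocation {
                print(formatLocation(function: location.functionName,
                                     file: location.url?.lastPathComponent,
                                     line: location.line))
            }
            // Alternatively, print(log.description) gives a preformatted string.
        }
    }
}
```

In an automated test, the same handler could record a failure instead of printing whenever at least one log is present.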
You'll also be able to find this information in the system log. You can access this log by running this highlighted command in your terminal.
When a validation error occurs, it'll show up like this. The first thing in the log is the name of the process the error occurred in. Next will be the type of error and then the error details. Finally, there's the name of the Metal file and the line information.
We have some tips to help you get the most out of Shader Validation. You can expect pipelines to take a bit longer to compile. Because of that, you should really be using the asynchronous compilation methods. That will parallelize compilation across multiple threads, which will help mitigate the increased load times during development. You should also enable debug symbols when compiling your Metal libraries. That should automatically happen if you're using a debug scheme in Xcode. But if you're invoking the Metal frontend manually, symbols can be enabled by adding the -g flag. If any of your libraries are compiled from source online, debug symbols will automatically be enabled.
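As a rough sketch of the asynchronous compilation tip, assuming a compute pipeline and an already loaded MTLLibrary, the completion-handler variant of pipeline creation could be wrapped like this. The wrapper name makePipelineAsync is hypothetical.

```swift
import Metal

/// Starts compute pipeline compilation without blocking the calling thread.
/// The completion closure receives nil if the function is missing or compilation fails.
func makePipelineAsync(device: MTLDevice,
                       library: MTLLibrary,
                       functionName: String,
                       completion: @escaping (MTLComputePipelineState?) -> Void) {
    guard let function = library.makeFunction(name: functionName) else {
        completion(nil)
        return
    }
    // The completion-handler variant returns immediately and compiles in the background.
    device.makeComputePipelineState(function: function) { pipeline, error in
        if let error = error {
            print("Pipeline compilation failed for \(functionName): \(error)")
        }
        completion(pipeline)
    }
}
```

Requesting all pipelines up front with calls like this lets their compilation overlap, rather than paying the longer instrumented compile time one pipeline at a time.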
If you are compiling libraries online, we recommend using the #line preprocessor directive. The backtrace report uses the file name to identify a shader. Offline-compiled Metal files include this information automatically, but it's missing when compiling from source at runtime. You can manually add the file name information by using the #line directive to tell the compiler what file the source came from or to provide a useful identifier.
Due to the nature of its instrumentation, there are a few things to be aware of when enabling Shader Validation. Shader Validation is a process-wide switch that, when enabled, causes all Metal commands, including UI rendering, to go through the Shader Validation layer. Unlike Enhanced Command Buffer Errors, Shader Validation does have a high performance and memory impact. Because of this impact, we recommend enabling this feature during development and QA but not for users. Enabling the feature may also change some queries to return different values. In particular, you should always check the maxTotalThreadsPerThreadgroup and the threadExecutionWidth properties of a compute pipeline state, as these two may change when Shader Validation is enabled.
We support some level of customization on how this feature behaves, such as disabling specific checks. For example, if you're already doing null texture checks, you can safely disable texture usage instrumentation by setting the environment variable MTL_SHADER_VALIDATION_TEXTURE_USAGE=0.
While disabling some instrumentation can improve runtime and compile time performance, it's at the cost of no longer detecting some possible issues.
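For automated testing outside Xcode, one way to make sure the variables are set before any Metal device exists is to launch the test executable as a child process with them already in its environment. The sketch below assumes that approach; validationEnvironment and runValidated are hypothetical names, and the executable URL is whatever test binary you supply.

```swift
import Foundation

/// Builds the environment for a validated run, starting from an existing environment.
func validationEnvironment(base: [String: String],
                           apiValidation: Bool = true,
                           shaderValidation: Bool = true,
                           textureUsageChecks: Bool = true) -> [String: String] {
    var environment = base
    if apiValidation { environment["MTL_DEBUG_LAYER"] = "1" }
    if shaderValidation { environment["MTL_SHADER_VALIDATION"] = "1" }
    // Disabling a check trades detection coverage for speed.
    if !textureUsageChecks { environment["MTL_SHADER_VALIDATION_TEXTURE_USAGE"] = "0" }
    return environment
}

/// Launches a test executable with validation enabled and returns its exit status.
func runValidated(executable: URL, arguments: [String] = []) throws -> Int32 {
    let process = Process()
    process.executableURL = executable
    process.arguments = arguments
    // The child sees these values from launch, before it creates any Metal device.
    process.environment = validationEnvironment(base: ProcessInfo.processInfo.environment)
    try process.run()
    process.waitUntilExit()
    return process.terminationStatus
}
```

A continuous integration job could call runValidated on each test binary and treat a non-zero exit status, or any validation messages in the system log, as a failure.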
More information about what flags are supported can be found at the new Metal Validation man page. Some features are not supported when using Shader Validation. Binary function pointers and dynamic linking are not supported.
There is an additional limitation for MTLGPUFamilyMac1 as well as MTLGPUFamilyApple5 and older devices, which is that global memory accesses through pointers coming from an argument buffer are not checked. Thank you very much for coming to our session about the two new Metal debugging tools we've added this year. First, we covered Enhanced Command Buffer Errors, which is a low-overhead, in-framework tool that helps you detect and locate your faulting encoders in multiple environments, like during development and QA or even after you've shipped.
And we just covered Shader Validation, which helps you detect, locate, and classify both subtle and obvious shader errors during development and QA. Now go out and try the features. Test your apps with Enhanced Command Buffer Errors and Shader Validation. Thanks and have a great WWDC.