Learn how Dynamic Caching, the next-generation shader core, hardware-accelerated ray tracing, and hardware-accelerated mesh shading of Apple family 9 GPUs can improve the performance of your Metal apps and games.
Welcome, my name is Jedd Haberstro, and I'm an Engineer in Apple's GPU, Graphics, and Displays Software Group.
I'm excited to tell you about the new Apple family 9 GPU architecture in A17 Pro and the M3 family of chips, which are at the heart of iPhone 15 Pro and the new Macs.
Across Apple's product line, the GPU powers many of the rich user experiences our customers love. Whether it be gaming on the go with the new iPhone 15 Pro, delivering silky smooth UI animations for your apps on the new iMac, or leveraging machine learning to perform advanced video and image processing on the new MacBook Pros, the GPU plays a critical role in enabling these apps.
The Metal API is used to harness the computing capabilities of Apple's GPUs, and collectively, these apps run a diverse set of Metal Shading Language programs. These shader programs can range from small, simple shaders that execute just a handful of lines of code to large, complex shaders that span hundreds of thousands of lines of code, frameworks, and libraries. What all these shaders share in common is massive data parallelism, which is the opportunity to greatly improve the performance of an app by running it in parallel. This parallelism is achieved by running Metal shader programs many times over in parallel on different inputs such as each vertex in a 3D rendered scene or each pixel of the screen.
It is the GPU, ultimately, which is responsible for executing these shaders in parallel.
At the heart of every GPU are its shader cores. Each shader core can run thousands of threads in parallel. And to scale performance even further, a GPU will have many shader cores that can also run in parallel, giving an app tens of thousands of parallel threads of execution.
Already, the GPUs in today's iPhones, iPads, and Macs have incredible performance, as well as a powerful suite of developer tools that allows app developers to maximize the GPU's potential. But with the new Apple family 9 GPUs, we are increasing performance to unprecedented levels thanks to several exciting new advancements.
First is a brand new shader core architecture that improves the performance and power efficiency of your existing apps, which, right away, benefits the experience you deliver while also meeting the demanding challenges of the next-generation of apps that you will build.
Hardware-accelerated ray tracing transparently benefits apps that already use Metal's ray tracing APIs, as well as expands the opportunities to use ray tracing to achieve rich rendering effects with great performance.
And with hardware-accelerated mesh shading, apps can build advanced geometry processing pipelines like never before.
Before we discuss these in more detail, let's look at the incredible performance apps are already able to achieve with no changes on the new Apple family 9 GPUs. This is "Baldur's Gate 3" by Larian Studios, running on the new MacBook Pro with M3 Max pictured on top, and a MacBook Pro with M2 Max pictured on bottom, each rendering with ultra video quality settings at 1800p. The M3 Max is able to deliver significant performance improvements, thanks to the next-generation shader core's ability to run the game's Metal shaders with higher thread occupancy. Here is Blender rendering an image of a barbershop scene using the Cycles path tracer, which leverages Metal ray tracing on M3 Max. Both renders were started at the exact same time, but thanks to hardware-accelerated ray tracing and the next-generation shader core, the render on the M3 Max converges significantly faster.
This is a real-time visualization of the "Toy Story 4" Antiques Mall USD rendered by Pixar's Hydra Storm. Hydra Storm uses Metal mesh shading on M3 Max, which when combined with hardware-accelerated mesh shading runs faster than ever before.
Let's now look at each of these features in more detail, starting with the next generation shader core.
The Apple family 9 GPUs are composed of several building blocks such as compute and vertex command processors that parse your Metal command buffers, a rasterizer that dispatches fragment shaders for execution, and a hierarchy of caches, including the GPU last level cache that services all GPU memory traffic.
But central to any GPU are its shader cores. These are the building blocks that execute your app's Metal shaders.
The shader core is also paired with a texture unit that can sample and write your texture resources, as well as a brand new ray tracing unit that accelerates ray intersection requests.
A shader core can be further subdivided into its constituent parts.
Each shader core has an array of execution pipelines that execute different types of instructions, such as FP32, FP16, and Integer math, which correspond to operations on variables in your shaders with Metal data types such as float, half, and int, as well as memory pipelines for read and write operations to textures and buffers.
Keeping all of these execution pipelines busy usually requires executing instructions from multiple SIMDgroups. So there's a pool that keeps track of the SIMDgroups that are running on a shader core, and a scheduler that chooses which SIMDgroup to execute instructions from next.
Typically, there's also a handful of on-chip memories for storing the different types of data a shader program may use, such as registers for storing the values of variables, threadgroup and tile memory for storing data shared across a compute threadgroup or color attachment data shared across the tile, and a cache to improve the performance of accesses to the stack and to buffers.
With this understanding of what a shader core does and its constituent pieces, I'd like to explain three new exciting advancements in the Apple family 9 GPU shader core. These advancements will increase the performance of your shaders with no changes to your app. But by better understanding how this new shader core works, you'll be able to benefit from it to an even greater degree.
The first change is dynamic shader core memory, which allows an app to achieve better thread occupancy, and as a result, often better performance.
The second change is the flexible on-chip memory. This will increase the efficiency by which your shaders access buffer, stack, threadgroup, and tile memory.
The last change is the shader core's high-performance ALU pipelines, which have increased their ability to execute in parallel. This will improve the performance of apps that perform a combination of floating point or integer math.
Before further exploring these new features, let's dive into more detail about how a shader core keeps its execution pipelines busy and the importance of thread occupancy in this endeavor.
Suppose your Metal shader, after executing some math operations using the ALU pipelines, reads a buffer whose result will be used immediately after. Accessing the buffer may require going all the way to device memory, which is a long latency operation. During this time, the SIMDgroup can't execute other operations, which causes the ALU pipelines to go unused.
To mitigate this, the shader core can execute instructions from a different SIMDgroup, which may have some ALU instructions of its own. This reduces the amount of time the ALUs go unused and allows the SIMDgroups to run in parallel, thus improving performance.
If there are additional SIMDgroups running on the shader core, this can be done many times over until the ALUs and other execution pipelines are never starved of instructions to execute.
The number of SIMDgroups that are concurrently running on a shader core is called its thread occupancy.
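To make the link between occupancy and ALU utilization concrete, here is a minimal sketch of a toy model. It is an assumption made for illustration, not a description of the real scheduler: every SIMDgroup is assumed to repeat a fixed number of ALU cycles followed by a fixed memory stall, and the function name and parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Toy model (illustrative assumption): every SIMDgroup repeats `alu_cycles`
// of math followed by `stall_cycles` of waiting on memory.
// Returns the fraction of cycles the ALU pipeline is busy, from 0.0 to 1.0.
double alu_utilization(int simdgroups, int alu_cycles, int stall_cycles) {
    assert(simdgroups > 0 && alu_cycles > 0 && stall_cycles >= 0);
    // A single SIMDgroup keeps the ALU busy for alu_cycles out of each period.
    const double per_group =
        static_cast<double>(alu_cycles) / (alu_cycles + stall_cycles);
    // Each additional SIMDgroup fills some stall time until the ALU saturates.
    return std::min(1.0, simdgroups * per_group);
}
```

With 10 ALU cycles per 30 stall cycles, one SIMDgroup keeps the ALU busy a quarter of the time, and in this model four or more are enough to hide the stall completely.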
But you may be asking yourself, what dictates how many SIMDgroups will be running concurrently on a shader core? To answer that question, let's look at an example. This is a prototypical ray tracing compute kernel that intersects a ray with an acceleration structure, inspects the intersection result, and then executes a different shading function based on the material of the primitive intersected. In this example, it supports shading both glass and leather materials.
Each line of code will use some amount of registers to store the program's variables. At different points of the program, more or fewer registers will be used depending on what the code does. In this particular example, the implementation of the shadeGlass function uses many more registers than the rest of the program.
Prior to the Apple family 9 GPU, a SIMDgroup could not begin execution on a shader core until it allocated registers from the on-chip register file. The amount allocated would be equal to the maximum register usage at any point in the program. The SIMDgroup would keep that many registers allocated for the entire duration of the SIMDgroup, even though most of those registers may go unused in large sections of the program. Thus, based on the maximum register usage, we may only be able to run, for example, four SIMDgroups at a time on a shader core because any more would require more on-chip register file memory than exists. However, thanks to the Apple family 9 GPU's new dynamic shader core memory feature, the maximum register usage no longer dictates how many SIMDgroups can be run. On-chip register memory is now dynamically allocated and deallocated over the lifetime of the shader according to what each part of the program actually uses. This allows SIMDgroups to make much more efficient use of the on-chip register file, freeing up space that would not have been available otherwise. This can have a profound impact on your app's thread occupancy, and ultimately, its performance by allowing many more SIMDgroups to run concurrently.
As I just mentioned, registers are now dynamically allocated and deallocated over the course of a SIMDgroup's lifetime. This is in part possible because the register file is now a cache instead of the permanent storage for the registers, meaning more registers can be used than can be stored on chip.
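The difference between reserving peak register usage and allocating registers on demand can be sketched with some simple arithmetic. Everything here is a hypothetical model: the `Phase` type, the function names, and the register file size are invented for illustration, and the dynamic figure is an optimistic estimate that assumes SIMDgroups are spread evenly across the phases of the program.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One phase of a shader: how many registers a SIMDgroup holds, and for how long.
struct Phase {
    int registers;
    int cycles;
};

// Static allocation: every SIMDgroup reserves its peak usage for its whole lifetime.
int static_occupancy(const std::vector<Phase>& phases, int register_file_size) {
    assert(!phases.empty() && register_file_size >= 0);
    int peak = 0;
    for (const Phase& p : phases) peak = std::max(peak, p.registers);
    assert(peak > 0);
    return register_file_size / peak;
}

// Dynamic allocation, optimistic estimate (assumption): SIMDgroups are spread
// evenly across phases, so each one holds its time-weighted average usage.
int dynamic_occupancy_estimate(const std::vector<Phase>& phases,
                               int register_file_size) {
    assert(!phases.empty() && register_file_size >= 0);
    long long weighted = 0;
    long long total_cycles = 0;
    for (const Phase& p : phases) {
        weighted += static_cast<long long>(p.registers) * p.cycles;
        total_cycles += p.cycles;
    }
    assert(weighted > 0 && total_cycles > 0);
    // register_file_size / (weighted / total_cycles), kept in integer arithmetic.
    return static_cast<int>(register_file_size * total_cycles / weighted);
}
```

For a shader that holds 32 registers for most of its run but 128 during a short shadeGlass-like phase, a 512-register file fits four SIMDgroups under static allocation, while the estimate for dynamic allocation is twelve.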
The flexible on-chip memory feature extends this treatment to the rest of the shader core's memory types, such as threadgroup and tile memory, making that a cache too.
And now that register, threadgroup, tile, stack, and buffer data are all cached on chip, this has allowed us to redesign the on-chip memories into fewer larger caches that service all these memory types. This flexibility will benefit shaders that don't make heavy use of each memory type. In the past, if a compute kernel didn't use, for example, threadgroup memory, its corresponding on-chip storage would go completely unused. Now, the on-chip storage will be dynamically assigned to the memory types that are used by your shaders, giving them more on-chip storage than they had in the past, and ultimately, better performance.
For example, for shaders with heavy register usage, that may mean higher occupancy.
For shaders that repeatedly access a large working set of buffer data, that will mean better cache hit rates, lower buffer access latency, and thus, better performance. And for apps that make heavy use of non-inline functions, such as function pointers, visible function tables, and dynamically linked shader libraries, this means more on-chip stack space to pass function parameters, and thus, faster function calls.
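As a rough illustration of why pooling helps, the sketch below compares how much of a shader's demand stays on chip with fixed per-type partitions versus one shared pool of the same total size. The capacities, the `Bytes` alias, and both function names are assumptions made for the example; they do not reflect actual hardware sizes.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Memory types named in the talk; any capacities passed in are made-up numbers.
enum MemoryType { Registers, Threadgroup, Tile, Stack, Buffer, MemoryTypeCount };
using Bytes = std::array<long, MemoryTypeCount>;

// Fixed partitions: demand beyond a type's own partition cannot borrow idle
// space that belongs to another type.
long on_chip_with_fixed_partitions(const Bytes& demand, const Bytes& partition) {
    long served = 0;
    for (int t = 0; t < MemoryTypeCount; ++t) {
        assert(demand[t] >= 0 && partition[t] >= 0);
        served += std::min(demand[t], partition[t]);
    }
    return served;
}

// Unified pool: any type may use any free space, up to the same total capacity.
long on_chip_with_unified_pool(const Bytes& demand, const Bytes& partition) {
    long total_demand = 0;
    long total_capacity = 0;
    for (int t = 0; t < MemoryTypeCount; ++t) {
        assert(demand[t] >= 0 && partition[t] >= 0);
        total_demand += demand[t];
        total_capacity += partition[t];
    }
    return std::min(total_demand, total_capacity);
}
```

A kernel that uses no threadgroup or tile memory leaves those partitions idle in the first model, while in the second model the same space can hold its registers and buffer data.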
But what happens if your app still uses more memory than there is on-chip storage for? Unmitigated, that data will spill to the next cache level or even to main memory. Fortunately, the shader core will dynamically monitor your shader's behavior and adjust the occupancy level to prevent this from occurring. This keeps data on chip and, ultimately, the execution pipelines busy.
This does mean, however, that your shader's occupancy will be impacted by how your shaders access threadgroup, tile, stack, and buffer memory in addition to its dynamic register usage.
These new hardware capabilities improve the occupancy of many apps, meaning you, the developer, need to optimize occupancy a lot less often than in the past. But if you do need to optimize occupancy further on Apple family 9 GPUs, we have developed a suite of profiling tools to help you. To learn more about how to diagnose and optimize occupancy, please refer to these talks.
The last feature of the Apple family 9 GPU shader core I'd like to discuss is its high-performance ALU pipelines.
Apple GPU shader cores have separate ALU pipelines for different instruction types, including FP16 instructions. Apple GPUs are highly optimized to execute FP16 arithmetic, and we recommend that you use FP16 data types wherever possible.
FP16 math instructions execute at peak throughput.
They use fewer registers than their FP32 equivalents. They reduce memory bandwidth if your buffers store data natively in FP16. And for situations where the source or destination variable of a math operation is not FP16 already, it can be converted to and from at no cost.
But if your app still performs other math operations, such as FP32 and integer, the Apple family 9 GPU shader core can execute instructions from all three data types in parallel to a greater degree than ever before. This can deliver up to 2x ALU performance compared to prior Apple GPUs. In order to take advantage of this extra parallelism, instructions must be executed from multiple SIMDgroups, which means increasing occupancy can improve the utilization of the ALU pipelines. Let's consider an example. Imagine there are two SIMDgroups running concurrently, both executing ALU instructions. In the past, these SIMDgroups may have had to run one after another.
But if they have FP32 and FP16 instructions to execute at different points in time, as depicted here, then their executions can be overlapped, increasing parallelism and performance.
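The overlap described here can be sketched with a small cycle-counting model. This is a simplification assumed for illustration: each instruction takes one cycle, each SIMDgroup issues in order, and two SIMDgroups can issue in the same cycle only when their next instructions target different pipelines. The `Pipe` enum and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Pipe { FP16, FP32, Int };

// Baseline: the two instruction streams run one after another.
int cycles_one_after_another(const std::vector<Pipe>& a, const std::vector<Pipe>& b) {
    return static_cast<int>(a.size() + b.size());
}

// Overlapped: in each cycle both SIMDgroups issue if their next instructions
// use different pipelines; otherwise the second SIMDgroup waits.
int cycles_with_overlap(const std::vector<Pipe>& a, const std::vector<Pipe>& b) {
    std::size_t i = 0;
    std::size_t j = 0;
    int cycles = 0;
    while (i < a.size() || j < b.size()) {
        if (i < a.size() && j < b.size()) {
            if (a[i] != b[j]) { ++i; ++j; }  // different pipelines: both issue
            else { ++i; }                    // same pipeline: b stalls this cycle
        } else if (i < a.size()) {
            ++i;                             // only a has work left
        } else {
            ++j;                             // only b has work left
        }
        ++cycles;
    }
    assert(i == a.size() && j == b.size());  // both streams fully issued
    return cycles;
}
```

When one SIMDgroup runs FP32 math while the other runs FP16 math, the overlapped count is half the serial count in this model; when both want the same pipeline at every step, nothing is gained.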
To recap what's new in the next-generation shader core, it will dynamically allocate and deallocate registers over the lifetime of a shader, which improves its thread occupancy.
It has a large on-chip cache that services registers, threadgroup, tile, stack, and buffer memory, which improves the performance of accessing those memory types.
The shader core will dynamically adjust occupancy to keep data on chip and the execution pipelines busy.
And finally, FP16, FP32 and integer operations can execute in parallel more than ever, increasing ALU performance.
Next, let's take a look at hardware-accelerated ray tracing.
With Metal ray tracing, apps can leverage the massive parallelism of Apple GPUs to intersect rays with their scene geometry. If you're not familiar with Metal ray tracing and would like to learn more, please watch Your guide to Metal ray tracing and Enhance your app with Metal ray tracing.
At the heart of the Metal ray tracing API is the intersector object that is responsible for determining the intersection point of a ray with the primitives contained in an acceleration structure. It is often invoked many times over by a ray tracing app's GPU functions, also known as shaders, and thus, is central to the app's performance.
Earlier, I showed such a GPU function when I looked at the register usage of this raytracingKernel. It creates an intersector object and finds an intersection by calling the object's intersect method.
To determine the intersection point, the intersector performs a few key stages. First, it traverses the acceleration structure to find a candidate primitive. It then invokes an intersection function, which may be provided by the app, to determine if the ray intersects the primitive.
If it does, the intersection is compared to previous intersections and the process is repeated until the closest is found.
The closest intersection is then returned to the calling GPU function for further app-specific processing.
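The stages just described can be sketched as a plain CPU loop. This is only a conceptual model and not Metal's implementation: the `Ray`, `Hit`, and `IntersectionFn` types are hypothetical, and traversal is reduced to walking a list of candidate primitive indices that is assumed to be supplied by the caller.

```cpp
#include <cassert>
#include <functional>
#include <vector>

struct Ray {
    float origin[3];
    float direction[3];
    float max_distance;
};

struct Hit {
    bool found;
    int primitive;
    float distance;
};

// Stand-in for an intersection function: returns true and writes the distance
// along the ray when the ray hits the given primitive.
using IntersectionFn = std::function<bool(const Ray&, int primitive, float& distance)>;

Hit closest_hit(const Ray& ray, const std::vector<int>& candidates,
                const IntersectionFn& intersects) {
    assert(ray.max_distance >= 0.0f);
    Hit closest = {false, -1, ray.max_distance};
    for (int primitive : candidates) {                         // traversal yields a candidate
        float distance = 0.0f;
        if (!intersects(ray, primitive, distance)) continue;   // intersection test
        if (distance < 0.0f || distance > ray.max_distance) continue;
        if (!closest.found || distance < closest.distance) {   // compare with earlier hits
            closest = Hit{true, primitive, distance};
        }
    }
    return closest;                                            // handed back to the caller
}
```

The caller then inspects the returned hit and runs whatever app-specific shading it needs.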
New in Apple family 9 GPUs, the implementation of the intersector object is hardware-accelerated, which greatly increases the performance of this critical operation.
The hardware-accelerated intersection does not execute in line with the GPU function. Thus, to facilitate the communication of the ray and the ray payload between the two, data is read and written to on-chip memory, which you can observe using the RT scratch performance counters in the new Xcode.
Now that I've discussed the role and responsibilities of the intersector, let's dissect the performance characteristics of one trip through this intersector loop using an example.
Imagine our app is executing two SIMDgroups that each wish to intersect four rays with an acceleration structure.
In this example, our acceleration structure contains the classic Cornell Box scene, with one box object and one sphere object.
The rays are cast into the scene by calling the intersect method, passing it the ray, the acceleration structure, and the intersection function table. Each SIMDgroup has two rays that intersect the box and two that intersect the sphere. In this example, the box is defined as opaque triangle primitives by using the MTLAccelerationStructureTriangleGeometryDescriptor and setting its opaque property to YES. Thus, the intersector can compute the intersections using Metal's built-in intersection function.
However, the sphere is defined procedurally using a custom bounding box intersection function that the intersector must invoke.
The custom bounding box intersection function is declared using the intersection attribute with the bounding_box parameter.
As I mentioned before, the intersect method is called by each thread that is testing a ray against the acceleration structure. So with this example in mind, let's look at how each intersect call's traversal and intersection tests are executed in a traditional implementation.
In typical usage, not all traversals will take the same amount of time to locate a primitive to test the ray against. This creates what is called execution divergence, which causes each thread in a SIMDgroup to wait for the longest traversal from that SIMDgroup before proceeding to the next stage.
And as it turns out, the same overhead compounds when executing the intersection functions too. Execution divergence causes each type of intersection function to run one after another, further reducing parallelism. In aggregate across both stages, each thread spends a large proportion of its runtime idle, waiting on the other threads in the SIMDgroup to complete, which is a major performance bottleneck.
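To put a number on that idle time, here is a minimal sketch under the assumption that every thread in a SIMDgroup must wait for the slowest thread before the group moves on. The per-thread cycle counts and the function name are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Fraction of thread-cycles spent idle when every thread in a SIMDgroup
// waits for the slowest thread to finish its traversal.
double idle_fraction(const std::vector<int>& cycles_per_thread) {
    assert(!cycles_per_thread.empty());
    const int longest =
        *std::max_element(cycles_per_thread.begin(), cycles_per_thread.end());
    assert(longest > 0);
    const long long busy =
        std::accumulate(cycles_per_thread.begin(), cycles_per_thread.end(), 0LL);
    const long long total =
        static_cast<long long>(longest) * static_cast<long long>(cycles_per_thread.size());
    return 1.0 - static_cast<double>(busy) / static_cast<double>(total);
}
```

Four threads whose traversals take 10, 40, 20, and 10 cycles spend half of their combined time waiting in this model, while four equally long traversals waste nothing.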
With that picture of a traditional implementation in mind, let's discuss how hardware-accelerated ray tracing optimizes those inefficiencies.
The first major improvement is that the hardware intersector is able to run each traversal completely independently using fixed function hardware. This is possible in part because the rays are sent to the hardware intersector for processing instead of executing in line with the GPU function. This greatly decreases the time spent traversing and also removes the overhead of the traditional traversal's execution divergence.
On the other hand, the intersection functions are Metal shading language code, so they still must be grouped into SIMDgroups to be run on the shader core. However, because the hardware intersector executes each ray independently, it is free to group together the intersection function calls from rays that originated from separate SIMDgroups.
This is the role of the reorder stage. When rays reach this stage in close proximity in time, the intersection function calls will be grouped into coherent SIMDgroups, such that the execution divergence overhead present in the traditional implementation is reduced or even completely eliminated.
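The effect of regrouping can be sketched by counting how many SIMDgroup-wide passes are needed to run all pending intersection function calls. This is a toy model assumed for illustration: it treats every call as arriving at the same time, and the function ids, SIMD width, and names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Per-thread id of the intersection function each ray needs to run.
using SimdGroup = std::vector<int>;

// Without reordering, each SIMDgroup runs one pass per distinct function
// among its own threads, with the other threads idle during that pass.
int passes_without_reorder(const std::vector<SimdGroup>& groups) {
    int passes = 0;
    for (const SimdGroup& group : groups) {
        const std::set<int> distinct(group.begin(), group.end());
        passes += static_cast<int>(distinct.size());
    }
    return passes;
}

// With reordering, calls to the same function are packed together across
// the original SIMDgroups, filling each pass up to simd_width threads.
int passes_with_reorder(const std::vector<SimdGroup>& groups, int simd_width) {
    assert(simd_width > 0);
    std::map<int, int> calls_per_function;
    for (const SimdGroup& group : groups) {
        for (int function_id : group) ++calls_per_function[function_id];
    }
    int passes = 0;
    for (const auto& entry : calls_per_function) {
        passes += (entry.second + simd_width - 1) / simd_width;  // round up
    }
    return passes;
}
```

For two four-wide SIMDgroups that each hold two box rays and two sphere rays, the model needs four half-empty passes without reordering and two full passes with it.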
So now that I've shown you how hardware-accelerated ray tracing improves the performance of your app's ray intersector calls, let's review some best practices that your apps can implement to maximize its benefits.
Our first suggestion is to use the intersector object API whenever possible. Metal also allows ray tracing to be performed using the intersection query API, but this API increases the amount of ray tracing scratch memory that must be read and written, and it also disables the reorder stage.
We also recommend when authoring custom intersection functions to avoid creating one uber function that is capable of executing many different logical intersection routines. Instead, create one Metal intersection function for each logical intersection routine. This increases the benefits of the reorder stage.
It is also important to try to minimize the size of the ray payload structure that is passed to and returned from the intersector object. This will decrease your shader's latency and potentially increase its thread occupancy.
For more details and guidance about how to optimize your ray tracing apps, please watch these talks.
To recap, the Apple family 9 GPUs greatly improve the performance of ray tracing through new hardware acceleration that features fixed function traversal blocks and an intersection function reorder stage.
And although this new hardware will improve the performance of all Metal ray tracing apps, to maximize the benefits your app derives from it, it's best to use the intersector API instead of the intersection query API whenever possible.
The last advancement in the Apple family 9 GPUs that I'd like to talk to you about is hardware-accelerated mesh shading.
Mesh shading is a flexible, GPU-driven geometry processing stage in the rendering pipeline that replaces the traditional vertex shader stage with two compute-like shaders.
Object shaders execute in the first stage and can be used to perform coarse-grained processing of app-specific inputs such as entire mesh objects. Each object threadgroup can choose to spawn a mesh grid to perform subsequent finer-grained processing. Mesh shaders comprise the second stage. Typically, a mesh threadgroup will process a constituent piece of the parent object, often referred to as a meshlet.
The output of the mesh threadgroup is a Metal mesh object that encapsulates a list of vertices and primitives to be processed by the remainder of the traditional graphics pipeline.
Mesh shading has numerous applications, such as fine-grained geometry culling, procedural geometry generation, custom app-specific geometry representations such as compressed formats, and porting geometry and tessellation shaders from other graphics APIs.
If you're unfamiliar with mesh shading in Metal, I recommend that you check out the two talks below.
With hardware-accelerated mesh shading on Apple family 9 GPUs, the most notable improvement you'll observe is much improved performance of your existing mesh shading code.
Apple family 9 GPUs are able to much more efficiently schedule object and mesh threadgroups to keep intermediate meshlet data on chip, thus reducing memory traffic.
With the new hardware also comes several Metal API enhancements. The first is support for encoding draw mesh commands into indirect command buffers. This allows GPU-driven rendering pipelines to make use of mesh shading in addition to traditional vertex shaders.
The second API enhancement expands the maximum number of threadgroups per mesh grid from 1,024 to over 1 million.
Let's now review a couple of best practices to ensure optimal mesh shading performance.
The metal::mesh object output by a mesh threadgroup has several template parameters whose sizes are important to keep as small as possible.
For the mesh's vertex and primitive data types, this can be done, for example, by removing unused attributes that may be present due to sharing those data types with other unrelated vertex or mesh functions. The mesh type must also specify the maximum number of primitives and vertices that may be output. These should not be set any larger than what your app's geometry, pipeline, and assets actually need. Being mindful of these sizes will reduce memory traffic and may increase occupancy.
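A quick way to reason about these sizes is to estimate how many bytes a single mesh threadgroup may output. The sketch below is a rough upper bound under assumed sizes; the actual storage layout of a mesh object is not specified in this talk, and the function name and all byte counts are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Rough upper bound on the bytes one mesh threadgroup may output, assuming
// tightly packed vertex data, per-primitive data, and vertex indices.
std::size_t mesh_output_bytes(std::size_t vertex_bytes, std::size_t max_vertices,
                              std::size_t primitive_bytes, std::size_t max_primitives,
                              std::size_t indices_per_primitive,
                              std::size_t index_bytes) {
    assert(index_bytes > 0);
    const std::size_t vertex_total = vertex_bytes * max_vertices;
    const std::size_t per_primitive =
        primitive_bytes + indices_per_primitive * index_bytes;
    return vertex_total + per_primitive * max_primitives;
}
```

With 64 vertices and 126 triangles, trimming a hypothetical 32-byte vertex type down to 16 bytes by dropping unused attributes lowers this estimate from 4,442 bytes to 3,418 bytes per mesh.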
If performing per-primitive culling in a mesh shader, we don't recommend writing vertex positions to the mesh object just to be culled by the hardware's subsequent culling stage. Instead, it is best to completely omit writing such primitives, as that can save substantial processing time in the remainder of the hardware's geometry processing stages.
All right, let's recap what I covered about the Apple family 9 GPUs.
The next-generation shader core increases on-chip memory utilization for better thread occupancy and performance by dynamically allocating register storage and sharing on-chip memory across many memory types.
Hardware-accelerated ray tracing greatly improves the performance of apps using the Metal ray tracing APIs, enabling new high-fidelity visual effects. And finally, mesh shading performance is greatly improved thanks to hardware acceleration, enabling more apps to customize their geometry processing pipeline.
Thank you for watching.