The seamless integration of Metal 2 with the A11 Bionic chip lets your apps and games realize entirely new levels of performance and capability. Get introduced to powerful new API features and GPU-driven capabilities of Metal 2 on A11, including imageblocks, tile shading, enhancements to raster order groups, imageblock sample coverage control, and threadgroup sharing. Understand the architecture of the Apple-designed A11 GPU and see how it creates opportunities for advances in rendering, compute, and machine learning techniques.
Metal 2 introduces a new set of APIs and shading language changes to take advantage of the architecture and new features of the A11 GPU. Let us review what is new with Metal 2 on A11. Apple designed Metal to enable rapid innovations in GPU architecture. And in turn, the Apple GPU architecture has informed the design of Metal. This deep and seamless integration of hardware and software enables exciting new possibilities for your graphics, compute, machine learning apps, and games.
Just three years after the introduction of Metal, we introduced Metal 2, the next generation of Metal, at WWDC 2017. Building on a clean and well-factored design, Metal 2 expands to include even more cutting-edge ways to access the capabilities of the GPU, such as GPU-driven rendering, which allows the GPU to dispatch graphics workloads to itself, further increasing efficiency and decreasing the draw call cost by up to 10X. In 2015, Metal expanded to support the Mac and desktop GPUs. Now, Metal 2 aligns the API to expose key features uniformly, regardless of the underlying GPU architecture. With the rise of machine learning across a wide variety of domains, Metal 2 brings a wider and more sophisticated set of functions that are aimed at accelerating inference operations for improved performance and efficiency. Metal 2 also brings a new set of optimization tools that make it far easier for you to expertly tap into the power of the GPUs on Apple platforms. And now, we can reveal more Metal 2 capabilities that we didn't announce at WWDC. Metal 2 includes a set of powerful new features that expose the unique capabilities of the Apple-designed GPU in our latest A-series chip, the A11.
Before we go into the details of A11 GPU architecture and features, let us review the architecture of a classical GPU and tile-based deferred rendering architecture. This is a simplified diagram of a classical GPU architecture. GPUs are massively parallel machines. The vertex and fragment stages that are shown in this diagram are replicated many times, and they run in parallel. There are also many optimizations such as cache hierarchies, FIFOs, early coarse depth test, and so on, that are not shown in this diagram. Fundamentally, GPUs with classical architecture take primitives and generate depth, color, data buffers, and textures. One of the defining characteristics of this architecture is that the output of the vertex stage feeds directly into the fragment stage. Let us look at the tile-based deferred rendering architecture, which is also known as TBDR. All A-series GPUs are based on TBDR architecture. TBDR makes some significant changes to the classical GPU architecture. The first major difference is that the vertex stage is not fed directly into the fragment stage. Instead, as they come out of the vertex stage, the primitives are binned into screen-aligned small tiles and stored into the memory.
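To make the binning step concrete, here is a minimal CPU-side sketch in C++ of sorting primitives into screen-aligned tiles. It is an illustration only: the real binning happens in GPU hardware, and the `Vec2` and `Triangle` types, the `binTriangles` function, the conservative bounding-box test, and the tile size are all assumptions made for this example rather than anything Metal exposes. Coordinates are assumed to be finite screen-space pixel positions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only; they are not Metal API types.
struct Vec2 { float x, y; };
struct Triangle { Vec2 v0, v1, v2; };

// Bins each triangle into every tile that its screen-space bounding box
// overlaps. Tiles are stored row-major, and each bin holds triangle indices
// in submission order.
std::vector<std::vector<std::size_t>> binTriangles(
    const std::vector<Triangle>& tris, int width, int height, int tileSize)
{
    assert(width > 0 && height > 0 && tileSize > 0);
    const int tilesX = (width + tileSize - 1) / tileSize;   // round up
    const int tilesY = (height + tileSize - 1) / tileSize;
    std::vector<std::vector<std::size_t>> bins(
        static_cast<std::size_t>(tilesX) * static_cast<std::size_t>(tilesY));

    for (std::size_t i = 0; i < tris.size(); ++i) {
        const Triangle& t = tris[i];
        float minX = std::min({t.v0.x, t.v1.x, t.v2.x});
        float maxX = std::max({t.v0.x, t.v1.x, t.v2.x});
        float minY = std::min({t.v0.y, t.v1.y, t.v2.y});
        float maxY = std::max({t.v0.y, t.v1.y, t.v2.y});

        // Skip triangles whose bounding box is entirely off screen.
        if (maxX < 0.0f || maxY < 0.0f ||
            minX >= static_cast<float>(width) ||
            minY >= static_cast<float>(height)) {
            continue;
        }

        // Clamp to the screen, then convert pixel coordinates to tile coordinates.
        minX = std::max(minX, 0.0f);
        minY = std::max(minY, 0.0f);
        maxX = std::min(maxX, static_cast<float>(width - 1));
        maxY = std::min(maxY, static_cast<float>(height - 1));
        const int tx0 = static_cast<int>(minX) / tileSize;
        const int tx1 = static_cast<int>(maxX) / tileSize;
        const int ty0 = static_cast<int>(minY) / tileSize;
        const int ty1 = static_cast<int>(maxY) / tileSize;

        for (int ty = ty0; ty <= ty1; ++ty) {
            for (int tx = tx0; tx <= tx1; ++tx) {
                const std::size_t tile =
                    static_cast<std::size_t>(ty) * static_cast<std::size_t>(tilesX) +
                    static_cast<std::size_t>(tx);
                bins[tile].push_back(i);
            }
        }
    }
    return bins;
}
```

Once every primitive has been binned, a later stage can walk one bin at a time and process all of that tile's primitives together, which is the property the fragment stage relies on.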
This change allows the vertex stage to run asynchronously relative to the fragment stage. While running the fragment stage of a renderpass, in parallel, the hardware executes the vertex stage of a future renderpass. Running the vertex stage asynchronously provides significant performance improvements. The vertex stage usually makes heavy use of fixed-function hardware, whereas the fragment stage is a heavy user of math and bandwidth. Completely overlapping them allows us to use all the hardware blocks on the GPU simultaneously. Having primitives binned to tiles allows us to process all primitives in a tile together. Let us see how we can take advantage of that. We put tile-sized, full resolution, depth, stencil, and frame buffers on the chip next to our shader cores. We call this memory tile memory. There are three important characteristics of the tile memory. First, the bandwidth between the shader core and tile memory is many times higher than the bandwidth between the GPU and the external memory, and scales proportionally with the number of shader cores.
Second, the memory access latency to the tile memory is many times lower than the latency for accesses to the external memory. Finally, tile memory consumes significantly lower power than the external memory. TBDR uses this low-latency, low-power-consumption, high-bandwidth memory for two major optimizations. First, the tile depth/stencil memory allows the hardware to generate full depth and stencil buffer information for opaque objects before the shading core starts to process them, which enables the hardware to perfectly cull occluded fragments before sending them to the shader core. If the depth buffer is not needed for the subsequent renderpasses, the full-size depth buffer can be entirely eliminated through the use of memoryless render targets, saving a large amount of memory bandwidth, storage, and power. Second, tile memory is used for storing the color buffers on the chip. Blending operations are fast because they do not need to access the full-sized frame buffer on the external memory. The contents of tile memory are written out to external memory only once, after the entire tile is processed, which saves significant amounts of power and bandwidth and improves performance. Higher occupancy is achieved thanks to this faster memory. The framebuffer fetch feature allows you to implement custom blending and enables several advanced techniques. Combined with memoryless frame buffers, many of these techniques do not need to consume external memory either. As a result, TBDR brings great performance even when bandwidth is limited. TBDR consumes much lower power, which is essential for battery-powered devices.
Let us now switch gears to A11 GPU. On A11, the first major change we made to the GPU architecture is to give you direct control of data residing in tile memory from your fragment functions. Imageblocks provide optimized access to image data residing in tile memory. You'll be able to lay out pixels in the way that makes sense to your application, yet can still be rendered to efficiently. An imageblock is a 2D data structure in tile memory. You can specify its width, height, depth, and format. Metal 2 adds texture pixel formats to the shading language to give you full control over the pixel layout through the packed data types. The second major architectural change gives you access to all pixels that are stored in tile memory at the same time.
Tile shading is the new programmable stage in Apple's A11 GPU that provides compute capabilities inline within render passes. Tile shading enables a whole new level of performance and efficiency in Metal 2. Rendering and compute operations can now share the data through the higher bandwidth, lower latency, and lower power tile memory. Tile shading is deeply integrated with imageblocks. You will be able to analyze imageblock contents, summarize that content, store imageblocks mid-scene, and even change imageblock layouts. You can also use threadgroup memory just like a regular compute kernel would. For tile shaders, the threadgroup memory is persistent. Each successive invocation of tile shader can operate on the threadgroup memory, starting with the values that are left from the previous tile shader. This is true for imageblock memory as well. They are persistent between invocations of tile and fragment shaders. Additionally, we are introducing an advanced version of raster order groups that supports imageblock and tile shading. And finally, we are extending the Metal shading language to give you full control over sample coverage for multi-sampled imageblocks. Let us see a set of rendering techniques that can take great advantage of the new architecture and the new Metal 2 features.
Tile shaders, imageblocks, and raster order groups are a great way to combine interleaved render and compute passes into a single combined pass. Deferred rendering and tiled forward rendering can be accelerated this way. Let us look at the tiled forward implementation as an example. You can pass geometry to create on-chip depth information first, then run a tile shader to create per-tile min-max depth information, run another tile shader to create a culled light list in threadgroup memory, and then run your material shaders. All of these operations can be done in one combined pass, increasing performance and saving large amounts of bandwidth, storage, and power. These features also enable efficient implementations of order-independent transparency, multi-layer alpha blending, and sub-surface scattering. Sample coverage control, tile shaders, and imageblocks enable much more efficient ways of doing custom MSAA resolves, MSAA tone mapping, and surface aggregation. To show how some of these use cases can be accelerated, we are releasing sample code for deferred rendering, tiled forward, multi-layer alpha blending, and surface aggregation.
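The two tile shader steps of the tiled forward example (per-tile min-max depth, then a culled light list) can be sketched on the CPU as follows. This is not the released sample code and not Metal shading language: `DepthRange`, `Light`, `tileDepthRange`, and `cullLights` are hypothetical names, lights are reduced to a view-space depth and a radius, and only the depth test is shown. A fuller version would presumably also test each light against the tile's screen-space bounds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only.
struct DepthRange { float minDepth, maxDepth; };
struct Light { float viewZ; float radius; };  // point light reduced to a depth interval

// Step 1: reduce the depth values of one tile to a min-max pair.
DepthRange tileDepthRange(const std::vector<float>& tileDepths)
{
    assert(!tileDepths.empty());
    const auto mm = std::minmax_element(tileDepths.begin(), tileDepths.end());
    return DepthRange{*mm.first, *mm.second};
}

// Step 2: keep only the lights whose depth interval [viewZ - radius,
// viewZ + radius] overlaps the tile's depth range. Returns light indices.
std::vector<std::size_t> cullLights(const std::vector<Light>& lights, DepthRange range)
{
    std::vector<std::size_t> visible;
    for (std::size_t i = 0; i < lights.size(); ++i) {
        const float nearZ = lights[i].viewZ - lights[i].radius;
        const float farZ = lights[i].viewZ + lights[i].radius;
        if (farZ >= range.minDepth && nearZ <= range.maxDepth) {
            visible.push_back(i);
        }
    }
    return visible;
}
```

In the combined pass described above, the result of the second step would live in threadgroup memory, so the material shaders that follow can read the short per-tile list instead of iterating over every light in the scene.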
Metal 2 on A11 advances the TBDR architecture by introducing imageblocks, tile shaders, imageblock sample coverage control, and raster order groups. Additionally, we introduced new Metal shading language changes to give you new and efficient mechanisms to share data between your compute threads and threadgroups. Let us briefly review these and other additional features and performance improvements on A11.
Let us start with imageblocks. An imageblock is a 2D data structure in tile memory. A fragment function can only access the single pixel that corresponds to its location, whereas kernels can access the entire imageblock. Each pixel can be quite complex, consisting of multiple components, and each component can be addressed as its own image plane. Imageblocks also provide bulk access to the GPU's format conversion hardware. Floating point pixels will be converted to the destination texture format when stored to device memory.
Tile shaders provide compute capabilities inline within render passes. Tile shaders can access the entire imageblock, and just like regular compute kernels, they have support for threadgroup memory. Unlike the threadgroup memory of a compute kernel, the threadgroup memory of a tile shader persists across the lifetime of a tile, just like color data persists across draws. So where before you were limited to communicating across draws within the scope of a pixel using the framebuffer fetch feature, you can now communicate between the tile dispatches and fragment draw calls using the wider tile scope.
Let us now look into how A11 improves MSAA over previous generations. Apple's A-series GPUs have a very efficient MSAA implementation. When the fragment is not an edge fragment, the hardware blending executes once per fragment, not once per sample. Additionally, you can resolve directly from tile memory to the resolve attachment and avoid incurring additional memory bandwidth. Through the use of Metal's memoryless render target feature, you can also eliminate MSAA render target memory storage entirely. With Metal 2 on A11, we took MSAA even further. While our current A-series GPUs already track edges in a pixel, the A11 GPU extends this tracking to an even finer granularity by tracking the number of unique samples within each pixel. This hardware change makes your multisample applications faster without requiring any changes to your application.
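One way to picture why tracking unique samples helps is the following CPU-side C++ sketch of a resolve. It makes no claim about how the A11 actually stores its tracking data: `MultisamplePixel`, `resolve`, the fixed count of four samples per pixel, and the plain box-filter average are all assumptions chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout for illustration only; not the hardware format.
struct Color { float r, g, b; };
constexpr std::size_t kSamplesPerPixel = 4;

struct MultisamplePixel {
    std::vector<Color> uniqueColors;                            // one entry per unique sample value
    std::array<std::uint8_t, kSamplesPerPixel> sampleToColor;   // sample -> index into uniqueColors
};

// Resolves a pixel by averaging its samples, doing work per unique color
// rather than per sample.
Color resolve(const MultisamplePixel& px)
{
    assert(!px.uniqueColors.empty());

    // Interior pixel: every sample shares one color, so no averaging is needed.
    if (px.uniqueColors.size() == 1) {
        return px.uniqueColors[0];
    }

    // Edge pixel: count how many samples each unique color covers.
    std::vector<int> coverage(px.uniqueColors.size(), 0);
    for (std::uint8_t idx : px.sampleToColor) {
        assert(idx < px.uniqueColors.size());
        ++coverage[idx];
    }

    // Weight each unique color by its share of the samples.
    Color out{0.0f, 0.0f, 0.0f};
    for (std::size_t k = 0; k < px.uniqueColors.size(); ++k) {
        const float w = static_cast<float>(coverage[k]) /
                        static_cast<float>(kSamplesPerPixel);
        out.r += w * px.uniqueColors[k].r;
        out.g += w * px.uniqueColors[k].g;
        out.b += w * px.uniqueColors[k].b;
    }
    return out;
}
```

In this sketch, a pixel with one unique color returns immediately, and a pixel with two unique colors blends two values instead of four, which is the kind of saving that finer-grained tracking makes possible.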
With A11, Metal 2 also gives you full control of this tracking metadata with imageblock sample coverage control. You can also leverage this feature in combination with threadgroup imageblocks and tile shaders. With imageblock sample coverage control, your tile pipeline can modify the GPU's sample coverage tracking data, allowing you to resolve sample data at any time in a render pass with your own custom resolve algorithm.
Raster order groups enable you to access memory from overlapping fragment functions in submission order, and allow fragment functions to communicate. A11 extends the raster order groups' functionality. First, A11 exposes the GPU's internal tile memory. Raster order groups make tile memory more useful by giving you access to it in a predictable order. Second, where raster order groups on other GPUs are limited to only one mutex per pixel, A11 can go finer-grained than that, allowing an even lighter touch and minimizing how often your threads are waiting for access.
Now let us look at how Metal 2 accelerates data sharing between threads and threadgroups. The Metal 2 shading language extends atomic functions with memory order and scope attributes. These additions enable new, flexible, and efficient ways of sharing data between threads. Before Metal 2, communicating between threadgroups required completing the kernel execution and issuing a new kernel to consume the outputs of the threadgroups of the first kernel. With Metal 2, threadgroups can communicate directly with each other. Additionally, with these new features, threads within a threadgroup can communicate without using a barrier, resulting in improved performance.
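The general idea behind memory order attributes can be sketched with the C++ standard library atomics, used here as a CPU-side stand-in rather than Metal shading language code. `Mailbox`, `publish`, and `tryConsume` are hypothetical names, and the scope attribute mentioned above has no counterpart in this example. The release store publishes the payload, and the acquire load that observes it is guaranteed to see that payload, with no barrier across all threads.

```cpp
#include <atomic>

// Hypothetical mailbox for illustration only.
struct Mailbox {
    int payload = 0;                   // plain data written by the producer
    std::atomic<bool> ready{false};    // flag that publishes the payload
};

// Producer side: write the payload, then publish it with release ordering
// so that the write to payload is visible to any thread that acquires the flag.
void publish(Mailbox& box, int value)
{
    box.payload = value;
    box.ready.store(true, std::memory_order_release);
}

// Consumer side: returns false if nothing has been published yet.
// The acquire load pairs with the release store in publish().
bool tryConsume(Mailbox& box, int& out)
{
    if (!box.ready.load(std::memory_order_acquire)) {
        return false;
    }
    out = box.payload;
    return true;
}
```

A producer thread would call `publish` once, while a consumer polls `tryConsume` until it returns true. Only the two communicating threads synchronize, which is the property that lets threads share data without stalling every other thread at a barrier.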
We also added some other significant features and capabilities to Metal 2 on A11. On A11, F16 math has overall better accuracy through improvements to rounding and maximum value handling. A11 adds support for texture cube arrays, and introduces read-write texture functionality. With A11, support for arrays of samplers comes to A-series GPUs. A11 adds the post-depth coverage feature and provides a more flexible way of dispatching compute kernels. A11 adds support for quad-scoped permute operations as well. For further details about these features, please check the Metal 2 documentation.
A11 makes many significant performance improvements to the GPU. It has up to 2X math performance when it comes to tasks for computer vision, image processing, and machine learning. But that is not the only area of improvement for performance. Let us review the improved performance and capabilities of the A11 GPU. We doubled the F16 math and texture filtering rate per clock cycle when compared to the A10 GPU. Please note: On A11, using F16 data types in your shaders when possible makes a much larger performance difference. We doubled the maximum threadgroup size from 512 to 1K on A11. The maximum multiple render target size increased from 256 bits to 512 bits. The maximum threadgroup memory size doubled from 16K to 32K. We also made significant improvements to the performance of feedback operations, also known as alpha-test and discard.
For more information about Metal 2 and links to the sample code, please visit the developer website at developer.apple.com/metal.