The seamless integration of Metal 2 with the A11 Bionic chip lets your apps and games realize entirely new levels of performance and capability. Get introduced to powerful new API features and GPU-driven capabilities of Metal 2 on A11, including imageblocks, tile shading, enhancements to raster order groups, imageblock sample coverage control, and threadgroup sharing. Understand the architecture of the Apple-designed A11 GPU and see how it creates opportunities for advances in rendering, compute, and machine learning techniques.
Metal 2 introduces a new set of APIs
and shading language changes
to take advantage of the architecture
and new features of the A11 GPU.
Let us review what is new with Metal 2 on A11.
Apple designed Metal to enable rapid innovations
in GPU architecture.
And in turn, the Apple GPU architecture
has informed the design of Metal.
This deep and seamless
integration of hardware and software
enables exciting new possibilities for your graphics,
compute, machine learning apps, and games.
Just three years after the introduction of Metal,
we introduced Metal 2, the next generation of Metal,
at WWDC 2017.
Building on a clean and well-factored design,
Metal 2 expands to include even more cutting edge ways
to access the capabilities of the GPU,
such as GPU-driven rendering,
which allows the GPU to dispatch graphics workloads to itself,
further increasing efficiency
and decreasing draw call cost by up to 10X.
In 2015, Metal expanded
to support the Mac and desktop GPUs.
Now, Metal 2 aligns the API to expose key features uniformly,
regardless of the underlying GPU architecture.
With the rise of machine learning
across a wide variety of domains,
Metal 2 brings a wider and more sophisticated set of functions
that are aimed at accelerating inference operations
for improved performance and efficiency.
Metal 2 also brings a new set of optimization tools
that make it far easier for you to expertly tap into
the power of the GPUs on Apple platforms.
And now, we can reveal more Metal 2 capabilities
that we didn't announce at WWDC.
Metal 2 includes a set of powerful new features
that expose the unique capabilities
of the Apple-designed GPU
in our latest A-series chip, the A11.
Before we go into the details
of A11 GPU architecture and features,
let us review the architecture of a classical GPU
and the tile-based deferred rendering architecture.
This is a simplified diagram of a classical GPU architecture.
GPUs are massively parallel machines.
The vertex and fragment stages that are shown in this diagram
are replicated many times, and they run in parallel.
There are also many optimizations
such as cache hierarchies, FIFOs,
early coarse depth test, and so on,
that are not shown in this diagram.
Fundamentally, GPUs with classical architecture
take primitives and generate depth, color,
data buffers, and textures.
One of the defining characteristics
of this architecture
is that the output of the vertex stage
feeds directly into the fragment stage.
Let us look at the tile-based deferred rendering architecture,
which is also known as TBDR.
All A-series GPUs are based on TBDR architecture.
TBDR makes some significant changes
to the classical GPU architecture.
The first major difference is that the vertex stage
is not fed directly into the fragment stage.
Instead, as they come out of the vertex stage,
the primitives are binned into screen-aligned small tiles
and stored into the memory.
This change allows the vertex stage
to run asynchronously relative to the fragment stage.
While the fragment stage of one renderpass runs,
the hardware executes the vertex stage
of a future renderpass in parallel.
Running the vertex stage asynchronously
provides significant performance improvements.
The vertex stage is usually making heavy use
of fixed-function hardware, whereas the fragment stage
is a heavy user of math and bandwidth.
Completely overlapping them allows us to use
all the hardware blocks on the GPU simultaneously.
Having primitives binned to tiles allows us
to process all the primitives in a tile together.
Let us see how we can take advantage of that.
We put tile-sized, full resolution,
depth, stencil, and frame buffers on the chip
next to our shader cores.
We call this memory tile memory.
There are three important characteristics
of the tile memory.
First, the bandwidth between the shader core and tile memory
is many times higher than the bandwidth
between the GPU and the external memory,
and scales proportionally with the number of shader cores.
Second, the memory access latency to the tile memory
is many times lower than the latency for the accesses
to the external memory.
Finally, tile memory consumes significantly lower power
than the external memory.
TBDR uses this low-latency, low-power-consumption,
high-bandwidth memory for two major optimizations.
First, the tile depth/stencil memory allows the hardware
to generate full depth and stencil buffer information
for opaque objects
before the shading core starts to process them,
which enables the hardware to perfectly cull fragments
that are occluded before sending them to the shader core.
If the depth buffer is not needed
for the subsequent renderpasses,
the full-size depth buffer can be entirely eliminated
through the use of memoryless render targets,
saving a large amount of memory bandwidth, storage, and power.
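As a sketch of how a memoryless, tile-only depth attachment might be set up on the CPU side (the resolution and attachment configuration here are illustrative, not from the talk):

```swift
import Metal

// Describe a depth buffer that lives only in tile memory:
// it is never allocated in, or written back to, external memory.
let depthDesc = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .depth32Float,
    width: 1920, height: 1080,
    mipmapped: false)
depthDesc.usage = .renderTarget
depthDesc.storageMode = .memoryless   // no external-memory backing store

let device = MTLCreateSystemDefaultDevice()!
let depthTexture = device.makeTexture(descriptor: depthDesc)!

// Attach it to a render pass; .dontCare on store keeps it on-chip only.
let passDesc = MTLRenderPassDescriptor()
passDesc.depthAttachment.texture = depthTexture
passDesc.depthAttachment.loadAction = .clear
passDesc.depthAttachment.clearDepth = 1.0
passDesc.depthAttachment.storeAction = .dontCare
```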
Second, tile memory is used
for storing the color buffers on the chip.
Blending operations are fast
because they do not need to access
the full-sized frame buffer on the external memory.
The color data is written out to external memory only once,
after the entire tile is processed,
saving significant amounts of bandwidth and power
and improving performance.
Higher occupancy is achieved thanks to this faster memory.
The framebuffer fetch feature
allows you to implement custom blending
and enables several advanced techniques.
Combined with memoryless frame buffers,
many of these techniques
do not need to consume external memory either.
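A minimal sketch of framebuffer fetch in the Metal shading language: the fragment function reads the pixel's current color attachment value straight from tile memory and blends against it in the shader (the input struct and blend math are illustrative):

```metal
#include <metal_stdlib>
using namespace metal;

struct FragmentIn {
    float4 position [[position]];
    half4  color;
};

// Custom programmable blend: the [[color(0)]] input is the pixel's
// current value in tile memory, read without touching external memory.
fragment half4 customBlend(FragmentIn in [[stage_in]],
                           half4 dst [[color(0)]])
{
    // Example: premultiplied-alpha "over" operator done in the shader.
    return in.color + dst * (1.0h - in.color.a);
}
```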
As a result, TBDR brings great performance
even when bandwidth is limited.
TBDR consumes much lower power,
which is essential for battery-powered devices.
Let us now switch gears to the A11 GPU.
On A11, the first major change we made to the GPU architecture
is to give you direct control of data residing in tile memory
from your fragment functions.
Imageblocks provide optimized access to image data
residing in tile memory.
You'll be able to lay out pixels
in the way that makes sense to your application,
yet can still be rendered to efficiently.
An imageblock is a 2D data structure in tile memory.
You can specify its width, height, depth, and format.
Metal 2 adds texture pixel formats
to the shading language
to give you full control over the pixel layout
through the packed data types.
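A sketch of an imageblock layout under explicit control, using the packed pixel data types; the struct and its members are illustrative, not from the talk:

```metal
#include <metal_stdlib>
using namespace metal;

// Per-pixel imageblock layout defined by the app. The packed data
// types (such as rgba8unorm<half4>) store the value compactly in tile
// memory while letting the shader operate on it as half4, and engage
// the GPU's format conversion hardware on write-out.
struct GBufferPixel {
    rgba8unorm<half4> albedo;    // 8-bit unorm storage, half4 access
    rgba8unorm<half4> normal;
    half4             lighting;  // full half precision kept on chip
    float             depth;
};
```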
The second major architectural change
gives you access to all pixels that are stored in tile memory
at the same time.
Tile shading is the new programmable stage
in Apple's A11 GPU that provides compute capabilities
inline within render passes.
Tile shading enables a whole new level of performance
and efficiency in Metal 2.
Rendering and compute operations can now share the data
through the higher bandwidth, lower latency,
and lower power tile memory.
Tile shading is deeply integrated with imageblocks.
You will be able to analyze imageblock contents,
summarize that content, store imageblocks mid-scene,
and even change imageblock layouts.
You can also use threadgroup memory
just like a regular compute kernel would.
For tile shaders,
the threadgroup memory is persistent.
Each successive invocation of the tile shader
can operate on the threadgroup memory,
starting with the values left over
from the previous tile shader.
The same is true for imageblock memory,
which is persistent between invocations
of tile and fragment shaders.
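The persistence described above might be used like this. Below is a sketch of a tile shader, written as a kernel dispatched inside a render pass; the imageblock struct, kernel name, and reduction are illustrative (a real tile pipeline also needs host-side setup that is omitted here):

```metal
#include <metal_stdlib>
using namespace metal;

struct TilePixel {
    half4 color;
    float depth;
};

// Tile shader: a compute-style kernel running inline in a render pass.
// Both the imageblock and the threadgroup allocation persist for the
// lifetime of the tile, across successive draws and tile dispatches.
kernel void analyzeTile(imageblock<TilePixel> img,
                        threadgroup atomic_uint *minDepthBits [[threadgroup(0)]],
                        ushort2 tid [[thread_position_in_threadgroup]])
{
    TilePixel p = img.read(tid);
    // Track the tile's minimum depth; the result stays resident in
    // threadgroup memory for a later tile dispatch or draw to consume.
    atomic_fetch_min_explicit(minDepthBits,
                              as_type<uint>(p.depth),
                              memory_order_relaxed);
}
```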
Additionally, we are introducing
an advanced version of raster order groups
that supports imageblock and tile shading.
And finally, we are extending the Metal shading language
to give you full control over sample coverage
for multi-sampled imageblocks.
Let us see a set of rendering techniques
that can take great advantage of the new architecture
and the new Metal 2 features.
Tile shaders, imageblocks, and raster order groups
are a great way to combine interleaved
render and compute passes into a single combined pass.
Deferred rendering and tiled forward rendering
could be accelerated this way.
Let us look at the tiled forward implementation as an example.
You can first draw your geometry to create on-chip depth information,
then run a tile shader
to create per tile min-max depth information,
run another tile shader to create a culled light list
in threadgroup memory,
and then run your material shaders.
All of these operations can be done in one combined pass,
increasing performance by eliminating
large amounts of bandwidth, storage, and power.
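The single-pass structure above might be encoded roughly as follows; the pipeline names and tile size are placeholders, and the actual draws are elided:

```swift
import Metal

func encodeTiledForward(encoder: MTLRenderCommandEncoder,
                        depthOnlyPipeline: MTLRenderPipelineState,
                        depthBoundsPipeline: MTLRenderPipelineState,
                        lightCullPipeline: MTLRenderPipelineState,
                        materialPipeline: MTLRenderPipelineState,
                        tileSize: MTLSize) {
    // 1. Depth prepass: builds on-chip depth for each tile.
    encoder.setRenderPipelineState(depthOnlyPipeline)
    // ... draw opaque geometry ...

    // 2. Tile shader: per-tile min/max depth into threadgroup memory.
    encoder.setRenderPipelineState(depthBoundsPipeline)
    encoder.dispatchThreadsPerTile(tileSize)

    // 3. Tile shader: culled light list, also left in threadgroup memory.
    encoder.setRenderPipelineState(lightCullPipeline)
    encoder.dispatchThreadsPerTile(tileSize)

    // 4. Material shading reads the light list produced by step 3 --
    //    all within the same render command encoder.
    encoder.setRenderPipelineState(materialPipeline)
    // ... draw shaded geometry ...
}
```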
These features also enable efficient implementations
of order-independent transparency,
multi-layer alpha blending, and sub-surface scattering.
Sample coverage control, tile shaders, and imageblocks
enable much more efficient ways of doing custom MSAA resolves,
MSAA tone mapping, and surface aggregation.
To show how some of these use cases can be accelerated,
we are releasing sample code for deferred rendering,
tiled forward, multi-layer alpha blending,
and surface aggregation.
Metal 2 on A11 advances the TBDR architecture
by introducing imageblocks, tile shaders,
imageblock sample coverage control,
and raster order groups.
Additionally, we introduced new Metal shading language changes
to give you new and efficient mechanisms to share data
between your compute threads and threadgroups.
Let us briefly review these and other additional features
and performance improvements on A11.
Let us start with imageblocks.
An imageblock is a 2D data structure in tile memory.
A fragment function can only access the single pixel
that corresponds to its location,
whereas kernels can access the entire imageblock.
Each pixel can be quite complex,
consisting of multiple components,
and each component can be addressed
as its own image plane.
Imageblocks also provide bulk access
to the GPU's format conversion hardware.
Floating point pixels will be converted to
the destination texture format when stored to device memory.
Tile shaders provide compute capabilities
inline within render passes.
Tile shaders can access the entire imageblock,
and just like regular compute kernels,
they have support for threadgroup memory.
Unlike the threadgroup memory of a compute kernel,
the threadgroup memory of a tile shader
persists across the lifetime of a tile,
just like color data persists across draws.
So where before you were limited
to communicating across draws within the scope of a pixel
using the framebuffer fetch feature,
you can now communicate between the tile dispatches
and fragment draw calls using the wider tile scope.
Let us now look into how A11 improves MSAA
over previous generations.
Apple's A-series GPUs
have a very efficient MSAA implementation.
When the fragment is not an edge fragment,
the hardware blending executes once per fragment,
not once per sample.
Additionally, you can resolve directly from tile memory
to the resolve attachment
and avoid incurring additional memory bandwidth.
Through the use of Metal's
memoryless render target feature,
you can also eliminate
MSAA rendertarget memory storage entirely.
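Putting those two ideas together, an MSAA color target can be memoryless and resolved directly out of tile memory; the sizes and formats below are illustrative:

```swift
import Metal

// Build a render pass whose 4x MSAA color target exists only in tile
// memory and resolves straight into the final (single-sample) texture.
func makeMSAAPass(device: MTLDevice,
                  resolveTarget: MTLTexture) -> MTLRenderPassDescriptor {
    let msaaDesc = MTLTextureDescriptor()
    msaaDesc.textureType = .type2DMultisample
    msaaDesc.pixelFormat = resolveTarget.pixelFormat
    msaaDesc.width = resolveTarget.width
    msaaDesc.height = resolveTarget.height
    msaaDesc.sampleCount = 4
    msaaDesc.usage = .renderTarget
    msaaDesc.storageMode = .memoryless   // samples never touch DRAM

    let msaaTexture = device.makeTexture(descriptor: msaaDesc)!

    let passDesc = MTLRenderPassDescriptor()
    passDesc.colorAttachments[0].texture = msaaTexture
    passDesc.colorAttachments[0].loadAction = .clear
    // Resolve directly from tile memory into the resolve attachment.
    passDesc.colorAttachments[0].storeAction = .multisampleResolve
    passDesc.colorAttachments[0].resolveTexture = resolveTarget
    return passDesc
}
```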
With Metal 2 on A11, we took MSAA even further.
While our current A-series GPUs already track edges in a pixel,
the A11 GPU extends this tracking
to an even finer granularity
by tracking the number of unique samples within each pixel.
This hardware change makes your multisample applications faster
without requiring any changes to your application.
With A11, Metal 2 also gives you
full control of this tracking metadata
with imageblock sample coverage control.
You can also leverage this feature in combination with
threadgroup imageblocks and tile shaders.
With imageblock sample coverage control,
your tile pipeline can modify
the GPU's sample coverage tracking data,
allowing you to resolve sample data
at any time in a render pass
with your own custom resolve algorithm.
Raster order groups enable you to access memory
from overlapping fragment functions in submission order,
allowing fragment functions to communicate.
A11 extends the raster order groups' functionality.
First, A11 exposes the GPU's internal tile memory.
Raster order groups make tile memory more useful
by giving you access to it in a predictable order.
Second, where raster order groups on other GPUs
are limited to only one mutex per pixel,
A11 can go finer-grained than that,
allowing an even lighter touch
and minimizing how often your threads are waiting for access.
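As a sketch of that finer granularity, the [[raster_order_group(n)]] attribute can place different pieces of per-pixel data in different groups, so overlapping fragments serialize only on the data they actually touch (the struct, field names, and update math here are illustrative):

```metal
#include <metal_stdlib>
using namespace metal;

// Two independently ordered groups: a thread waiting on group 0 does
// not block threads that only need group 1, and vice versa.
struct ROGData {
    half4 accumColor [[color(0), raster_order_group(0)]];
    float nearDepth  [[color(1), raster_order_group(1)]];
};

fragment ROGData accumulate(ROGData prev,
                            float4 pos [[position]])
{
    ROGData out;
    // Reads of `prev` synchronize per group, in submission order, so
    // overlapping fragments can safely read-modify-write each channel.
    out.accumColor = prev.accumColor * 0.5h + half4(0.25h);
    out.nearDepth  = min(prev.nearDepth, pos.z);
    return out;
}
```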
Now let us look at how Metal 2 accelerates data sharing
between threads and threadgroups.
Metal 2 shading language extends atomic functions
with memory order and scope attributes.
These new additions enable new ways of flexible
and efficient sharing of data between threads.
Before Metal 2, communicating between threadgroups
required completing the kernel execution
and issuing a new kernel to consume the outputs
of the threadgroups of the first kernel.
On Metal 2, threadgroups can communicate directly
with each other.
Additionally, with these new features,
threads within a threadgroup
can communicate without using a barrier,
resulting in improved performance.
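One way this single-launch communication might look is sketched below: a device-memory atomic counter lets the last threadgroup to finish consume every group's partial result, instead of issuing a second kernel. The kernel name, buffers, and the elided reduction are illustrative, and a production version would need stronger memory ordering than the relaxed atomics shown here:

```metal
#include <metal_stdlib>
using namespace metal;

kernel void reduceThenConsume(device atomic_uint *groupsDone [[buffer(0)]],
                              device float *partials [[buffer(1)]],
                              constant uint &groupCount [[buffer(2)]],
                              uint gid [[threadgroup_position_in_grid]],
                              uint lid [[thread_index_in_threadgroup]])
{
    // ... each threadgroup writes its partial result to partials[gid] ...

    if (lid == 0) {
        // Publish this group's completion. The group that observes the
        // final count combines all partials itself -- no second kernel.
        uint done = atomic_fetch_add_explicit(groupsDone, 1u,
                                              memory_order_relaxed) + 1u;
        if (done == groupCount) {
            // ... combine partials[0..groupCount) here ...
        }
    }
}
```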
We also added some other significant features
and capabilities to Metal 2 on A11.
On A11, F16 math has better overall accuracy
through improvements to rounding
and maximum-value handling.
A11 adds support for texture cube arrays,
and introduces read-write texture functionality.
With A11, support for arrays of samplers
comes to A-series GPUs.
A11 adds a post-depth coverage feature
and provides a more flexible way of dispatching compute kernels.
A11 adds support for quad scoped permute operations as well.
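A small sketch of the quad-scoped permute operations: values are exchanged among the four threads of a 2x2 quad without threadgroup memory or barriers. The fragment function and the averaged value are illustrative:

```metal
#include <metal_stdlib>
using namespace metal;

// Average a per-thread value across the 2x2 quad using butterfly
// shuffles; after two quad_shuffle_xor steps, all four lanes agree.
fragment half4 quadAverage(float4 pos [[position]])
{
    half v = half(pos.z);            // illustrative per-thread value
    v += quad_shuffle_xor(v, 1);     // exchange with horizontal neighbor
    v += quad_shuffle_xor(v, 2);     // exchange with vertical neighbor
    return half4(v * 0.25h);         // quad-wide average
}
```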
For further details about these features,
please check the Metal 2 documentation.
A11 makes many significant performance improvements
to the GPU.
It delivers up to 2X math performance
for tasks such as computer vision,
image processing, and machine learning.
But that is not the only area of improvement for performance.
Let us review the improved performance
and capabilities of A11 GPU.
We doubled the F16 math and texture filtering rates
per clock cycle compared to the A10 GPU.
Please note: On A11,
using F16 data types in your shaders when possible
makes a much larger performance difference.
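In shader terms, favoring F16 means keeping intermediate arithmetic in half precision; a minimal illustrative sketch (the lighting math and bindings are placeholders):

```metal
#include <metal_stdlib>
using namespace metal;

// On A11, half-precision math runs at twice the float rate, so keeping
// lighting arithmetic in half can substantially raise ALU throughput.
fragment half4 litColor(float4 pos [[position]],
                        constant half3 &lightDir [[buffer(0)]])
{
    half3 n = half3(0.0h, 0.0h, 1.0h);          // illustrative normal
    half  ndotl = saturate(dot(n, lightDir));   // all-half arithmetic
    return half4(half3(ndotl), 1.0h);
}
```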
We doubled the maximum threadgroup size
from 512 to 1K on A11.
Maximum multiple render target size increased
from 256 bits to 512 bits.
Maximum threadgroup memory size doubled from 16K to 32K.
We also made a significant improvement
to the performance of feedback operations,
also known as alpha test and discard.
For more information about Metal 2
and links to the sample code,
please visit the developer website.
Thank you for watching!