Metal is the powerful low-overhead graphics and compute technology designed to unlock the power of the GPU. Check out the latest additions to the Metal frameworks and get details about supporting tessellation in your apps and games. Discover how to take control over synchronization and learn how to use resource heaps for even more efficient memory usage. See what's new in Metal debugging and profiling tools and gain insight into analyzing and optimizing performance.
My name is Aaftab Munshi.
And my colleagues and I are really excited to share
with you the new features in Metal
in macOS Sierra and iOS 10.
But let's begin by highlighting the sessions we have
on Metal this year at WWDC.
So yesterday we had two sessions that talked
about adopting Metal in your application.
And today we have three sessions.
So this session and the two sessions
that cover the new features in Metal, which is then followed
by another session where we'll talk
about optimizing your Metal shaders.
So let's look at the features we're going to talk about.
So in the second session the features we will be talking
about are function or shader specialization and being able
to write to resources such as buffers and textures
from your fragment and vertex shaders,
wide color -- using wide color displays in your application --
and texture assets, and some new additions we've added
to Metal Performance Shaders, specifically running convolutional
neural networks on the GPU with Metal.
In this session we're going to talk about some
of the improvements we have added to the tools,
which we think you guys are going to really love.
We've also made resource heaps
and resource allocations much faster
and given you more control.
So we'll talk about that -- resource heaps
and memoryless render targets.
And I'm going to be talking about tessellation.
So let's begin.
So the first thing, let's spend a little bit of time trying
to understand why we need tessellation.
So we are seeing applications such as games rendering more
and more realistic visual content.
So what that means is in order to render such content,
we need to be able to send a large amount
of detailed geometry to the GPU.
That's what we're going to send as input.
That means lots and lots of triangles that have
to be processed, which means a large increase
in memory bandwidth.
It would be really nice
if instead we could just describe this geometry
that we want to send to the GPU as a lower resolution model,
call it a coarse mesh, and then have the GPU generate the
high-resolution geometry.
So in fact, that's what tessellation does.
Tessellation is a technique that you can use to amplify
and refine the details of your geometric object.
We have two important requirements we need to meet.
The first is that the high-resolution model,
the triangles that are generated do not get stored
in graphics memory.
We don't want to pay that bandwidth cost.
And the second is a method that's used needs
to be programmable.
So let's look at an example.
So here is a screenshot from GFXBench 4.0,
which is a benchmark released by Kishonti.
And one of the key features it focuses on is tessellation.
So here's a screenshot
of the car that's being rendered without tessellation.
You can see those rims.
They're very polygonal.
You wouldn't drive a car like that, would you?
Even the body panels have cracks in them.
And the reason for that is this is the actual geometry that's
being sent to the GPU.
So you can see not a lot of triangles, which is great --
it's exactly what we want.
What tessellation does is takes that input geometry
and produces something like that.
I think this is really cool.
So if you look at the wire frame,
you can see the GPUs actually generating,
now we're rendering lots and lots of triangles, okay?
And that's the power of tessellation.
So let's look at how tessellation works in Metal.
So just like we did with Metal, you know,
we wanted to take a clean sheet approach, right?
We wanted to design something that was --
even though there are existing APIs
that do support tessellation that you may be familiar with,
we wanted something that was really simple to grasp,
you know, easy to use, and we did not want
to leave any performance on the table.
And we think we have achieved that, and I hope you agree
after this presentation.
So tessellation is available in macOS Sierra and on iOS
with the A9 processor.
So the things I'm going to talk about are:
what does the Metal graphics pipeline look
like for tessellation?
How do I render my geometry with tessellation?
And then how do I adopt it in my application?
So let's begin.
So today when you send primitives to the GPU
with Metal, you're sending triangles, lines, or points.
With tessellation, you're sending what we call a patch.
And put simply, a patch is just a parametric surface
that is made up of spline curves.
What does that mean?
You may have heard of things
like Bezier patches or B-spline patches.
So you describe a patch by a set of control-points.
So what you see in this figure is a B-spline patch.
So you have 16 control-points or control vertices.
And what tessellation does put simply is allows you to control,
okay, how many triangles do I use to render this patch?
So you may decide, "You know what?
I don't really want a lot of triangles.
I don't care how it looks."
So you may decide just four triangles is more than enough
and you'll get a polygonal look.
Or you decide, "Hey, I really want this looking nice."
That would take a lot more triangles.
But you have that control.
So let's start.
So the first stage in the graphics pipeline
when we're doing tessellation is what we call the
tessellation factor compute kernel.
And what it does is it takes the patch --
we talked about the patch with the control-points as input --
and decides, okay, how much do I need to subdivide this?
How many triangles do I want the GPU to generate, right?
This information is captured in what we call
tessellation factors.
And I'll talk a little bit
about what these factors are a few slides later.
And you can also generate additional patch data
if you need it in a later stage.
The key thing this is a programmable stage,
that means you're writing code.
So once you've written out the tessellation
factors, the next stage is called the tessellator.
So this is a fixed function stage.
So no code to write.
But you do get knobs to configure it, okay?
So it takes those tessellation factors
and breaks the patch up into triangles.
And the key thing the tessellator does here is
that it does not store
that triangle list it generates in graphics memory.
In addition to the triangle list it has generated,
for each vertex in the triangle list it will generate what we
call a parametric coordinate -- the U and the V value.
And it uses this along with the control-points
to compute the actual position on the surface.
Okay? All right.
So the tessellator generates triangles.
Today in Metal when you want to render primitives,
you send triangles to the GPU.
And the first thing
that happens is a vertex shader is executed, right?
Well, here the tessellator's generating triangles.
So if you think logically,
the next stage would be a vertex shader, and it is.
We just call it the post-tessellation vertex shader
because it's operating on the triangles
that are generated by the tessellator.
And so it's going to execute for the vertices of the triangles
that the tessellator generated and it's going
to output transform positions.
So if you're familiar with DirectX,
this shader plays a similar role
to the domain shader in DirectX.
And then the rest of the pipeline remains the same.
We have the rasterizer and the fragment shader, right?
So you may ask, "Well, so I need to write this compute kernel
to generate the tessellation factors.
Well, can I use the vertex or fragment shader?"
Of course you can.
In fact, you don't even need to write a shader
to generate these factors; you may have precomputed them
and you can just load them in a buffer and pass
that to the tessellator.
So you have a lot of control.
But if you are generating these factors in the GPU, we recommend
that you use a compute kernel.
Because guess what?
That allows us to run that kernel asynchronously
with other draw commands.
So netting you a performance win
and I think you guys will like that.
Well, actually let's take it a step further.
You don't even need to run this kernel every frame.
Because guess what?
If you have computed the tessellation factors --
let's say you decide, "Hey, objects close
to the camera get much more tessellation,
objects further away not as much."
So once I've computed them, then depending
on how the object is moving, I can just apply a scale
and the tessellator takes that.
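As a rough sketch of that idea in Swift -- names like `camera`, `nearDistance`, and `factorBuffer` are assumptions, not from the session:

```swift
// Precomputed tessellation factors live in factorBuffer.
// Instead of re-running the compute kernel each frame, apply a
// per-draw scale based on the object's distance to the camera.
let distance = simd_distance(camera.position, object.position)
let scale = max(0.25, min(1.0, nearDistance / distance))

renderEncoder.setTessellationFactorBuffer(factorBuffer,
                                          offset: 0,
                                          instanceStride: 0)
renderEncoder.setTessellationFactorScale(scale)
```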
So really, the pipeline is really, really simple.
We have four stages.
So let's compare it with the graphics pipeline we have today.
So without tessellation we have three stages --
we have the vertex shader, the rasterizer,
and the fragment stage.
With tessellation we added a new stage, the tessellator.
It's fixed function so you don't have to write any shader.
And the vertex shader became the post-tessellation vertex shader.
We think this is really simple to understand.
I hope you agree.
So how do I render my geometry with tessellation?
There are four things I'm going to talk about.
Okay. Let's look at this post-tessellation
or post-tess vertex shader; how is this different
from the regular vertex shader?
How do I pass my patch inputs?
And I told you that the tessellator's configurable.
So let's look at how we configure it
and then draw patches.
So, well, meet the new shader, same as the old shader.
So in fact, you declare a post-tessellation vertex shader
with a vertex qualifier.
But in addition to that, you also specify this attribute
which says, "Hey, it's working on a patch."
There are two kinds of patches -- a quad and triangle patch.
And you see the number next to that?
That number tells you how many control-points this patch is
made of.
So if you had a regular vertex shader,
you would have passed a vertexID as input.
Now you pass a patchID as input.
Remember I told you the tessellator generated a
parametric UV coordinate?
Well, that's what this position_in_patch input is.
And then if you had a regular vertex shader,
you would have passed your vertex input as stage_in;
here, the patch input is passed as stage_in as well.
Everything else is the same -- you perform your computations
and you're generating a transformed vertex output.
And that's actually going to be exactly identical
because the next stage with
or without tessellation is a rasterizer.
So let's look at patch inputs.
So if you had a regular vertex shader,
you would have described your vertex input
as a struct, okay, in your shader.
And if you had decoupled the data -- that means the layout
and the buffers where the vertex inputs are coming
from do not match the declaration in the shader,
then you would have used the MTLVertexDescriptor
to describe the layout.
Well, for patches there are two inputs.
One is the per-patch input.
And remember, I told you there are one or more control-points?
So we need to specify those as inputs as well.
But the way you specify these looks identical.
So you use a MTLVertexDescriptor to specify the layout
of the patch input data in memory.
And as I showed you on the slide before, we declared that input
as a stage in as well.
And you use the attribute index to identify an element as input
in the shader with the corresponding declaration
in your MTLVertexDescriptor.
Since there can be more than one control-point, we basically have
to declare it using a template type.
And I'll talk about that in the next slide.
So let's look at an example.
So here I have my control-point data.
It has two elements.
So I'm using attributes zero and one.
And my per-patch data, which is attributes two and three.
So we combine these two things together
and this is my patch input for every patch.
So notice that templated type, patch_control_point.
That's what tells the Metal shading compiler, "Hey,
this is referring to control-point input."
Okay? And remember I told you about this number 16
or whatever the number is?
That also tells the Metal shading compiler how many
control-points there are.
So now we have all information we need to get the patch input.
And so we just pass that as stage in.
It's pretty simple, I think.
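Here's a minimal sketch of what those declarations might look like in the Metal shading language -- the attribute indices follow the example above, but the struct names and the body of the shader are illustrative only:

```metal
#include <metal_stdlib>
using namespace metal;

// Per-control-point data: attributes 0 and 1 (as in the example above).
struct ControlPoint {
    float4 position [[attribute(0)]];
    float2 texCoord [[attribute(1)]];
};

// Per-patch data: attributes 2 and 3, plus the control points.
struct PatchIn {
    patch_control_point<ControlPoint> controlPoints;
    float4 color      [[attribute(2)]];
    float4 edgeParams [[attribute(3)]];
};

// Post-tessellation vertex shader for a 16-control-point quad patch.
[[patch(quad, 16)]]
vertex float4 postTessVertex(PatchIn patchIn [[stage_in]],
                             float2 uv       [[position_in_patch]],
                             uint patchID    [[patch_id]])
{
    // Evaluate the surface position at (u, v) from the control
    // points here; this placeholder just returns the first one.
    return patchIn.controlPoints[0].position;
}
```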
So okay, how do I configure the knobs?
So there are properties
in the MTLRenderPipelineDescriptor
that you use to configure the tessellator.
want to use to generate the triangles;
it's called the partitioning mode.
You can also specify a max tessellation level.
And we think this is really, really useful
because it allows you to control the maximum amount of geometry
that the GPU will generate for your tessellated objects.
Remember, the tessellator needs to read these factors.
So you need to specify the buffer of where they come from.
So use the setTessellationFactorBuffer API
to do that.
Now, these factors tell the tessellator how much
to subdivide the patches along the edges and on the inside.
So we have two kinds of patches.
If it's a triangular patch,
there are three edges and one inside.
If it's a quad, then you have four edges and two insides.
So you specify these as half precision floating point values
that you pass in.
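A sketch of that configuration in Swift might look like this -- the function and buffer variables are assumptions:

```swift
let desc = MTLRenderPipelineDescriptor()
desc.vertexFunction = postTessVertexFunction
desc.fragmentFunction = fragmentFunction

// Tessellator knobs on the pipeline descriptor.
desc.tessellationPartitionMode = .fractionalOdd
desc.maxTessellationFactor = 16          // cap the generated geometry
desc.tessellationFactorFormat = .half    // factors are half floats
desc.tessellationFactorStepFunction = .perPatch

let pipeline = try! device.makeRenderPipelineState(descriptor: desc)

// Tell the tessellator which buffer to read the factors from.
renderEncoder.setRenderPipelineState(pipeline)
renderEncoder.setTessellationFactorBuffer(factorBuffer,
                                          offset: 0,
                                          instanceStride: 0)
```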
And then drawing.
So today when you're drawing primitives,
you're sending triangles to be rendered by the GPU,
you're either going to call drawPrimitives
or drawIndexedPrimitives.
You then specify the start vertex and the number of vertices.
And if your vertex indices are not contiguous,
you will pass an index buffer.
Well, to draw patches, you call drawPatches
or drawIndexedPatches.
You specify the start patch, the number of patches.
And if your control-point indices are not contiguous,
you specify an index buffer.
So it's just a one-to-one mapping.
And then there is the DrawIndirect variants.
And with these you do not specify
the start patch, how many patches,
and other information when you make the draw call,
but instead you pass a buffer.
And that gets filled out with this information
by a command that's running on the GPU, just like you would do
for drawPrimitives as well.
So really, if you know how to use drawPrimitives,
then drawPatches just works very similarly.
Okay? So we think this is really easy to use.
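For reference, a hedged sketch of the two draw calls in Swift -- `patchCount` and `indexBuffer` are assumptions:

```swift
// Direct draw: 16 control points per patch.
renderEncoder.drawPatches(numberOfPatchControlPoints: 16,
                          patchStart: 0,
                          patchCount: patchCount,
                          patchIndexBuffer: nil,
                          patchIndexBufferOffset: 0,
                          instanceCount: 1,
                          baseInstance: 0)

// Indexed variant, for when control-point indices are not contiguous.
renderEncoder.drawIndexedPatches(numberOfPatchControlPoints: 16,
                                 patchStart: 0,
                                 patchCount: patchCount,
                                 patchIndexBuffer: nil,
                                 patchIndexBufferOffset: 0,
                                 controlPointIndexBuffer: indexBuffer,
                                 controlPointIndexBufferOffset: 0,
                                 instanceCount: 1,
                                 baseInstance: 0)
```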
So hold on.
So I've shown you what Metal tessellation is
and how to use it.
As many of you may be familiar with
or already using tessellation in your application using DirectX
or OpenGL, you will notice Metal tessellation's a little different.
We've designed Metal tessellation
so it's incredibly straightforward
to move your existing tessellation code to Metal.
As an example, for the past few weeks we've been working
with Unity.
And in an incredibly short period of time they've been able
to integrate Metal Tessellation in the engine.
And here's what they have to say.
So we're really excited that support for Metal Tessellation,
Metal Compute and the ability to write native Metal shaders
in Unity's coming later this year.
It's incredibly exciting.
And we've also been working with Epic
to efficiently integrate Metal Tessellation in Unreal Engine 4.
And Epic is planning to release their support
in UE4 later this year, okay?
So we have UE4, we have Unity supporting Metal Tessellation.
Well, let me show you tessellation in action
in these game engines
by demonstrating two commonly used rendering techniques called
adaptive tessellation and displacement mapping.
So here we have a simple demo developed
by a few Apple engineers using Unreal Engine 4.
So let's turn tessellation off, which I have,
and get wire frame mode.
You can see there are not a lot
of triangles being sent to the GPU.
This is great.
This is exactly what we want.
We want to keep the amount of geometry we send to the GPU
to be as little as possible.
Let's turn tessellation on and see what happens.
You can see now the GPU is generating a lot more triangles.
And adaptive tessellation is a technique that allows you
to control the geometric detail where it matters.
So in this example we've decided that objects that are closer
to the camera need more detail.
So let's draw them with a lot more triangles
versus objects further away do not.
So the regions in blue represent the regions with the lowest amount
of tessellation, and the regions in red represent the regions
with the highest amount of tessellation.
I can show you as I move the slider to the right,
I can use that to increase my tessellation level
and you can see objects closer will become red.
Okay? Well, let's turn wire frame mode off.
And if you run -- as we go through this cave,
you can see there's a lot more detail, right?
If I turn tessellation off, all that detail is gone, it's lost.
Turn tessellation on, it looks really amazing.
So this is an example of how I can use tessellation
to really create rich visual scenes in my application.
And I wanted to thank the great folks at Epic
for making this happen.
So the next demo is displacement mapping running on Unity.
So here we have a sphere being rendered.
Well, let's look at how many triangles we're using
to render the sphere.
Not a lot, right?
There are about 3,000 triangles.
And what displacement mapping is, is a technique
that allows you to displace the geometry
to create incredible detail.
And it does that by looking up --
using a displacement map, which is a texture.
So you look up a value from this texture
and then use that to displace the vertex position.
Or you may actually do this procedurally if you wanted to.
But displacement mapping requires that, you know,
you're drawing lots and lots of really, really,
really small triangles.
Otherwise it doesn't work.
It creates artifacts, it just cracks.
But that's fine, you know?
We can use tessellation.
That's what it's here for.
Because we still want to send just those 3,000 triangles
to the GPU
and use tessellation to generate the small ones.
So let's turn wire frame mode off
and let's turn displacement mapping on.
As you can see now incredible detail on the sphere, right?
If I turn wire frame mode on,
you can see we're generating a lot more triangles
and they are really, really small.
In fact, let's actually animate the displacement map
so you can see the shapes changing
and let's zoom in to see detail.
You can see self-shadowing happening.
And the reason self-shadowing is happening here is
because we're actually changing the geometry,
unlike a technique many of you may be familiar
with called bump mapping
which just creates an illusion of realism.
So this is another technique which you can use
with tessellation to create incredible detail
in your application that you're rendering.
And hey, thank you to Unity for this demo.
Metal Tessellation can also be used
to accelerate digital content creation tools.
As an example, OpenSubdiv is an open source library released
by Pixar.
And it implements high-performance
subdivision surface evaluation and rendering.
Actually, it has been integrated into a number
of third-party digital content creation tools,
such as Maya from Autodesk.
And OpenSubdiv uses tessellation
to render these subdivision surfaces.
Well, we -- Apple -- have added Metal Tessellation
support to OpenSubdiv.
And I'm really excited to announce here that we plan
to release these changes
to the OpenSubdiv open source project later this summer.
Okay. I mean, here's what Pixar has to say.
As you can see, Pixar's really excited
to see a native Metal implementation
of OpenSubdiv in iOS and macOS.
So now you may be asking, "Well, what about me?
How do I move my existing tessellation code to Metal?"
Well, let me show you how.
So we'll take DirectX as an example here,
but the same rules apply to OpenGL.
So here is what the DirectX graphics pipeline looks
like with tessellation.
We have three new stages -- two of them are programmable.
They're called the hull and the domain shader.
And then we have this tessellator in the middle.
Right? So, well, okay.
How do I move this to Metal?
Notice where the domain shader sits.
It sits right after the tessellator.
Does it remind you of any other shader I showed you
in the Metal pipeline?
Yeah, I think so.
Yeah, post-tessellation vertex shader.
Because guess what?
The domain shader
with tessellation really becomes the new vertex shader.
And just like you can very easily move your HLSL
or GLSL vertex functions to Metal,
you can move these domain shaders pretty easily
to the post-tessellation vertex shader.
The tessellator is exactly the same, no changes.
So really, we have this guy, these two shaders,
the vertex and hull shader.
And we've got to make them into a compute kernel.
Okay. Let's look at how we can do that.
So since we have a vertex shader,
that means there's probably a vertex descriptor described
at runtime by the application,
because the data's probably going
to be decoupled.
So that means I need to declare a stage_in input.
But you couldn't do stage_in in a kernel.
Right? Well, now you can.
We've added support for it.
So just like in a vertex shader you use stage
in to say this is my vertex input, you can use stage
in to say this is my per-thread input.
And you can specify the actual data layout
in a MTLStageInputOutputDescriptor.
It behaves identically.
It's very similar to a MTLVertexDescriptor.
Some of the things you specify are a little different
because this is for compute, not for vertex.
And then two things to observe.
With tessellation in DirectX or OpenGL,
the vertex shader executes on the control-point of a patch.
And the hull shader has these two functions.
One that executes on a control-point and one
that executes on a patch.
The per-patch hull function is what actually generates your
tessellation factors.
So the best thing to do?
Translate all these three functions to Metal functions.
And then we'll write a Metal kernel
that will call these functions.
But don't worry, we're not going to make function calls.
The Metal compiler will in-line these.
Okay? So let's look at how this works.
So each thread basically is going
to call the control-point function for the vertex
and for the hull, right?
So let's say there were 16 control-points.
So the first thread calls the vertex
and control-point hull function,
second thread does the same thing, and so on.
Right? And any intermediate data that they produce that they want
to share, they'll put that in thread group memory,
which is this local memory
which is high-performance, very low-latency.
So we're not going after graphics memory.
And then if there were 16 control-points,
there will be 16 threads operating on these.
Only one of them needs to execute the per-patch hull function.
That means you typically have a barrier,
and then only one of the threads will execute
the per-patch hull function.
You have a conditional check saying, "Hey,
is my thread-in-threadgroup ID 0?
Then call this thing."
And this is the function
that will output the tessellation factors
to graphics memory.
If you had any additional patch data you wanted
to output, you could do so.
And if you really, really, really, really wanted
to output the control-point data, you can do so.
But we find in most cases the control-point data is just
passed through.
It's the nature of the graphics pipeline
and of these existing APIs
that requires you to pass them through.
But you're just passing them through; don't write it out.
You already have them in your buffer, okay?
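Putting the pieces above together, a kernel along these lines is one way to structure it -- the input and output structs and the three translated functions are illustrative stand-ins for your own ported DirectX code:

```metal
#include <metal_stdlib>
using namespace metal;

// Illustrative stand-ins for your translated DirectX functions/types.
struct ControlPointIn   { float3 position [[attribute(0)]]; };
struct VertexOut        { float3 position; };
struct HullControlPoint { float3 position; };

static inline VertexOut vertexFunc(ControlPointIn in) {
    return VertexOut{ in.position };
}
static inline HullControlPoint hullPerControlPoint(VertexOut v, uint i) {
    return HullControlPoint{ v.position };
}
static inline MTLQuadTessellationFactorsHalf
hullPerPatch(threadgroup const HullControlPoint *points) {
    MTLQuadTessellationFactorsHalf f;
    for (int i = 0; i < 4; ++i) f.edgeTessellationFactor[i] = 16.0h;
    f.insideTessellationFactor[0] = 16.0h;
    f.insideTessellationFactor[1] = 16.0h;
    return f;
}

// One threadgroup per patch, one thread per control point.
kernel void tessFactorKernel(
    ControlPointIn in [[stage_in]],
    device MTLQuadTessellationFactorsHalf *factors [[buffer(0)]],
    uint tid [[thread_index_in_threadgroup]],
    uint gid [[threadgroup_position_in_grid]])
{
    threadgroup HullControlPoint sharedPoints[16];

    // Every thread runs the vertex and per-control-point hull
    // functions, sharing results through fast threadgroup memory.
    sharedPoints[tid] = hullPerControlPoint(vertexFunc(in), tid);

    // Wait until all 16 control points are in threadgroup memory.
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Only one thread runs the per-patch hull function, which
    // writes the tessellation factors out to graphics memory.
    if (tid == 0) {
        factors[gid] = hullPerPatch(sharedPoints);
    }
}
```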
Let me close.
So I hope I have shown you
that Metal Tessellation is simple and easy to use.
We designed it from the ground up for performance.
I've shown you how easy it is
to adapt your existing tessellation code to Metal.
It's available on iOS and macOS.
So now it's your turn.
Show us, you know, use tessellation
and create some amazing visuals
that you can render in the application.
So I want to thank you for your time.
I'm going to call my colleague, James, and he's going to talk
to you about resource heaps and memoryless render targets.
Thank you, Aaftab.
For the next part of this session I'm excited
to introduce two new Metal features available in iOS
and tvOS - resource heaps and memoryless render targets.
These features enable you to take control
of your resource management for greater CPU
and memory efficiency.
I'll introduce resource heaps first,
followed by memoryless render targets.
So resource heaps are a new lower overhead resource
management option in Metal.
Now, you can already create buffers and textures in Metal,
so why do we need another way?
Well, creating resources through the existing Metal API
with a device is easy and convenient
and many developers appreciate the simplicity.
On the other hand, as many
of your Metal apps render increasingly rich
and complex scenes, you asked for finer control
over your Metal resources to unlock greater CPU
and memory efficiency.
That's why we are introducing resource heaps.
Resource heaps enable fast resource creation and binding
through resource sub-allocation.
The flexibility of resource heaps saves you memory
by allowing multiple resources to alias in memory.
And finally, the efficiency and flexibility
of resource heaps is made possible by you taking control
over tracking resource dependencies
with explicit command synchronization.
Now, let's dive into each one of these features starting
with resource sub-allocation.
Before talking about the details of sub-allocation,
let's first discuss why device-based resource creation
can be expensive.
Creating an individual resource
with a Metal device involves multiple steps:
Allocating the memory; preparing the memory for the GPU;
clearing the memory for security; and then, finally,
creating the Metal object.
Each one of these steps takes time and a majority
of the time is spent in memory operations.
But there are situations when you need to create resources
on your performance-critical path
without introducing performance hitches.
Texture streaming is one example
or perhaps you have an image processing app that needs
to generate a number
of temporary textures to execute a filter.
The cost of binding resources
to command encoders can also become a performance issue.
Metal must track each unique resource bound
to a command encoder to make sure
that the GPU can access the memory.
And for complex scenes, this cost can add up as well.
Resource sub-allocation addresses both
of these performance issues.
Remember that the expensive part of resource creation is
in the memory operations.
With resource heaps you can perform the memory operations
ahead of time outside of your game loop.
Resource heaps address the binding cost by allowing you
to sub-allocate many logical resources from a single heap.
By sub-allocating multiple resources from one heap,
Metal tracks one memory allocation instead
of one per individual resource.
This significantly reduces your driver overhead.
Now, let's compare resource creation
between the Metal device and the new Metal resource heap.
When you create a resource with a device, Metal will allocate
and prepare a block of memory
and then create the Metal object.
So for four resources, Metal will allocate
or prepare four blocks of memory.
Now, compare that to the MTLHeap.
When you use a MTLHeap for resource creation,
you first create the heap object ahead of time.
Metal will allocate and prepare a block of memory
of the requested size.
And if you do this ahead of time outside of your render loop,
the expensive part of resource creation is complete.
Now, to create four resources out of the MTLHeap,
Metal only needs to reserve a piece of the heap's memory
and create the resource metadata.
This is much faster.
Now let's see what happens when we want
to release some resources.
When a device-based resource is released,
the Metal object is destroyed,
but the device will also free the memory resource allocation.
On the other hand, when releasing a heap resource,
only the object is destroyed.
The memory is still owned by the heap.
So creating a new resource
on the device will incur another expensive memory allocation,
whereas the heap can quickly reassign the free memory
to another resource.
Let me show you how easy it is
to sub-allocate Metal resources with Swift.
So like many Metal objects,
the Metal resource heap has a corresponding descriptor object.
So let's create a heap descriptor and set the size
to the amount of memory to back the heap.
With the heap descriptor we can ask the device
to create us a heap object.
Remember, this is the slower operation, so do this ahead
of time, like when your app starts
or at content loading time.
With the constructed heap,
we can call its resource creation methods,
which should look very familiar since the name
and arguments are the same as the device equivalents.
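The flow just described might look like this in Swift -- the 64 MB size and the descriptors are assumptions:

```swift
// Ahead of time (app start / content loading): create the heap.
// This is the slow part, with the memory operations.
let heapDesc = MTLHeapDescriptor()
heapDesc.size = 64 * 1024 * 1024      // e.g. 64 MB
heapDesc.storageMode = .private

guard let heap = device.makeHeap(descriptor: heapDesc) else {
    fatalError("heap creation failed")
}

// On the performance-critical path: fast sub-allocation.
// Same names and arguments as the device equivalents.
let texture = heap.makeTexture(descriptor: textureDesc)
let buffer  = heap.makeBuffer(length: 1024,
                              options: .storageModePrivate)
```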
So before moving on to the next topic I'd
like to share some best practices
for using resource heaps for sub-allocation.
Now, the most important tip is to use resource heaps
to create resources on your performance-critical path.
Creating resources using the device is not designed
for your game loop; resource heaps are.
Allocating resources of varying sizes can lead to fragmentation
of a heap's memory
if the resources have varying lifetimes.
So use multiple heaps and bucket resources by size
to limit the effects of fragmentation.
Now, you may also be wondering how
to choose an appropriate heap size.
Well, Metal provides two new methods on the Metal device
to query the size and alignment of a texture and buffer.
Use these queries to help you calculate the heap size
that you need.
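For example, a sketch of sizing a heap for one texture and one buffer -- the descriptor and lengths are assumptions:

```swift
// Query how much heap memory each resource will need.
let texInfo = device.heapTextureSizeAndAlign(descriptor: textureDesc)
let bufInfo = device.heapBufferSizeAndAlign(length: 1024,
                                            options: .storageModePrivate)

// Round each size up to its alignment before summing.
func alignUp(_ size: Int, _ align: Int) -> Int {
    return (size + align - 1) / align * align
}

let heapDesc = MTLHeapDescriptor()
heapDesc.size = alignUp(texInfo.size, texInfo.align)
              + alignUp(bufInfo.size, bufInfo.align)
```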
Okay. Let's move on to the next feature
of resource heaps -- Resource aliasing.
Resource aliasing allows multiple dynamic resources
to occupy the same memory,
therefore reducing the total memory footprint
of the resources.
Dynamic resources have contents that are regenerated each frame
and include things like your shadow maps, your G-buffer data,
or temporary textures used in post-processing.
Here we have a heap containing two nonaliasing resources.
Compare that to this heap containing the same two
resources but now they are aliasing.
Now, you can obviously see
that the aliasing resources can fit inside a much smaller heap.
Let's apply resource aliasing to this game frame.
The shadow map passes render a set of shadow maps --
one for each light in the scene.
So here in our heap we have a number of shadow maps.
And in the main pass during fragment processing the shaders
will sample the shadow maps to determine
if each object is in shadow.
Now, after the main pass ends, the contents
for the shadow maps are completely consumed.
They will be regenerated in the next frame.
So after the main pass ends, we execute a post-processing chain
that can consist of a number of off-screen render passes,
each executing a specific filter like a blur or bloom.
These filters will store their contents into textures
to pass filter results to the next stages of the chain.
Now, the key takeaway here is that the contents
for the shadow maps and the post-processing textures are
never used at the same time.
So why not share the memory?
So let me show you how to create these aliasing resource sets.
Now, the first section should look familiar.
First we ask the device to create us a heap
and we create our three shadow maps.
Okay. Now we see a new method, makeAliasable.
By calling makeAliasable
on a heap resource you are telling the heap to consider
that resource's memory to be free.
The shadow maps are still active, but their memory is free
to be reassigned by the heap to new resources.
So now when we create the post-processing textures
on the same heap, they can occupy the same memory
as the shadow maps.
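A condensed sketch of that sequence -- the descriptor names are assumptions:

```swift
// Create the shadow maps first; they are used first in the frame.
let shadowMaps = (0..<3).map { _ in
    heap.makeTexture(descriptor: shadowMapDesc)!
}

// After the main pass has consumed the shadow maps, tell the heap
// their memory is free to be reassigned. The textures stay valid.
shadowMaps.forEach { $0.makeAliasable() }

// These post-processing textures can now alias the shadow maps'
// memory inside the same heap.
let blurTexture  = heap.makeTexture(descriptor: postProcessDesc)
let bloomTexture = heap.makeTexture(descriptor: postProcessDesc)
```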
So now let's talk about some best practices
for resource aliasing.
To maximize memory reuse
for dynamic resources call resource creation methods
in the same sequence that their resources are used in a frame.
That will allow you to interleave makeAliasable calls
when the resource contents have been consumed.
And you want to keep dynamic
and static resources in separate heaps.
Static resources are generally not aliasable and can end
up preventing dynamic resources from aliasing
with each other due
to fragmentation of the heap's memory.
Next I'm going to talk about how to synchronize command access
to your heap resources.
So, so far we have discussed fast resource creation
with sub-allocation and efficient memory usage
with resource aliasing.
But remember that resource heaps are fast and flexible
because you control the synchronization
of heap resources.
This is something you do not have to do
with device resources.
But unlike device resources, Metal won't know
when a command modifies the contents of a heap resource
like when a render pass stores new contents to a texture.
Metal also doesn't know when you're changing interpretation
of the heap's memory from one aliasing set to another.
But for correctness, Metal needs to know
when a command is updating a heap resource
so that other commands can safely read the results.
This is especially important
because the GPU can execute multiple commands in parallel.
So to synchronize access to heap resources,
your application will create and manage GPU fences
to communicate resource dependencies across commands.
Let's take a closer look at how GPU fences work.
So a GPU fence is a timestamp.
It is a reference point in the GPUs execution timeline.
Now, you can encode two actions with fences
to synchronize commands.
A command can update a fence to move the timestamp forward
when the command is finished.
And a command can wait on a fence to wait
until the GPU has reached the most recent fence update.
Okay. Let's bring back the previous game frame
and I will show you how to use fences
to synchronize command access to the aliasing heap resources.
So here again is the example frame, a three-part frame,
but now we have five boxes because two
of the render passes are split into the vertex
and fragment processing steps.
So we have a shadow pass, a main pass,
and finally a post-processing pass
that we will execute with compute.
So Metal commands are submitted
in serial order to the command queue.
So maybe it's not quite clear
yet why we need any synchronization across commands.
But GPUs are very parallel machines and can operate
on multiple commands in parallel.
GPUs in our iOS and tvOS products can execute vertex,
fragment, and compute commands all in parallel
to maximize GPU utilization.
The GPU can even be working
on multiple frames at the same time.
So maybe now you spot a problem.
Look at these two commands that are highlighted.
They are both updating the aliasing
heap resources at the same time.
We have to use a fence to fix this.
So first let's bring in a fence.
The post-process command will update the fence
so that the shadow commands fragment processing stage can
wait on the fence.
Right? So now the two commands don't execute
at the same time anymore.
So I'm going to show you how to encode this fence update
and fence wait with Swift.
First, we create a fence with a device.
This is a new method -- no arguments.
Next, let's encode the post-processing compute encoder
at the end of the first frame.
We first create a computeCommandEncoder
and encode the dispatches.
But before we end the encoder, we first update the fence
so that subsequent commands can wait
until this command has finished executing.
So in the next frame we would encode the shadow rendering.
So we create a renderCommandEncoder
in commandBufB, which represents the command buffer
for the next frame.
But before drawing the scene, we first encode a fence wait
to wait until the post-processing is completed
on the GPU.
Now, notice this time there are two arguments.
There's a second argument called beforeStages.
Render commands execute in two stages -- vertex and fragment.
So Metal allows you to specify the particular stage that needs
to wait for the fence.
In our example only the fragment stage needs
to access the heap resources, so we specify the fragment stage.
Finally, we can render our shadow maps safely
because we know that this command will only execute
after the previous frame's post-processing is complete.
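A condensed Swift sketch of the update/wait pattern just described; the queue setup and the pass descriptor are placeholders, only the fence calls are the point here:

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let fence = device.makeFence()!

// Frame N: post-processing with compute.
let commandBufA = queue.makeCommandBuffer()!
let compute = commandBufA.makeeComputeCommandEncoder() ?? commandBufA.makeComputeCommandEncoder()!
// ... encode the post-processing dispatches ...
compute.updateFence(fence)          // signal once this command finishes
compute.endEncoding()
commandBufA.commit()

// Frame N+1: shadow rendering must wait on the post-processing.
let shadowPass = MTLRenderPassDescriptor() // attachments configured elsewhere
let commandBufB = queue.makeCommandBuffer()!
let render = commandBufB.makeRenderCommandEncoder(descriptor: shadowPass)!
// Only the fragment stage touches the heap resources, so only it waits.
render.waitForFence(fence, before: .fragment)
// ... encode the shadow draws ...
render.endEncoding()
commandBufB.commit()
```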
Okay. Let me talk about some best practices
for command synchronization.
So you know that if you use heaps, you have to use fences
to synchronize command access.
But you are given this control
because you know you have more knowledge
about how your resources are used
and your application will be more CPU-efficient
than if Metal were to track all of this for you.
For example, textures that are initialized once
and never modified don't even need to be tracked.
And as another example,
resources that are used together can be tracked together
with a single fence.
So let me summarize the main ideas of resource heaps.
Create resources faster with suballocation.
Use your memory budget more efficiently
with resource aliasing.
And synchronize your heap updates
across GPU commands with GPU fences.
Okay. Now I'd like to introduce another new feature available
in iOS and tvOS: Memoryless render targets.
Now, this sounds a little magical,
but I will show you how almost every Metal app can use this
feature to save a significant amount of memory
with a single line of code.
So memoryless render targets are simply textures
that do not allocate any system memory for the texture contents.
Without any memory backing the texture contents,
what remains is the texture's metadata,
such as the texture's dimensions and internal texture format.
Now obviously this is a huge memory savings,
but when can you use a memoryless render target?
You can use them for render pass attachments that are not stored.
Most Metal apps will have some attachments associated
with a store action of don't care or multisample resolve.
And the textures used for those render pass attachments can
be made memoryless.
To make a memoryless render target,
you can simply create the texture as you normally would
with an additional storage mode flag: memoryless.
This feature is supported only on iOS and tvOS
because it relies
on the tile-based rendering architecture
of A7 and later GPUs.
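That one-line change can be sketched like this in Swift; the size and format here are placeholders:

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!

// Create the render target as usual, with one extra storage mode flag.
let desc = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .depth32Float, width: 1920, height: 1080, mipmapped: false)
desc.usage = .renderTarget
desc.storageMode = .memoryless   // the one-line change
let depthTexture = device.makeTexture(descriptor: desc)!
```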
Let me show you how this feature works.
Here on your right we have two render pass attachments --
a color attachment and a depth attachment.
Now, A7 and later GPUs execute render passes one tile
at a time, taking advantage of a fast GPU tile storage
at the heart of the GPU.
The GPU tile storage contains tile-sized representations
of your depth, stencil, and color attachments.
And this tile storage is completely separate
from the texture backing and system memory.
Now, in Metal your load and store actions control how
to initialize the GPU tile storage and whether
to copy the results from the GPU tile storage back
to system memory.
If an attachment is not loaded from memory and it is not stored
to memory, you can make the texture
for that attachment memoryless
to eliminate the memory allocation.
Next, I'll describe some very common scenarios
where you can apply this feature to your app.
Depth attachments are frequently used
to enable depth testing in 3-D scenes.
But the A7 and later GPUs perform depth testing completely
in GPU tile storage one tile at a time.
Depth testing does not need to use system memory.
So if you don't store the depth texture for use in later passes,
make the texture memoryless and save the memory.
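Putting that together, here is a sketch of a depth attachment that never touches system memory; the texture setup is hypothetical:

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!

// Memoryless depth texture (iOS/tvOS); the size is a placeholder.
let depthDesc = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .depth32Float, width: 1920, height: 1080, mipmapped: false)
depthDesc.usage = .renderTarget
depthDesc.storageMode = .memoryless
let depthTexture = device.makeTexture(descriptor: depthDesc)!

// Pair it with load/store actions that never touch system memory.
let passDesc = MTLRenderPassDescriptor()
passDesc.depthAttachment.texture = depthTexture
passDesc.depthAttachment.loadAction = .clear     // no load from memory
passDesc.depthAttachment.clearDepth = 1.0
passDesc.depthAttachment.storeAction = .dontCare // no store to memory
```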
Let me show you another opportunity.
When executing multisample rendering, again,
the A7 and later GPUs perform all the rendering
in GPU tile storage.
The MSAA color attachment texture is only used
if you choose to store the sample data for a later use.
But most apps will choose the multisample resolve store action
which resolves directly from the GPU tile storage
to the resolve color attachment texture.
So in that case make the multisample color attachment
texture memoryless and this is a massive memory savings.
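Here is a sketch of that multisample case, with placeholder sizes and formats; only the MSAA attachment is memoryless, while the resolve target still needs real memory:

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let w = 1920, h = 1080

// 4x MSAA color target, memoryless: rendering happens in tile storage.
let msaaDesc = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .bgra8Unorm, width: w, height: h, mipmapped: false)
msaaDesc.textureType = .type2DMultisample
msaaDesc.sampleCount = 4
msaaDesc.usage = .renderTarget
msaaDesc.storageMode = .memoryless

// The resolve target does need memory: it holds the final image.
let resolveDesc = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .bgra8Unorm, width: w, height: h, mipmapped: false)
resolveDesc.storageMode = .private

let passDesc = MTLRenderPassDescriptor()
passDesc.colorAttachments[0].texture = device.makeTexture(descriptor: msaaDesc)
passDesc.colorAttachments[0].resolveTexture = device.makeTexture(descriptor: resolveDesc)
passDesc.colorAttachments[0].loadAction = .clear
passDesc.colorAttachments[0].storeAction = .multisampleResolve
```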
As you can see, the savings
for adopting this feature are substantial.
By making a 1080p depth texture memoryless,
your app will save almost 8 megabytes.
If you are rendering to the native resolution
of a 12.9-inch iPad Pro,
the savings for the depth buffer is over 20 megabytes.
And the savings for making a four times multisample render
target memoryless are even larger, four times larger.
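As a rough sanity check of those numbers, assuming a 32-bit depth format (4 bytes per pixel):

```swift
// Bytes saved by making a render target memoryless.
func renderTargetBytes(width: Int, height: Int,
                       bytesPerPixel: Int = 4, samples: Int = 1) -> Int {
    return width * height * bytesPerPixel * samples
}

let hd = renderTargetBytes(width: 1920, height: 1080)    // 8,294,400 B, almost 8 MB
let ipad = renderTargetBytes(width: 2732, height: 2048)  // 22,380,544 B, over 20 MB
let msaa = renderTargetBytes(width: 1920, height: 1080,
                             samples: 4)                 // 4x larger again
```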
So use memoryless render targets to make the most
of your application's memory budget.
Use the savings to lower the memory footprint of your game.
Or better yet, use the savings to add more beautiful
and unique content to your game.
Okay. I'd like to invite Jose up to tell you all
about the improvements to the Metal Tools.
Thank you, James.
So outside the great additions
to the Metal API we did some great improvements
to Metal Developer Tools I want to show you.
First we'll talk about what's in Metal System Trace.
Then we'll introduce a new feature called GPU Overrides.
And we have some very exciting new features coming
to GPU Frame Debugger.
So what is Metal System Trace?
In the [inaudible] Metal session we presented a graph showing
you Metal work on both the CPU and GPU.
Metal System Trace is a set of instruments
for visualizing just that,
helping you understand the timeline
of your Metal applications
through the whole graphics pipeline, from the CPU
to the GPU, and then on to the display.
Last year at WWDC we introduced Metal System Trace
for iOS platform.
I highly recommend checking out last year's presentation
for a great overview of Metal System Trace.
Later in the fall we added support for tvOS.
And today we're happy to announce Metal System Trace
for macOS to help you squeeze out the last bit of performance
on all Metal platforms.
We improved Metal System Trace across the board,
extending the events that we report.
We visualize expensive resource operations such as paging data
from system memory to video memory,
like in this case where we can see paging in macOS,
which is causing a delay in GPU execution.
Metal System Trace also displays debug groups,
which make it easier for you
to understand command encoder relations in your trace.
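Debug groups are encoded with the existing push/pop API; a minimal sketch, with an arbitrary label and a blit encoder standing in for any encoder type:

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let commandBuffer = queue.makeCommandBuffer()!
let encoder = commandBuffer.makeBlitCommandEncoder()!

// Work encoded between push and pop is grouped under this label
// in Metal System Trace and the GPU Frame Debugger.
encoder.pushDebugGroup("Shadow maps")
// ... encode the work that belongs to this group ...
encoder.popDebugGroup()
encoder.endEncoding()
```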
On macOS we support tracing multiple GPUs at the same time,
which is invaluable for those use cases
where you're distributing work across different GPUs.
And on iOS we now display scaler workloads
so that you can diagnose when you're introducing latency
by rotating or scaling your views.
You can now use a wider range
of instruments alongside Metal System Trace
such as Time Profiler, File Activity,
Allocations, and many more.
Even different views such as CPU data,
which will show you CPU core time slices.
These will help you to correlate Metal events into context,
deepening the understanding
of how the system is running your application
and allowing you to diagnose things
such as GPU starvation caused by CPU stall due
to a [inaudible] operation.
Metal System Trace captures a wealth of data.
So we made it easier for you to interpret and navigate.
With the new workload highlighting, you can focus
on any command encoder or command buffer
as it works through the pipeline.
And with support for keyboard navigation,
you can quickly move your selection through your trace.
Finally, I want to introduce Performance Observation.
And what Performance Observation does is present you
with a comprehensive list of the potential issues we found
in your trace from analyzing it.
From a display surface taking too long
to unexpected shader compilations,
or high GPU execution times, Performance Observation finds
the events you are looking for,
and you can navigate straight to them
from the Performance Observation list.
All these new additions will allow you
to tune your Metal applications to run as smoothly
as you want them to.
And now for a demonstration
of our awesome GPU debugging improvements,
let me hand over to my colleague, Alp.
I have a number of great features to show you today.
So let's dive right in.
I have my app running here,
cruising over beautiful terrain tessellated to the finest detail.
Wouldn't it be great to see this terrain in wire frame
so we can see the triangles individually?
The good news is our newest feature, GPU Overrides,
gives you the ability to modify your Metal rendering right
from the debug bar while your app is running.
We have a number of different overrides you can mix and match,
including wire frame mode.
Let's switch to wire frame mode
to see how tessellated the terrain is.
Visualizing each triangle, you might want
to tune your tessellation to find the balance
between performance and visual quality.
Normally you'd have to go back
and change your code, recompile, and run.
But with GPU Overrides, you can experiment
with your tessellation scaling right from the Overrides menu.
Let's set scaling to 25%.
Now we have far fewer triangles but lost some
of the interesting details.
Let's try 75%.
I think this looks better.
Let's see it without the wire frame.
Okay. I like this one.
Now, we have fewer triangles than what we started with
but still have all the nice details.
And with the performance gains,
I can add more cool effects to my scene.
So as seen here, GPU Overrides is a great tool to help
with initial diagnosis for some of the visual
and performance problems in your scene.
Next, let's capture the frame to show you some of the features
that will greatly improve your debugging workflow.
The frame capture is done and I am looking
for the terrain resources to see how we are [inaudible].
Let's switch to all GPU objects in Resource Center
where you can see all your textures and buffers.
So we have all of our resources here.
And going over everything one by one
to find terrain resources could take some time.
This is where the new filter bar comes to help.
You can filter by any properties you see here, such as label,
type, size, or details.
Since I labeled all my resources,
I'll just filter by terrain.
And right here I have all the resources used
for rendering the terrain.
Now that I found the terrain patches buffer, what I would
like to do is to see where I'm actually using it.
With a simple drag and drop I can filter the function navigator
to show me all the calls made
to the terrain patches buffer, just like that.
In this case, I see where it is calculated using compute
and where it says [inaudible] while rendering the terrain.
This filter is really powerful.
I can also use any other properties
of the bound resources to filter draw calls.
For example, if you filter by SRGB,
you'll see all the draw calls that are using a texture
with SRGB pixel format.
This is a natural way of navigating
around your frame quickly.
Next, let's move to bound GPU objects
to see how we are using these resources to render the terrain.
In bound mode your resources are grouped
under different sections based on the stage
of the Metal pipeline they are used
in so you know exactly where to look.
Looking at the vertex stage,
terrain patches is a buffer bound to multiple binding points
with different offsets.
Let's use our only buffer [inaudible] to inspect the data.
All the vertex data is displayed nicely
with the layout from the [inaudible] Metal function.
So this is using the exact same struct
as your post-tessellation vertex function.
And we have some color data here.
It recognizes the word color and visualizes the real color
of the value right in there.
Since this is a large buffer that contains different types
of data, I have added some debug markers
with the new addDebugMarker API, which makes it extra easy
to find what you are looking for.
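A sketch of buffer debug markers with MTLBuffer's addDebugMarker; the buffer size, ranges, and labels here are hypothetical:

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let buffer = device.makeBuffer(length: 81_920, options: .storageModeShared)!

// Label sub-ranges of the buffer so the frame debugger
// names each region when you inspect the data.
buffer.addDebugMarker("Terrain vertices", range: 0..<65_536)
buffer.addDebugMarker("Tessellation factors", range: 65_536..<81_920)
```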
With the layout menu, you can jump straight
to any other available layout you would like to inspect.
Looking at individual buffers is great.
What is even better is the new input attribute view
which lets you see all your vertex data
as your vertex shader sees it.
Input attributes collects all the data from your instances,
tessellation factor buffers, and your stage-in data,
then provides you a single view to look at all of it together.
In this case we are rendering instances with multiple patches
and I can see what data belongs to which patch of an instance.
So that was a quick look at some
of our newest GPU Frame Debugger features.
Let's switch back to slides and wrap up.
So you've just seen some
of our newest GPU Frame Debugger features.
I would like to tell you about two more.
With the new Extended Validation mode the GPU Frame Debugger can
perform even deeper analysis of your application,
providing recommendations such as the optimal texture
usage or storage mode for your resources.
You can enable this mode from the Xcode scheme editor.
And the new support
for stand-alone Metal Library Projects lets you create Metal
libraries to be shared across multiple apps,
or to include multiple libraries in a single app,
just like any other framework or library.
So we talked about features
that will greatly improve your tool's experience.
Now let's summarize what we have seen so far in this session.
We have seen the great additions to Metal API with tessellation,
resource heaps and memoryless render targets,
then we showed you improved tools, Metal System Trace
and GPU Frame Debugger.
Be sure to stick around for part two this afternoon
where I will talk about function specialization
and function resource read-writes, wide color
and texture assets, and additions
to Metal performance shaders.
For more information about this session,
please check the link online.
You can catch the video and get links
to documentation and sample code.
We had great sessions yesterday, which are available online.
And this afternoon we have What's New in Metal, Part 2,
then Advanced Metal Shader Optimization in this room.
Thanks for coming, and have a great WWDC.