Learn best practices to maximize the efficiency of your Metal based apps and attain high frame rates. Gain insight into powerful tools for analyzing and optimizing performance for both the CPU and GPU. Discover how to identify bottlenecks, tune performance hot-spots, and overcome any hurdles that could keep your app from reaching its potential.
PHILIP BENNETT: Good morning,
and welcome to Metal Performance Optimization Techniques.
I'm Phil Bennett of the GPU Software Performance Group,
and I will be joined shortly by our special guest Serhat Tekin
from the GPU Software Developer Technologies Group
and he will be giving a demo of a great new tool you can use
to profile your Metal apps.
I'm sure you're going to love it.
So, Metal at WWDC, the story so far.
In What's New in Metal Part 1, we covered great new features
that have been added to Metal as of iOS 9 and OS X El Capitan.
In What's New in Metal Part 2,
we introduced two new frameworks, MetalKit
and Metal Performance Shaders.
These make developing Metal apps even easier.
In this our final session,
we will be reviewing what tools are available for debugging
and profiling your Metal apps and we're going
to explore some best practices
for getting optimal performance from your Metal apps.
So let's take a look at the tools.
Now, if you have been doing any Metal app development in iOS,
you are likely to be familiar with Xcode
and its suite of Metal tools.
Now, we are going to take a quick look
at the frame debugger.
So what we have here is a capture of a single frame
from a Metal app, and on the left,
we have the frame navigator which shows all of the states
and Draw calls present in the frame.
These are grouped by render encoder, command buffer,
and if you have been using debug labels,
they will be grouped by debug groups also.
Next we have the render attachment viewer,
which shows all of the color attachments associated
with the current render pass in addition to any depth
and stencil attachments, and it shows this wire frame highlight
of the current Draw call,
which makes navigating your frame very convenient.
Next we have the resource inspector
where you can inspect all of the resources used by your app,
from buffers to textures and render attachments.
You can view all the different formats,
you can view individual mipmap levels, cube maps,
2D arrays, it's fully featured.
And then we have the state inspector, which allows you
to inspect properties of all of the Metal objects in your app.
Moving on, we have the GPU report,
which gives you a frames per second measurement
of the current frame and gives you timings for CPU and GPU.
In addition, it also shows the most expensive render
and compute encoders in your frame so you can help narrow
down which shaders and which Draw calls are the most expensive.
And finally, we have the shader profiler and editor.
And this is a wonderful tool for both debugging
and profiling your shaders as it allows you to tweak your shaders
and recompile them on the fly, thus saving you having
to recompile your app.
It's really useful.
And as you probably are aware now,
all of these great tools are now available
for debugging your Metal apps on OS X El Capitan.
So Instruments is a great companion to Xcode
as it allows you to profile your app's performance
across the entire system, and now we are enabling you
to profile Metal performance in a similar manner with this,
the Metal System Trace instrument.
It's a brand-new tool for iOS 9.
It allows you to profile your Metal apps
across your CPU and GPU.
Let's take a look here.
We can start by profiling Metal API usage in the application,
down to the driver, right onto the GPU
where we can see the individual processing phases,
vertex, fragment, and optionally compute,
and then onto the actual display hardware.
Now, here to give us a demonstration
of this great new tool,
please welcome Serhat Tekin to the stage.
SERHAT TEKIN: Thank you, Philip, and hello, everyone.
I have something really cool to show you today,
and it's brand new, it's our latest addition
to our Metal development tools, Metal System Trace.
Metal System Trace is a performance analysis
and tracing tool for your Metal iOS apps and is available
as part of Instruments.
It lets you get a system-wide overview of your application
over time also giving you an in-depth look at the graphics
down to the microsecond level.
It's important that I should stress this.
This is available for the first time ever on our platform.
This is all thanks to Xcode 7 and iOS 9.
So without further ado, let's go ahead and give it a shot.
So I'm going to launch Instruments,
and we are at the template chooser.
You can notice that we have a new template icon here,
Metal icon for Metal System Trace.
I will go ahead and choose that.
Those of you familiar
with Instruments will realize I just created a new document
with four instruments in it, as you can see
on the left-hand side of the timeline here.
I will give you a quick tour of these instruments and the data
that they present on the timeline.
So let's go ahead and select my Metal app on the iPad
as my target app and start recording.
Now, Metal System Trace is set to record
in an Instruments mode called Windowed Mode.
It's essentially capturing the trace into a ring buffer.
This lets you record indefinitely.
And the important point here is that when you see a problem
that you want to investigate, you can stop recording.
At that point, Instruments will gather all
of the trace data collected, process it for a while,
and you will end up with a timeline that looks like this.
So there is quite a lot of stuff going on here, so I will zoom
in to get a better look.
I can do that by holding down the Option key
and selecting an area of interest in the timeline
that I want to zoom in.
I can navigate the timeline using the trackpad gestures,
two-finger swipe to scroll and pinch to zoom.
And you can see that I get more detail
on the timeline as I zoom further in.
So what are we looking at here?
Essentially what we have here is an in-depth look
of your Metal application's graphics workload over time
across all of the layers of the graphics stack.
The different colors that we go
through in the timeline represent different workloads
for individual frames.
And the tracks themselves are fairly intuitive.
Each box you see here represents an item's trace-relative start
time, end time, and how long it took.
Starting from the top and working our way down,
we have your application's usage of the Metal framework.
Next, we have the graphics driver processing your command
buffers, and if you have any shader compilation activity
midframe, it also shows up in the track.
This is followed by the GPU hardware track,
which shows your Render
and Compute commands executing on the GPU.
And finally we have the display surfaces track.
Essentially, this is your frame getting displayed on the device.
So another thing you can see here is these labels.
Now, note that these two labels here, shadow buffer and G-buffer
and lighting, are labels I assigned myself to my encoders
in my Metal code using the encoder's Label property.
These labels propagate their way down the pipeline along
with the workload they are associated with,
which makes it very easy
to track your scenes rendering passes here
in Metal System Trace.
I highly recommend taking advantage of this.
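For reference, assigning those labels in code is a one-liner per object; here is a minimal sketch in current Swift (the command queue and pass descriptors are assumed to exist already, and the label strings are just the ones used in this demo):

```swift
import Metal

// Assumes `commandQueue`, `shadowPassDescriptor`, and `scenePassDescriptor`
// were created elsewhere.
let commandBuffer = commandQueue.makeCommandBuffer()!
commandBuffer.label = "Frame Command Buffer"

let shadowEncoder = commandBuffer.makeRenderCommandEncoder(descriptor: shadowPassDescriptor)!
shadowEncoder.label = "Shadow Buffer"          // shows up on the trace timeline
// ... encode shadow draws ...
shadowEncoder.endEncoding()

let sceneEncoder = commandBuffer.makeRenderCommandEncoder(descriptor: scenePassDescriptor)!
sceneEncoder.label = "G-Buffer and Lighting"   // propagates down the pipeline
// ... encode scene draws ...
sceneEncoder.endEncoding()
```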
And if anything is too small to fit its label,
you can always hover over the ruler and see a tool tip
that displays both the label and the duration at the same time.
The order of the tracks here basically maps
to the same order your Metal commands would work their way
down the graphics pipeline.
So let us go ahead and follow this command buffer
down the pipe.
So at the top track I can see my application's use
of Metal command buffers and encoders,
specifically what I see here is the creation time
and submission time for both my command buffers
and render and compute encoders.
At the top I have my command buffer,
and at the bottom I have my relevant encoders created
by this command buffer directly nested underneath.
Now, note this arrow here at the submission time
of the command buffer going to the next track.
Dependencies between different levels
of the pipeline are represented by these arrows
in Metal System Trace.
So, for instance, when this command buffer is submitted,
its next stop is going to be the graphics display driver,
if I can zoom in there and get a better look.
Look at how little time we are taking here.
It's really, really fast, and we are still
on the CPU side, barely consuming anything.
Similarly, the encoders are going to get submitted
to the GPU track.
Following the arrows the same way,
I can see my encoders getting processed on my GPU.
This GPU track is separated into three different lanes,
one for vertex processing,
one for fragment, and one for compute.
So, for instance, here I can see the rendering commands
for my shadow buffer pass going
through its vertex processing phase and moving
on to the fragment phase, which happens to overlap
with my G-buffer and lighting phase as well.
Something that is desirable.
A quick note here is that the vertex, fragment, and compute
processing costs include more than just the shader processing time.
For instance, we are running on iOS,
and it's a tile-based deferred architecture,
so the vertex processing cost is going
to include the tiling cost as well.
It's something to keep in mind.
Finally, once my frame is done rendering, the surface is going
to end up on the display, which is shown
in the track at the bottom.
Essentially, it's showing me what time my frame was swapped
onto the display and how long it stayed there.
Underneath that, we have the VSync track,
which shows us the VSync intervals separated
by these spikes that correspond to individual VSync events.
Finally, at the bottom, we have our detail view.
The detail view is similar
to what you would see in other instruments.
It offers contextual detail based
on the instrument you have selected.
For instance right now,
I have the Metal application instrument selected,
so I can go ahead and expand this to see all of my frames
and all of the command buffers and encoders along
with the hierarchy involved.
This view is useful if you want to see, say, precise timings,
such as the precise creation and submission timings
in the encoder list,
or what process something originated from.
It's very useful.
Cool! So this timeline look
at the graphics pipeline is an incredibly powerful tool.
It's available for the first time with iOS 9 and Metal.
So how do you use this to help you solve your problems?
Or how does a problem app look?
Let me go ahead and open a different trace
to show you that.
In a couple of minutes, Philip will go into a lot more detail
than I will about Metal performance
and how you can use this tool for that purpose.
But I'm going to give you a quick overview
of the tool's workflow and a quick couple of tips.
First and foremost, you need to be concerned
about your CPU and GPU parallelism.
You can see that this trace that I opened,
labeled Problem Run appropriately,
is already sparser than the last trace we took.
This is because we have a number of sync points
where the CPU is actually waiting on the GPU.
You need to make sure you eliminate these.
Also, another useful thing to look for is the pattern
that you see on the timeline.
These frames are all part of the same scene, so they are going
to have really high temporal locality.
Any divergence you see might point
at a problem you should investigate.
Another important thing is the display surfaces track.
So ideally, if your frame rate target is 60 frames per second,
these surfaces should be staying on display
for a single VSync interval.
So we should be seeing surfaces getting swapped
at every VSync interval.
This particular frame, for instance, stayed on for three,
so we are running at 20 fps.
Another thing that is pretty useful is the shader compilation track,
which directly shows you if the shader compiler is kicking
in at any time during your trace.
One thing that you want
to particularly avoid is submitting work
to the shader compiler midframe because it's going
to waste CPU cycles you can use on other things.
Phil will explain this in a couple more minutes in detail.
Finally, you should aim to profile early and often.
A workflow like this will help you figure out problems
as they occur and make it easier to fix them.
And Xcode helps you with that by offering a profile launch option
for your build products.
It's going to automatically build a release version
of your app, install it on the device,
and start an Instruments run with a template of your choice.
So that was our first look at Metal System Trace,
available for all of your Metal-capable iOS devices.
Please give it a try.
We are looking forward to your feedback and suggestions.
Now, I will leave the stage back to Phil,
who will demonstrate a couple of key Metal performance issues
and how you can use our tools to identify these.
PHILIP BENNETT: Thank you, Serhat,
that was very informative.
Now, we are going to cover the aforementioned Metal performance
best practices, and we are going to use the tools
to see how we can diagnose
and hopefully follow these best practices.
So let me introduce our sample app, or rather a system trace
of our sample app, and immediately we can see
that there are several performance issues.
To begin with, there is no parallelism
between the CPU and the GPU.
These are incredibly powerful devices,
and the only way you are going
to obtain the maximum performance is
by having them run independently,
whereas here each seems to be waiting on the other.
So we can see there is a massive stall
between processing frames on the CPU.
There is a whopping 22 milliseconds.
We shouldn't have any stalls.
What's going on there?
And if we look at the actual active period of the CPU,
it exceeds our frame deadline.
We were hoping for 60 frames per second.
So we had to get everything done within 16 milliseconds.
And we have blown past that.
And things don't look much better on the GPU side, either.
There is a lengthy stall in proportion to the one on the CPU
because the CPU has been spending all its time doing
nothing of note and hasn't been able to queue
up work for the next frame.
Furthermore, the active GPU period overshoots the frame
deadline, and we are shooting for 60 frames per second,
but it looks like we are only getting 20.
So what can we do about this?
Well, let's go back to basics.
Let's first examine one of the key principles
of Metal design and performance.
And that's creating your expensive objects
in state upfront.
Now, in a legacy app, typically what would happen would be
during content loading, the app would compile all of its shaders
from source, and that could be dozens or even hundreds of them,
and this is a rather time-consuming operation.
Now, this is only half of the shader compilation story
because the shaders themselves need to be compiled
into a GPU pipeline state in combination
with the various state used.
So what some apps might attempt to do is
to do something known as prewarming.
Now, normally the device compilation would occur
when the shaders and states were first used in a Draw call.
That's bad news.
Imagine you have a racing game and suddenly you turn a corner
and it draws in a lot of new objects
and the frame rate drops.
That's really bad.
So what prewarming does is you issue a load of dummy Draw calls
with various combinations of graphics states and shaders
in the hope that the driver will compile the relevant GPU
pipeline states ahead of time.
So when the time comes
to actually draw using this combination state and shaders,
everything is ready to go and you don't get a frame rate drop.
Now, in the actual rendering loop,
there would typically be your setting of states,
and if you actually get around to any,
maybe you will do some Draw calls as well.
So the Metal approach is to move the expensive stuff ahead
of time. Shaders can be compiled from source offline.
That's already saving a chunk of work.
We move state definition ahead of time.
You define your state.
The GPU pipeline state is compiled
into these state objects.
So when you come to actually do the Draw calls, there is none
of that device compilation nonsense, so there is no need
for shader prewarming anymore.
It's a thing of the past.
That leaves the rendering loop free for Draw calls.
Loads of Draw calls.
Metal facilitates upfront state definition
by decoupling expensive state validation and compilation
from the Draw commands, thus allowing you to pull this
out of the rendering loop and keep the rendering loop
for actual Draw calls.
Now, the expensive-to-create state is encapsulated
in these immutable state objects, and the intention is
that you will create these once and reuse them many times.
Now, getting back to our sample app,
here we see there is some shader compilation going on midframe,
and we are wasting about a millisecond here.
That's no good at all.
And if we look at the Xcode's frame debugger,
look at all of this happening in a single frame.
Look at all of these objects.
We don't want any of this.
All that you should be seeing is this, the creation
of the command buffer for the frame and the acquisition
of the drawable and its texture.
All of the rest is completely superfluous.
So let's cover these expensive objects
and when you should create them.
And we are going to begin with shader libraries.
These are your library of compiled shaders.
Now, what you really want to do is compile all of them offline.
You can use Xcode; any Metal source files
in your project will automatically be compiled
into the default library.
Now, your app may have its own custom content pipeline,
and you might not necessarily want to use this approach.
So for that, we provide command-line tools,
which you can integrate into your pipeline.
If you absolutely cannot avoid compiling your shaders
from source at runtime, the best you can do is create
the library asynchronously. In the meantime, your app,
or rather, the calling thread,
can get on with doing something else,
and once the shader library has been created,
your app will be asynchronously notified.
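As a rough sketch of that asynchronous path (assuming `device` is your MTLDevice, `shaderSource` holds Metal source loaded by your content pipeline, and `shaderLibrary` is a hypothetical property to stash the result in):

```swift
import Metal

// Kick off compilation without blocking the calling thread.
device.makeLibrary(source: shaderSource, options: nil) { library, error in
    guard let library = library else {
        print("Shader compilation failed: \(String(describing: error))")
        return
    }
    // Asynchronous notification: keep the library for later pipeline creation.
    self.shaderLibrary = library
}
// The calling thread continues with other work here.
```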
Now, one of the first objects you will be creating
in your app will be the device and command queue.
And these represent the GPU you will be using and its queue
of ordered command buffers.
Now, as we said, you want
to create these during app initialization
and because they are expensive to create,
you want to reuse them throughout the lifetime
of your app.
And, of course, you want to create one per GPU used.
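In Swift, that amounts to something like this, done once at launch:

```swift
import Metal

// One device and one command queue per GPU, created at app initialization
// and reused for the lifetime of the app.
guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("Metal is not supported on this device")
}
let commandQueue = device.makeCommandQueue()!
```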
Now, next is the interesting stuff, the render
and compute pipeline state, which encapsulates all
of the programmable GPU pipeline states,
so it takes all the descriptors, your vertex formats,
render buffer formats, and compiles it
down to the actual raw pipeline state.
Now, as this is an expensive operation,
you should be creating these pipeline objects
when you load your content, and you should aim
to reuse them as often as you can.
Now, as with the libraries,
you can also create these asynchronously.
Once created, your app will be notified
by a completion handler.
One point to mention is that unless you actually need it,
you shouldn't obtain the reflection data
as this is an expensive operation.
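A sketch of that content-load-time pipeline creation, using hypothetical shader function names and skipping the reflection data:

```swift
import Metal

// Assumes `device` and a compiled `library` already exist.
let descriptor = MTLRenderPipelineDescriptor()
descriptor.vertexFunction = library.makeFunction(name: "vertexMain")     // hypothetical name
descriptor.fragmentFunction = library.makeFunction(name: "fragmentMain") // hypothetical name
descriptor.colorAttachments[0].pixelFormat = .bgra8Unorm
descriptor.depthAttachmentPixelFormat = .depth32Float

// Asynchronous creation: the completion handler fires when the expensive
// compilation is done, and no reflection data is requested.
device.makeRenderPipelineState(descriptor: descriptor) { pipelineState, error in
    guard let pipelineState = pipelineState else {
        fatalError("Pipeline creation failed: \(String(describing: error))")
    }
    self.pipelineState = pipelineState // created once, reused every frame
}
```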
So next we have the depth stencil and sampler states.
These are the fixed-function GPU pipeline states,
and you should be creating these when you load your content along
with the other pipeline states.
Now, you may end up with many, many depth stencil
and sampler states, but you needn't worry about this
because some Metal implementations will internally
hash the states, so you won't end up
with loads of duplicates.
Now, next we have the actual data consumed by the GPU.
You have got your textures and your buffers.
And you should, once again, be creating these
when you load your content, and reuse them as often as possible,
because there is an overhead associated with both allocating
and deallocating these resources.
And even dynamic resources, you might not be able
to fully initialize them ahead of time, but you should
at least create the underlying storage.
And we are going to be covering more on that very soon.
So to briefly recap.
So the most expensive states obviously should be created
ahead of time, so these are the shader libraries
that you aim to build offline.
The device and the command queue, which are created
when you initialize your app, the render
and compute pipeline states,
created when you load your content,
as are the fixed function pipeline state,
the depth stencil and sampler states,
and then finally the textures and buffers
that are used by your app.
So we went ahead and we applied this best practice
to our example app, which you may remember looked like this.
We had some shader compilation occurring midframe every frame,
and now we have got none.
So already we have saved about a millisecond of CPU time.
This is a good start, but we will see
if we can do better soon.
So in summary, create your expensive state and objects
up front and aim to reuse them.
In particular, compile your shader source offline, and you want
to keep the rendering loop for what it's intended for.
It's for Draw calls.
Get rid of all of the object creation.
Now, what about the resources you can't entirely create
up front? We are talking about these dynamic resources,
so what do we do about them?
How can we efficiently create and manage them?
Now, by dynamic resources, we are talking
about resources which, once created, may be modified many,
many times by the CPU.
And a good example of this is buffers of shader constants,
and also any dynamic vertex and index buffers you might have
for things like particles generated on the CPU,
in addition to dynamic textures,
perhaps your app has some textures which it modifies
in the CPU between frames.
So ideally given the choice,
you would put these resources somewhere which is efficient
for both the CPU and the GPU to access.
And you do this with the shared storage mode option
when you create your resource.
And this creates resources in memory shared
by both the CPU and the GPU.
Now, this is actually the default storage mode on iOS,
iOS devices being unified memory architecture,
so the same memory is shared between the CPU and GPU.
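Creating and updating such a shared resource might look like this (the `Uniforms` struct here is just a stand-in for your per-frame constants):

```swift
import Metal

struct Uniforms {          // stand-in for your per-frame shader constants
    var time: Float
    var brightness: Float
}

// Assumes `device` exists. .storageModeShared puts the buffer in memory
// visible to both the CPU and the GPU (the default on iOS).
let constantsBuffer = device.makeBuffer(length: MemoryLayout<Uniforms>.stride,
                                        options: .storageModeShared)!

// The CPU writes straight through a pointer, with no copies involved.
let pointer = constantsBuffer.contents().bindMemory(to: Uniforms.self, capacity: 1)
pointer.pointee = Uniforms(time: 0.016, brightness: 1.0)
```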
Now, the thing about these shared resources is the CPU has
completely unsynchronized access to them.
It can modify the data as freely as it wants through a pointer.
And in fact, it's quite easy for the CPU to stomp all
over the data which is in use by the GPU,
which tends to be pretty catastrophic.
So we want to avoid that.
But how can we achieve this?
Well, the brute force approach would be to have a single buffer
for the resource, where we have, say, a buffer of constants
which are updated on the CPU and consumed later by the GPU.
Now, if the CPU wants to modify any of the data
in the constants buffer, it has to wait
until the GPU is finished with it.
And the only way it can know that is if it waits
for the command buffer in which the resource is referenced
to finish processing on the GPU.
And for that, in this case we use Wait Until Completed.
So we wait around, rather the CPU waits around,
until the GPU is finished processing
and then it can go ahead and modify the buffer,
which is consumed by the GPU in the next frame.
Now, this is really bad because not only is the CPU stalled
but the GPU is stalled as well because the CPU hasn't had time
to queue up work for the next frame.
This is what is happening in the example app.
The CPU is waiting around for the GPU to finish on each frame.
You are introducing a massive stall period, and, yes,
there is no parallelism between the CPU and the GPU.
So we need a better approach clearly,
and you might be tempted to just create new buffers every frame
as you need them.
But as we learned in the previous section,
that's not a particularly good idea
because there is an overhead associated
with creating each buffer.
And if you have many buffers, large buffers, this will add up,
so you really don't want to be doing this.
What you should do instead is employ a buffer scheme.
Here we have a triple buffering scheme,
where we have three buffers, which are updated on the CPU
and then consumed by the GPU.
Typically we suggest that you limit the number
of command buffers in flight to three, and effectively,
you have one buffer per command buffer.
And by employing a semaphore to prevent the CPU
from getting too far ahead of the GPU, we can ensure
that it's safe to update the buffers on the CPU
when the GPU wraps around, when it goes back
to reading the first buffer.
Rather than bore you with a lot of sample code,
I will point you straight at a great example we already have.
That is the Metal Uniform Streaming example,
which shows you exactly how to do this.
So I recommend you check it out afterward if you are interested.
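For context, the core of that scheme boils down to a few lines; this is a condensed sketch in the spirit of that sample, not the sample itself (`commandQueue` and `constantBuffers` are assumed to exist):

```swift
import Metal

// `constantBuffers` is an array of three MTLBuffers created once at load
// time, one per frame in flight.
let inflightBufferCount = 3
let inflightSemaphore = DispatchSemaphore(value: inflightBufferCount)
var bufferIndex = 0

func drawFrame() {
    // Blocks only if the CPU is a full three frames ahead of the GPU.
    inflightSemaphore.wait()

    let constants = constantBuffers[bufferIndex]
    // ... safe to update `constants` on the CPU: the GPU has finished with it ...

    let commandBuffer = commandQueue.makeCommandBuffer()!
    // ... encode the frame using `constants` ...
    commandBuffer.addCompletedHandler { _ in
        inflightSemaphore.signal()   // GPU is done with this frame's buffer
    }
    commandBuffer.commit()

    bufferIndex = (bufferIndex + 1) % inflightBufferCount
}
```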
Getting back to our example app,
you may remember we had these very performance-crippling
waits between each frame on the CPU.
Now, after employing a buffering scheme to update dynamic data,
we managed to greatly reduce the gap between processing
on both the CPU and the GPU.
We still have some sort of synchronization issue,
but we are going to look into that very shortly.
So we are making good progress already.
And in summary, you want to buffer
up your dynamic shared resources
because it's the most efficient way of updating these
between frames, and you enforce safety via the limit
on command buffers in flight that I mentioned.
Now, I'm going to talk about something
or rather the one thing you don't actually want to do
up front, and that relates
to when you acquire your app's drawable surface.
Now, the drawable surface is your app's window on the world,
it's what your app renders its visible content into,
which is either displayed directly on the display
or it may be part of a composition pipeline.
Now, you retrieve the drawables from the Metal layer
of Core Animation, but there is only a limited number
of these drawables because they are actually quite big,
and we don't want to keep loads of them around nor do we want
to be allocating them whenever we need them.
So the number of these drawables is kept very limited,
and drawables are relinquished
at display intervals once they have been displayed
by the hardware.
And each stage of the display pipeline may actually be holding
onto a drawable at any point from your app, to a GPU,
to Core Animation if you have any compositing,
to the actual display hardware.
Now, your app grabs a drawable surface typically
by calling the next drawable method.
If you are using MetalKit, this will be performed
when you call Current Render Pass Descriptor.
Now, the method will only return once a drawable is available,
and if there happens to be a drawable available at the time,
it will return immediately.
Great, you can go on and continue with the frame.
However, if there are none available, your app,
or rather the calling thread, will be blocked
until at least the next display interval waiting for a drawable.
This can be a long time.
At 60 frames per second,
we are talking 16 milliseconds.
So that's very bad news.
So is this what our example app was doing?
Is this the explanation for these huge gaps in execution?
Well, let's see what Xcode says.
So we go to Xcode and take a look
at the frame navigator here.
And Xcode seems to have a problem
with our shadow buffer encoder.
See a little warning there.
So if we take a closer look,
we see that indeed we are actually calling the next
drawable method earlier than we should do.
And Xcode offers some very sage advice:
we should only call it when we actually need the drawable.
So how does this fit in with our example app?
Well, we have several passes here in our example app,
and we were acquiring the drawable right at the start
of each frame before the shadow pass.
This is far too early, because right up until the last pass,
we are drawing everything off screen,
and we don't need a drawable right up until we come
to render the UI pass.
So the best place to acquire the next drawable is naturally right
before the UI pass.
So we went ahead and we made the change, we moved our call
to next drawable later, and let's see
if that solved our problem.
Well, as you can already see, yes, it did!
We removed our second synchronization point,
and now we don't have any stalls between
frame processing on the CPU.
That's a massive improvement.
So the advice is very simple: only acquire the drawable
when you actually need it.
This is before the render pass in which it's actually used.
This will ensure that you hide any long latency
that would occur if there weren't any drawables available.
So your app can continue to do useful work,
and by the time it actually needs a drawable,
one is likely to be available.
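Structurally, a frame following this advice might look like the sketch below (assuming `metalLayer` is the app's CAMetalLayer, `commandBuffer` already exists, and this runs inside the draw routine):

```swift
import Metal
import QuartzCore

// Shadow, G-buffer, and lighting passes all render off screen first...
// ... encode the off-screen passes here ...

// ...and only now, right before the final on-screen pass, ask for a drawable.
guard let drawable = metalLayer.nextDrawable() else { return }

let uiPassDescriptor = MTLRenderPassDescriptor()
uiPassDescriptor.colorAttachments[0].texture = drawable.texture
uiPassDescriptor.colorAttachments[0].loadAction = .clear
uiPassDescriptor.colorAttachments[0].storeAction = .store

let uiEncoder = commandBuffer.makeRenderCommandEncoder(descriptor: uiPassDescriptor)!
uiEncoder.label = "UI"
// ... encode the UI draws ...
uiEncoder.endEncoding()

commandBuffer.present(drawable)
commandBuffer.commit()
```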
So at this point we are doing pretty well so far.
But there is still room for improvement.
So why don't we look at the efficiency
of the GPU side? Rather than diving to a very low level, say,
trying to optimize our shaders or changing texture formats,
why don't we see
if there is any general advice we can apply?
As it so happens, there is.
That relates to how we use Render Command Encoders.
Now, a Render Command Encoder is what is used
to generate Draw commands for a single rendering pass.
And a single rendering pass operates on a fixed set
of color attachments, and depth and stencil attachments.
Once you begin the pass, you cannot change these attachments.
However, you can change the actions acting on them,
such as the depth stencil state, color masking
and blending, for instance.
And this is valuable to remember.
Now, the way in which we use our render encoders is particularly
important on the iOS device GPUs due to the interesting way
in which they are architected.
They are tile-based deferred renderers.
So each Render Command Encoder results in two GPU passes.
First you have the vertex phase, which transforms all
of the geometry in your encoder, and then performs clipping
and culling, and then bins all of the geometry
into screen space tiles.
This is followed by the fragment phase, which processes all
of the objects tile by tile to determine
which objects are visible,
and then only the visible pixels are actually processed.
And all of the fragment processing occurs
in these fast on-chip tile buffers.
Now, typically at the end of a render you only need
to store out the color buffer.
You would just discard the depth buffer.
And even sometimes you may have, say, multiple color attachments,
but you only need to store one of them.
By not storing the tile data in each pass,
you are saving quite a bit of bandwidth.
You are avoiding writing out entire frame buffers.
This is important for performance, as is not having
to load in data each tile.
So what can Xcode tell us?
Well, as I mentioned,
each encoder corresponds
to a vertex pass and a fragment pass.
And this applies even for empty encoders,
and this is quite important.
Here we have actually two G-buffer encoders,
and the first one doesn't seem to be drawing anything.
I guess that just slipped in there by mistake,
but this actually has quite an impact on performance if we look
at the system trace of the app.
Just that empty encoder consumed 2.8 milliseconds on the GPU,
and presumably it was just writing a clear color
out to however many attachments we had, three color
and two depth and stencil.
And our total GPU processing time
for this particular frame is 22 milliseconds.
Now, if we remove the empty encoder,
which is done very easily because it shouldn't be there
in the first place, we go down to 19, so that's a very nice win
for doing very little at all.
So watch out for these empty encoders.
If you are not going to do any drawing
in a pass, don't start encoding.
So let's look a bit deeper now.
Let's have a look at the render passes in our example app
and see what we have got.
So we have got a shadow pass,
which renders into a depth buffer.
We have a G-buffer pass, which renders
into three color attachments and a depth and stencil attachment,
and then we have these three lighting passes,
which use the render attachment data from the G-buffer pass,
either by sampling it through the texture units or by reading
the frame buffer content directly.
The lighting passes use this data to perform lighting
and output to a single accumulation target,
which is used several times over.
And finally you have a user interface pass
onto which user interface elements are drawn
and presented to the screen.
So is this the most efficient setup of encoders?
Once again we summon Xcode's frame debugger to see
if it has anything to say.
And once again, yes it does.
It has taken issue with our sunlight encoder.
So let's take a closer look.
We are inefficiently using our command encoders.
And Xcode is kind enough to tell us
which ones we could actually combine.
So let's go ahead and merge a couple of passes.
Rather than merge just two, we can actually merge three,
which all operate on the same color attachment.
So let's go ahead and do that.
So we have six passes here, and now we are going
to merge them down to four.
So what impact did that have on performance, GPU side?
Let's go back to the GPU, the system trace.
Here we can see we have gone from 21 milliseconds
with six passes down to 18 by not having to load and store all
of that attachment data.
So that's quite a nice win.
But could we go any further?
Let's return to our app.
So we have four passes, and is it actually possible
to combine both the G-buffer and the lighting pass to avoid having
to store out five attachments and keep everything on chip?
Well, it in fact is.
We can do that with clever use of programmable blending.
So I'm not going to go into too much detail there,
but what we did was we combined these two encoders down to one.
So now we are left with three render encoders
and we are having to load and store far,
far less attachment data, and that's a massive win
in terms of bandwidth.
So let's see what impact that had.
Actually not a lot.
That was very unexpected.
We have only chopped off about a millisecond.
That's not great.
I was hoping for more than that.
So once again, can Xcode save us?
We turn to Xcode's frame debugger.
And we take a closer look at the load and store bandwidth
for the G-buffer encoder.
Now, it turns out that we are actually still loading
and storing quite a lot of data, and the reason
for that is quite simple.
It looks like here we have set our load
and store actions for each attachment incorrectly.
We only wanted to be storing the first color attachment,
and we want to discard the remaining color attachments
in addition to the depth and stencil attachments,
and we certainly don't want to be loading them in.
So if we make a very simple change and set our load
and store actions to something more appropriate,
we have reduced our load bandwidth down to zero
and we have massively reduced the amounts
of attachment data we're storing.
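Concretely, the fix is just a few lines on the render pass descriptor. This is a sketch of the corrected G-buffer setup; the attachment count and the texture variables are assumptions for illustration:

```swift
import Metal

let desc = MTLRenderPassDescriptor()

// The one attachment we actually need after the pass: store it.
desc.colorAttachments[0].texture = accumulationTexture   // assumed app texture
desc.colorAttachments[0].loadAction = .clear             // no load of stale data
desc.colorAttachments[0].storeAction = .store

// Intermediate G-buffer attachments: consumed on-chip, never needed later.
for index in 1...2 {
    desc.colorAttachments[index].texture = gBufferTextures[index]  // assumed
    desc.colorAttachments[index].loadAction = .clear
    desc.colorAttachments[index].storeAction = .dontCare
}

// Depth and stencil are only needed during the pass: discard them.
desc.depthAttachment.loadAction = .clear
desc.depthAttachment.storeAction = .dontCare
desc.stencilAttachment.loadAction = .clear
desc.stencilAttachment.storeAction = .dontCare
```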
So now, what impact did that have?
So before, with our three passes,
we are taking 17 milliseconds on the GPU.
Now, we are down to 14.
That's more like it.
So to summarize, don't waste your render encoders.
Try to do as much useful work as possible in them,
and definitely do not start encoding
if you are not going to draw anything.
And if you can, and with the help of Xcode, merge encoders
which are rendering to the same attachments.
This will get you big wins.
Now, we are doing pretty well on the GPU side now.
In fact, we are actually within our frame budget.
But is there anything we can do on the CPU side?
If you remember, on the CPU we were actually still slightly beyond
our frame budget.
What about multithreading?
How could multithreading help us?
What does Metal allow us to do in terms of multithreading?
Fortunately for us, Metal was designed with multithreading
in mind and has a very efficient threadsafe and scalable means
of multithreading your rendering.
It allows you to encode multiple command buffers simultaneously
on different threads, and your app has control over the order
in which these are executed.
Let's take a look at a possible scenario
where we might attempt some multithreading.
But before that, I would like to stress
that before you even go ahead and try
to multithread your rendering,
you should actively pursue the best possible
single-threaded performance.
So make sure there is nothing terribly inefficient
in there before you start trying to multithread things.
Okay. So we have an example here where we have two render passes,
and we are actually taking so long to encode these two passes
on the CPU that we are actually missing our frame deadline.
So how can we improve this?
Well, we can go ahead and we can encode the two passes
simultaneously on separate threads.
And not only have we managed to reduce the CPU time per frame,
the side effect is that the first render pass can be
submitted to the GPU quicker.
So how would this look in terms of Metal objects?
How does it come together?
We start with our Metal device and the command queue
as usual, and now for this example we are going
to have three threads.
And for each thread, you need a command buffer.
Now, two of the threads each have a Render Command Encoder
which is operating on a separate pass,
and on our third thread we might have multiple encoders
executing serially.
So it goes to show the approaches
to multithreading can be quite flexible,
and once they have all finished their encoding,
the command buffers are submitted to the command queue.
So how would you set this up?
It's quite simple.
You create one command buffer per thread and you go ahead
and initialize render passes as usual,
and now the important point here is the order
in which the command buffers will be submitted to the GPU.
Chances are this is important to you.
So you enforce it by calling the Enqueue method
on the command buffers, and that reserves a place
in the command queue so when the buffers are eventually
committed, they will be executed in the order
that they were enqueued.
This is an important point to remember.
Then we create the render encoders for each thread,
and we go ahead and encode our draws on the separate threads
and then commit the command buffers.
It's really very simple to do.
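The steps above can be sketched as follows. This assumes a command queue, the pass descriptors, and the per-pass encode functions already exist in the app, and it naturally needs a Metal device to run:

```swift
import Metal
import Dispatch

// One command buffer per thread.
let shadowBuffer = queue.makeCommandBuffer()!
let gBufferBuffer = queue.makeCommandBuffer()!

// Reserve execution order on the queue up front, before encoding starts.
shadowBuffer.enqueue()
gBufferBuffer.enqueue()

let group = DispatchGroup()
DispatchQueue.global().async(group: group) {
    let encoder = shadowBuffer.makeRenderCommandEncoder(descriptor: shadowPassDescriptor)!
    encodeShadowDraws(encoder)   // assumed app function
    encoder.endEncoding()
    shadowBuffer.commit()        // order was already fixed by enqueue()
}
DispatchQueue.global().async(group: group) {
    let encoder = gBufferBuffer.makeRenderCommandEncoder(descriptor: gBufferPassDescriptor)!
    encodeGBufferDraws(encoder)  // assumed app function
    encoder.endEncoding()
    gBufferBuffer.commit()
}
group.wait()
```

Because each buffer was enqueued first, the GPU executes the shadow pass before the G-buffer pass regardless of which thread commits first.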
Now, what about another scenario
which could potentially benefit from multithreading?
So here again we have two passes,
but one of them is significantly longer than the other.
Could we split that up somehow?
Yes, we can.
Here, we will break it up into two separate passes.
We have three threads here.
One is working on the first render pass,
and we have two dedicated to working on chunks of the second.
And, again, here by employing multithreading we are
within our frame deadline, and we have got a bit of time
to spare on the CPU as well
for doing whatever else we fancy doing.
It need not necessarily be more Metal work.
So what would this look like?
So once again, we have the device and the command queue.
And for this example, we are going to be using three threads.
But here we only want one command buffer.
Next, we have the special form of the Render Command Encoder,
the Parallel Render Command Encoder.
Now, this allows you to split work for a single encoder
over multiple threads, and this is particularly important to use
on iOS because it ensures
that the threaded workloads are later combined
into a single pass on the GPU.
So there is no loading and storing between passes.
This is very important that you use this if you are going
to split up a single pass across multiple threads.
So from the Parallel Render Command Encoder,
we create our three subordinate command encoders,
and each will encode to the command buffer.
Now, because we are multithreading, they may finish encoding
at indeterminate times,
not necessarily in any particular order.
Then the command buffer is submitted to the queue.
Now, it's entirely feasible
that you could even have multiple Parallel Render
Command Encoders running in parallel.
The multithreading possibilities are not quite endless,
but very flexible.
Or, as we saw earlier,
you could have a fourth thread
which is executing encoders serially.
So how do we set this up?
Well, we begin by creating one command buffer per Parallel
Render Command Encoder.
So no matter how many threads you are using,
you only want one command buffer.
We then proceed to initialize the render pass as usual,
and then we create our actual parallel encoder.
Now, here is the important bit.
When we create our subordinate encoders,
the order in which they are created determines the order
in which they will be submitted to the GPU.
This is something to bear in mind when you split
up your workload for encoding over multiple threads.
Then we go ahead and we encode our draws on separate threads,
and then finish encoding for each subordinate encoder.
Now, the second important point is all
of the subordinate encoders must have finished encoding before we
end encoding on the parallel encoder.
And how you implement this is up to you.
Then finally, the command buffer is committed to the queue.
So we went ahead and we decided to multithread our app.
Look what turned up.
So previously, we had serial encoding of passes.
This was taking 25 milliseconds of CPU time.
Now, we pursued an approach where we encode the shadow pass
on one thread, and the G-buffer pass and UI pass on another,
and now we are down to 15 milliseconds.
That's quite a nifty improvement,
and we have got a bit of time left over on the CPU as well.
So as far as multithreading goes,
if you find that you are still CPU bound and you have done all
of the investigations you can,
and determined you haven't got anything silly going
on in your app, and that you could actually benefit
from multithreading, you can encode render passes
simultaneously on multiple threads.
But should you decide to split up a single pass
across multiple threads,
you want to use the Parallel Render Command Encoder to do so.
Now, what did we learn in this session?
Well, we introduced the Metal System Trace tool,
and it was great.
It offers new insight into your app's Metal performance.
And you want to use this in conjunction with Xcode
to profile early and often.
And as we have seen, you should also try
to follow the best practices set out,
so you want to create the expensive state up front
and reuse it as often as possible.
We want to buffer dynamic resources
so we can efficiently modify them between frames
without causing stalls.
We want to make sure we are acquiring our drawable
at the correct point in time.
Usually at the last possible moment.
We want to make sure we are efficiently using our Render
Command Encoders: we don't have any empty encoders,
and we have coalesced any encoders which are writing
to the same attachment down to one.
And then if we find we are still CPU bound as we were
in this case, we might consider the approaches Metal offers
for multithreading our rendering.
So how did we do?
Well, now look at our app!
We don't have any runtime shader compilation.
Furthermore, our GPU workload is within the frame deadline.
As is the CPU workload.
And there are no gaps between processing of frames on the CPU.
And we even got quite fancy and decided to do multithreading.
We have a lot of time left over there to do other things.
And we managed to meet our target,
which in this case was 60 frames per second.
So well done us!
So now, the talk is over, and if you would
like any more information on anything mentioned
in this session, you can visit our developer portal,
you can also sign up for the developer forums,
and should you have any detailed questions or general inquiries,
you can direct them to Allan Schaffer, who is our Graphics
and Games Technologies Evangelist.
So thank you very much for attending this talk.
And we hope you found it interesting,
and enjoy the rest of WWDC!
Thank you very much!