The Metal shading language is an easy-to-use programming language for writing graphics and compute functions which execute on the GPU. Dive deeper into understanding the design patterns, memory access models, and detailed shader coding best practices which reduce bottlenecks and hide latency. Intended for experienced shader authors with a solid understanding of GPU architecture and hoping to extract every possible cycle.
My name is Fiona and this is my colleague Alex.
And I work on the iOS GPU compiler team and our job is
to make your shaders run on the latest iOS devices,
and to make them run as efficiently as possible.
And I'm here to talk about our presentation,
Advanced Metal Shader Optimization, that is, forging
and polishing your Metal shaders.
Our compiler is based on LLVM.
And we work with the open source community
to make LLVM more suitable for use on GPUs by everyone.
Here's a quick overview of the other Metal session,
in case you missed them,
and don't worry you can watch the recordings online.
Yesterday we had part one and two of adopting Metal
and earlier today we had part one and two of what's new
in Metal, because there's quite a lot that's new in Metal.
And of course here's the last one,
the one you're watching right now.
So in this presentation we're going to be going over a number
of things you can do to work with the compiler
to make your code faster.
And some of this stuff is going to be specific to A8
and later GPUs including some information
that has never been made public before.
And some of it will also be more general.
And we'll be noting that with the A8 icon you can see there
for slides that are more A8 specific.
And additionally, we'll be noting some potential pitfalls.
That is things that may not come up as often as the kind
of micro optimizations you're used to looking for,
but if you run into these, you're likely to lose
so much performance, nothing is going to matter by comparison.
So it's always worth making sure you don't run into those.
And those will be marked
with the triangle icon, as you can see there.
Before we go on, this is not the first step.
This is the last step.
There's no point to doing low-level shader optimization
until you've done the high-level optimizations before,
like watching the other Metal talks
for optimizing your draw calls, the structure
of your engine and so forth.
Optimizing your shaders at a low level should be roughly the last thing you do.
And, this presentation is primarily
for experienced shader authors.
Perhaps you've worked on Metal a whole lot and you're looking
to get more into optimizing your shaders, or perhaps you're new
to Metal, but you've done a lot of shader optimization
on other platforms and you'd like to know how
to optimize better for A8 and later GPUs,
this is the presentation for you.
So you may have seen this pipeline if you watched any
of the previous Metal talks.
And we will be focusing of course
on the programmable stages of this pipeline,
as you can see there: the shader cores.
So first, Alex is going to go
over some shader performance fundamentals
and higher level issues.
After which, I'll return for some low-level,
down and dirty shader optimizations.
Let me start by explaining the idea
of shader performance fundamentals.
These are the things that you want to make sure
that you have right before you start digging
into source level optimizations.
Usually the impact of the kind
of changes you'll make here can dwarf
or potentially hide other more targeted changes
that you make elsewhere.
So I'm going to talk about four of these today.
Address space selection for buffer arguments,
buffer preloading, dealing
with fragment function resource writes,
and how to optimize your compute kernels.
So, let's start with address spaces.
So since this functionality doesn't exist
in all shading languages, I'll give a quick primer.
So, GPUs have multiple paths for getting data from memory.
And these paths are optimized for different use cases,
and they have different performance characteristics.
In Metal, we expose control over which path is used
to the developer by requiring that they qualify all buffer
arguments and pointers in the shading language
with which address space they want to use.
So a couple of the address spaces specifically apply
to getting information from memory.
The first of which is the device address space.
This is an address space with relatively few restrictions.
You can read and write data through this address space,
you can pass as much data as you want, and the buffer offsets
that you specify at the API level have relatively flexible
alignment requirements.
On the other end of things, you have the constant address space.
As the name implies, this is a read only address space,
but there are a couple of additional restrictions.
There are limits on how much data you can pass
through this address space, and additionally the buffer offsets
that you specify at the API level have more stringent
alignment requirements.
However, this is the address space that's optimized for cases
with a lot of data reuse.
So you want to take advantage
of this address space whenever it makes sense.
Figuring out whether or not the constant address space makes
sense for your buffer argument is typically a matter
of asking yourself two questions.
The first question is: do I know how much data I have?
And if you have a potentially variable amount of data,
this is usually a sign that you need
to be using the device address space.
Additionally, you want to look at how much each item
in your buffer is being read.
And if these items can potentially be read many times,
this is usually a sign that you want to put them
into the constant address space.
So let's put this into practice with a couple of examples
from some vertex shaders.
First, you have regular, old vertex data.
So as you can see, each vertex has its own piece of data.
And each vertex is the only one that reads that piece of data.
So there's essentially no reuse here.
This is the kind of thing that really needs to be
in the device address space.
Next, you have projection matrices and other uniform matrices.
Now, typically what you have here is that you have one
of these objects, and they're read by every single vertex.
So with this kind of complete data reuse, you really want this
to be in the constant address space.
Let's mix things up a little bit
and take a look at skinning matrices.
So hopefully in this case you have some maximum number
of bones that you're handling.
But if you look at each bone, that matrix may be read
by every vertex that references that bone,
and that also is a potential for a large amount of reuse.
And so this really ought to be
in the constant address space as well.
Finally, let's look at per instance data.
As you can see all vertices
in the instance will read this particular piece of data,
but on the other hand you have a potentially variable number
of instances, so this actually needs to be
in the device address space as well.
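Putting those cases together, a minimal vertex function might look something like this. This is an illustrative sketch, not the session's actual code; the struct and argument names are assumptions:

```metal
#include <metal_stdlib>
using namespace metal;

struct VertexIn  { float3 position; float2 uv; };
struct Uniforms  { float4x4 viewProjection; };      // one copy, read by every vertex
struct VertexOut { float4 position [[position]]; float2 uv; };

// Per-vertex data (no reuse) goes in the device address space;
// the shared matrix (complete reuse) goes in the constant address space.
vertex VertexOut transform_vertex(uint vid                      [[vertex_id]],
                                  const device VertexIn  *verts [[buffer(0)]],
                                  constant Uniforms      &u     [[buffer(1)]])
{
    VertexOut out;
    out.position = u.viewProjection * float4(verts[vid].position, 1.0);
    out.uv = verts[vid].uv;
    return out;
}
```

Note the per-vertex array stays a device pointer, while the shared matrix is passed by reference in the constant address space.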
For an example of why address space selection matters
for performance, let's move
on to our next topic, buffer preloading.
So Fiona will spend some time talking about how
to actually optimize loads and stores within your shaders,
but for many cases the best thing that you can do is
to actually offload this work to dedicated hardware.
So we can do this for you in two cases,
constant buffers and vertex buffers.
But this relies on knowing things about the access patterns
in your shaders and what address space you've placed them into.
So let's start with constant buffer preloading.
So the idea here is that rather than loading
through the constant address space,
what we can actually do is take your data and put it
into special constant registers that are even faster
for the ALU to access.
So we can do this as long
as we know exactly what data will be read.
If your offsets are known at compile time,
this is straightforward.
But if your offsets aren't known
until run time then we need a little bit of extra information
about how much data that you're reading.
So indicating this
to the compiler is usually a matter of two steps.
First, you need to make sure that this data is
in the constant address space.
And additionally you need to indicate
that your accesses are statically bounded.
The best way to do this is to pass your arguments
by reference rather than pointer where possible.
If you're passing only a single item or a single struct,
this is straightforward, you can just change your pointers
to references and change your accesses accordingly.
This is a little different if you're passing an array
that you know is bounded.
So what you can do in this case is embed that fixed-size array
in a struct and pass that struct by reference rather
than passing the original pointer.
So we can put this into practice with an example
of a forward lighting fragment shader.
So as you can see in sort
of the original version what we have are a bunch of arguments
that are passed as regular device pointers.
And this doesn't expose the information that we want.
So we can do better than this.
Instead, if we note that the number
of lights is bounded, what we can do is put the light data
and the count together into a single struct like this.
And we can pass that struct in the constant address space
as a reference like this.
And so that gets us constant buffer preloading.
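As a sketch of what that bounded-lights pattern can look like in Metal (MAX_LIGHTS and all the names here are assumptions, not the session's actual code):

```metal
#include <metal_stdlib>
using namespace metal;

#define MAX_LIGHTS 32   // assumed bound; use whatever your renderer guarantees

struct Light     { float3 position; float3 color; };
struct LightData {
    uint  lightCount;           // actual number of lights this frame
    Light lights[MAX_LIGHTS];   // fixed-size array embedded in the struct
};

struct FragIn { float4 position [[position]]; float3 normal; float3 worldPos; };

// Passing the bounded struct by reference in the constant address space
// tells the compiler the accesses are statically bounded,
// which enables constant buffer preloading.
fragment float4 lit_fragment(FragIn in                     [[stage_in]],
                             constant LightData &lightData [[buffer(0)]])
{
    half3 result = half3(0.0h);
    for (uint i = 0; i < lightData.lightCount; ++i) {
        float3 l = normalize(lightData.lights[i].position - in.worldPos);
        result += half3(lightData.lights[i].color) * half(saturate(dot(in.normal, l)));
    }
    return float4(float3(result), 1.0);
}
```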
Let's look at another example
of how this can affect you in practice.
So, there are many ways to implement a deferred render,
but what we find is that the actually implementation choices
that you make can have a big impact on the performance
that you achieve in practice.
One pattern that's common now is to use a single shader
to accumulate the results of all lights.
And what you can see from the declaration of this function,
is that it can potentially read any or all lights in the scene
and that means that your input size is unbounded.
Now, on the other hand, if you're able to structure your rendering
such that each light is handled
in its own draw call, then what happens is
that each shader only needs to read that one light's data,
and that means that you can pass it
in the constant address space
and take advantage of buffer preloading.
In practice we see on A8 and later GPUs
that this is a significant performance win.
Now let's talk about vertex buffer preloading.
The idea of vertex buffer preloading is
to reuse the same dedicated hardware that we would use
for fixed-function vertex fetching.
And we can do this for regular buffer loads as long as the way
that you access your buffer looks just
like fixed-function vertex fetching.
So what that means is that you need
to be indexing using the vertex or instance ID.
Now we can handle a couple of additional modifications
to the vertex or instance IDs, such as applying a divisor,
and that's with or without any base vertex
or instance offsets you might have applied at the API level.
Of course the easiest way to take advantage of this is just
to use the Metal vertex descriptor functionality.
But if you are writing your own indexing code,
we strongly suggest that you layout your data
so that vertices fetch linearly to simplify buffer indexing.
Note that this doesn't preclude you from doing fancier things,
like if you were rendering quads and you want to pass one value
to all vertices in the quad, you can still do things
like indexing by vertex ID divided by four
because this just looks like a divisor.
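A hedged sketch of that quad pattern (names and layout are illustrative, not from the session):

```metal
#include <metal_stdlib>
using namespace metal;

struct QuadOut { float4 position [[position]]; float4 color; };

// All four vertices of quad q read the same per-quad value.
// Indexing by vertex_id / 4 still looks like fixed-function
// vertex fetching with a divisor, so it can use the fetch hardware.
vertex QuadOut quad_vertex(uint vid                        [[vertex_id]],
                           const device float2 *positions  [[buffer(0)]],
                           const device float4 *quadColors [[buffer(1)]])
{
    QuadOut out;
    out.position = float4(positions[vid], 0.0, 1.0); // linear fetch by vertex_id
    out.color    = quadColors[vid / 4];              // one value per quad, divisor of 4
    return out;
}
```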
So now let's move on to a couple shader stage specific concerns.
In iOS 10 we introduced the ability to do resource writes
from within your fragment functions.
And this has interesting implications
for hidden surface removal.
So prior to this you might have been accustomed to the behavior
that a fragment wouldn't need to be shaded as long
as an opaque fragment came in and occluded it.
So this is no longer true specifically
if your fragment function is doing resource writes,
because those resource writes still need to happen.
So instead your behavior really only depends
on what's come before.
And specifically what happens depends on whether
or not you've enabled early fragment tests
on your fragment function.
If you have enabled early fragment tests,
your fragment will be shaded once it's rasterized, as long
as it also passes the early depth and stencil tests.
If you haven't specified early fragment tests,
then your fragment will be shaded
as long as it's rasterized.
So from the perspective of minimizing your shading,
what you want to do is use early fragment tests.
But there are a couple additional things
that you can do to improve the rejection that you get.
And most of these boil down to draw order.
You want to draw these objects,
the objects whose fragment functions do resource writes,
after opaque objects.
And if you're using these objects to update your depth
and stencil buffers, we strongly suggest
that you sort these objects from front to back.
Note that this guidance should sound fairly familiar
if you've been dealing with fragment functions
that do discard or modify your depth per pixel.
Now let's talk about compute kernels.
Since the defining characteristic of a compute kernel is
that you can structure your computation however you want,
let's talk about what factors should influence how you do this.
First, we have compute thread launch overhead.
So on A8 and later GPUs there's a certain amount of time
that it takes to launch a group of compute threads.
So if you don't do enough work
within a single compute thread, you can potentially
leave the hardware underutilized
and leave performance on the table.
And a good way to deal with this, and actually a good pattern
for writing compute kernels on iOS in general, is
to process multiple conceptual work items
in a single compute thread.
And in particular a pattern that we find works well is
to reuse values not by passing them
through thread group memory, but rather by reusing values loaded
for one work item when you're processing the next work item
in the same compute thread.
And it's best to illustrate this with an example.
So this is a Sobel filter kernel, this is sort
of the most straightforward version of it. As you see,
it reads a 3 by 3 region of its source
and produces one output pixel.
So if instead we apply the pattern
of processing multiple work items
in a single compute thread,
we get something that looks like this.
Notice now that we're striding by two pixels at a time.
So processing the first pixel looks much as it did before.
We read the 3 by 3 region.
We apply the filter and we write out the value.
But now let's look at how pixel 2 is handled.
Since we're striding by two pixels at a time, we need
to make sure that there is a second pixel to process.
And now we read its data.
Note here that a 2 by 3 region
of what this pixel wants was already loaded
by the previous pixel.
So we don't need to load it again,
we can reuse those old values.
All we need to load now is the 1
by 3 region that's new to this pixel.
After which, we can apply the filter and we're done.
Note that as a result we're now doing 12 texture reads
instead of the old 9, but we're producing 2 pixels.
So this is a significant reduction in the number
of texture reads per pixel.
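Here's a rough sketch of the two-pixels-per-thread idea, using a simple 3 by 3 box filter rather than the session's exact kernel (names are illustrative; bounds checks omitted for brevity):

```metal
#include <metal_stdlib>
using namespace metal;

// Each compute thread produces two horizontally adjacent output pixels,
// reusing the overlapping 2x3 column of loads between them.
kernel void box3x3_two_pixels(texture2d<half, access::read>  src [[texture(0)]],
                              texture2d<half, access::write> dst [[texture(1)]],
                              ushort2 gid [[thread_position_in_grid]])
{
    uint2 base = uint2(gid.x * 2, gid.y);           // stride by two pixels in x

    // Load the full 3x3 region for the first pixel (9 reads).
    half4 c0 = src.read(base + uint2(0, 0)), c1 = src.read(base + uint2(1, 0)),
          c2 = src.read(base + uint2(2, 0)), c3 = src.read(base + uint2(0, 1)),
          c4 = src.read(base + uint2(1, 1)), c5 = src.read(base + uint2(2, 1)),
          c6 = src.read(base + uint2(0, 2)), c7 = src.read(base + uint2(1, 2)),
          c8 = src.read(base + uint2(2, 2));
    dst.write((c0+c1+c2+c3+c4+c5+c6+c7+c8) / 9.0h, base + uint2(1, 1));

    // Second pixel: only the new 1x3 column is loaded (3 reads);
    // the shared 2x3 region (c1, c2, c4, c5, c7, c8) is reused.
    half4 n0 = src.read(base + uint2(3, 0)),
          n1 = src.read(base + uint2(3, 1)),
          n2 = src.read(base + uint2(3, 2));
    dst.write((c1+c2+n0+c4+c5+n1+c7+c8+n2) / 9.0h, base + uint2(2, 1));
}
```

That's 12 reads for 2 pixels, versus 18 reads for the one-pixel-per-thread version.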
Of course this pattern doesn't work for all compute use cases.
Sometimes you do still need to pass data
through thread group memory.
And in that case, when you're synchronizing between threads
in a thread group, an important thing to keep in mind is
that you want to use the barrier with the smallest possible scope
for the threads that you need to synchronize.
In particular, if your thread group fits within a single SIMD,
the regular thread group barrier function
in Metal is unnecessary.
What you can use instead is the new SIMD group barrier function
introduced in iOS 10.
And what we find is that targeting your thread group
to fit within a single SIMD
and using the SIMD group barrier is often faster than trying
to use a larger thread group in order to squeeze
out that additional reuse,
but then having to use the thread group barrier as a result.
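As an illustration, a small reduction whose thread group fits in one SIMD might use the cheaper barrier like this. This is a sketch assuming a 32-thread group, not code from the session:

```metal
#include <metal_stdlib>
using namespace metal;

// Tree reduction over a 32-thread group. Because the whole group fits
// in a single SIMD, simdgroup_barrier (iOS 10) is sufficient and cheaper
// than the full threadgroup_barrier.
kernel void reduce_in_simd(const device float *input  [[buffer(0)]],
                           device float       *output [[buffer(1)]],
                           ushort tid  [[thread_position_in_threadgroup]],
                           ushort tgid [[threadgroup_position_in_grid]])
{
    threadgroup float partial[32];
    partial[tid] = input[tgid * 32 + tid];

    for (ushort stride = 16; stride > 0; stride /= 2) {
        simdgroup_barrier(mem_flags::mem_threadgroup);  // SIMD-scope barrier
        if (tid < stride)
            partial[tid] += partial[tid + stride];
    }
    if (tid == 0)
        output[tgid] = partial[0];
}
```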
So that wraps things up for me, in conclusion,
make sure you're using the appropriate address space
for each of your buffer arguments according
to the guidelines that we described.
Structure your data and rendering
to take maximal advantage of constant
and vertex buffer preloading.
Make sure you're using early fragment tests to reject
as many fragments as possible
when you're doing resource writes.
Put enough work in each compute thread
so you're not being limited
by your compute thread launch overhead.
And use the smallest barrier for the job when you need
to synchronize between threads in a thread group.
And with that I'd like to pass it back to Fiona to dive deeper
into tuning shader code.
Thank you, Alex.
So, before jumping into the specifics here, I want to go
over some general characteristics of GPUs
and the bottlenecks you can encounter.
And all of you may be familiar with this,
but I figure I should just do a quick review.
So with GPUs typically you have a set of resources.
And it's fairly common for a shader to be bottlenecked
by one of those resources.
And so for example if you're bottlenecked
by memory bandwidth, improving other things
in your shader will often not give any apparent benefit.
And while it is important to identify these bottlenecks
and focus on them to improve performance,
there is actually still benefit to improving things
that aren't bottlenecks.
For example, if you are bottlenecked
on memory bandwidth, but then you improve your arithmetic
to be more efficient, you will still save power even
if you are not improving your frame rate.
And of course being on mobile,
saving power is always important.
So it's not something to ignore,
just because your frame rate doesn't go up in that case.
So there's four typical bottlenecks to keep
in mind in shaders here.
The first is fairly straightforward, ALU bandwidth.
The amount of math that the GPU can do.
The second is memory bandwidth, again, fairly straightforward,
the amount of data that the GPU can load from system memory.
The other two are a little more subtle.
The first one is memory issue rate,
which represents the rate at which memory operations
can be issued.
And this can come up in the case
where you have smaller memory operations,
or you're using a lot of thread group memory and so forth.
And the last one, which I'll go into in a bit more detail
later, is latency, occupancy, and register usage.
You may have heard about that,
but I will save that until the end.
So to try to alleviate some of these bottlenecks,
and improve overall shader performance and efficiency,
we're going to look at four categories
of optimization opportunity here.
And the first one is data types.
And the first thing to consider
when optimizing your shader is choosing your data types.
And the most important thing to remember
when you're choosing data types is that A8
and later GPUs have 16-bit register units,
which means that for example if you're using a 32-bit data type,
that's twice the register space, twice the bandwidth,
potentially twice the power and so-forth,
it's just twice as much stuff.
So, accordingly you will save registers,
you will get faster performance, you'll get lower power
by using smaller data types.
Use half and short for arithmetic wherever you can.
Energy wise, half is cheaper than float.
And float is cheaper than integer,
but even among integers, smaller integers are cheaper
than bigger ones.
And the most effective thing you can do to save registers is
to use half for texture reads and interpolates because most
of the time you really do not need float for these.
And note I do not mean your texture formats.
I mean the data types you're using to store the results
of a texture sample or an interpolate.
And one aspect of A8 and later GPUs that is fairly convenient
and makes using smaller data types easier
than on some other GPUs is
that data type conversions are typically free,
even between float and half, which means that you don't have
to worry, oh am I introducing too many conversions in this
by trying to use half here?
Is this going to cost too much?
Is it worth it or not?
No it's probably fast because the conversions are free,
so you can use half wherever you want and not worry
about that part of it.
The one thing to keep in mind here though is
that half-precision numerics
and limits are different from float's.
And a common bug that can come up here
for example is people will write 65,535 as a half,
but that is actually infinity.
Because that's bigger than the maximum half.
And so by being aware of what these limitations are,
you'll be better able to know where you perhaps should
and shouldn't use half.
And less likely to encounter unexpected bugs in your shaders.
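For example, a few half literals and where they land (a quick illustrative sketch; the limits stated in the comments are properties of the IEEE 16-bit format):

```metal
#include <metal_stdlib>
using namespace metal;

void half_limits_demo() {
    half h1 = 65504.0h;  // largest finite half value
    half h2 = 65535.0h;  // too big: becomes +infinity, a common bug
    half h3 = 2049.0h;   // rounds to 2048: integers above 2048 lose precision
}
```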
So one example application
for using smaller integer data types is thread IDs.
And as those of you who have worked on compute kernels will know,
thread IDs are used all over your programs.
And so making them smaller can significantly increase the
performance of arithmetic, and can save registers and so forth.
And for local thread IDs, there's no reason to ever use uint
for them as in this case,
because thread groups can't have that many threads.
For global thread IDs, usually you can get away with a ushort
because most of the time you don't have
that many global thread IDs.
Of course it depends on your program.
But in most cases, you won't go over 2 to the 16 minus 1,
so instead you can do this.
And this is going to be lower power, it's going to be faster
because all of the arithmetic involving your thread ID is now
going to be faster.
So I highly recommend this wherever possible.
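A minimal sketch of what that looks like (the kernel and names are hypothetical):

```metal
#include <metal_stdlib>
using namespace metal;

// ushort IDs instead of uint: less register space and cheaper arithmetic.
kernel void scale(device half *data [[buffer(0)]],
                  ushort tid [[thread_position_in_threadgroup]], // never needs uint
                  ushort gid [[thread_position_in_grid]])        // ushort is fine if the grid stays under 2^16
{
    data[gid] = data[gid] * half(tid);
}
```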
Additionally, keep in mind that in C like languages,
which of course includes Metal, the precision
of an operation is defined by the larger of the input types.
For example, if you're multiplying a float by a half,
that's a float operation not a half operation, it's promoted.
So accordingly, make sure not to use float literals
when not necessary, because that will turn here what appears
to be a half operation, it takes a half and returns a half,
into a float operation.
Because by the language semantics,
that's actually a float operation since at least one
of the inputs is float.
And so you probably want to do this.
This will actually be a half operation.
This will actually be faster.
This is probably what you mean.
So be careful not
to inadvertently introduce float precision arithmetic
into your code when that's not what you meant.
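For instance (illustrative functions, not from the session):

```metal
#include <metal_stdlib>
using namespace metal;

// 2.0 is a float literal, so the multiply is promoted to a float operation:
half scale_float(half x) { return x * 2.0; }   // float multiply

// 2.0h keeps everything in half precision, which is probably what was meant:
half scale_half(half x)  { return x * 2.0h; }  // half multiply
```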
And while I did mention that smaller data types are better,
there's one exception to this rule and that is char.
Remember as I said that native data type size on A8
and later GPUs is 16-bit, not 8-bit.
And so char is not going to save you any space or power
or anything like that
and furthermore there's no native 8-bit arithmetic.
So it sort of has to be emulated.
It's not overly expensive if you need it, feel free to use it.
But it may result in extra instructions.
So don't unnecessarily shrink things to char
that don't actually need it.
So next we have arithmetic optimizations,
and pretty much everything
in this category affects ALU bandwidth.
The first thing you can do is always use the Metal built-in functions.
They're optimized implementations
for a variety of functions.
They're already optimized for the hardware.
It's generally better than implementing them yourself.
And in particular, there are some of these
that are usually free in practice.
And this is because GPUs typically have modifiers,
operations that can be performed for free on the inputs
and outputs of instructions.
And for A8 and later GPUs these typically include negate,
absolute value, and saturate as you can see here,
these three operations in green.
So, there's no point to trying to "be clever" and speed
up your code by avoiding those, because again,
they're almost always free.
And because they're free, you can't do better than free.
There's no way to optimize better than free.
A8 and later GPUs, like a lot
of others nowadays, are scalar machines.
And while shaders are typically written with vectors,
the compiler is going to split them all apart internally.
Of course, there's no downside to writing vector code,
I mean often it's clearer, often it's more maintainable,
often it fits what you're trying to do, but it's also no better
than writing scalar code from a compiler perspective
and the code you're going to get.
So there's no point in trying to vectorize code
that doesn't really fit a vector format, because it's just going
to end up the same thing in the end,
and you're kind of wasting your time.
However, as a side note, which I'll go
into in more detail later, A8 and later GPUs
do have vector loads and stores even though they do not have
vector arithmetic.
So this only applies to arithmetic here.
Instruction Level Parallelism is something that some
of you may have used optimizing for,
especially if you've done work on CPUs.
But on A8 and later GPUs this is generally not a good thing
to try to optimize for because it typically works
against register usage,
and register usage typically matters more.
So a common pattern you may have seen is a kind of loop
where you have multiple accumulators in order
to better deal with latency on a CPU.
But on A8 and later GPUs this is probably counterproductive.
You'd be better off just using one accumulator.
Of course this applies to much more complex examples
than the artificial simple ones here.
Just write what you mean, don't try to restructure your code
to get more ILP out of it.
It's probably not going to help you at best, and at worst,
you just might get worse code.
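As an illustration, a simple sum loop on A8 and later GPUs is best left with its single accumulator (a hypothetical sketch):

```metal
#include <metal_stdlib>
using namespace metal;

// On a CPU you might split this sum across two or four accumulators
// to hide latency; here the single accumulator uses fewer registers,
// and register usage matters more on these GPUs.
float dot_sum(const device float *a, const device float *b, uint n) {
    float sum = 0.0;                   // one accumulator: lower register pressure
    for (uint i = 0; i < n; ++i)
        sum = fma(a[i], b[i], sum);
    return sum;
}
```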
So one fairly nice feature of A8 and later GPUs is
that they have very fast select instructions
that is the ternary operator.
And historically it's been fairly common
to use clever tricks like this to try
to perform select operations with arithmetic
to avoid branches or whatever.
But on modern GPUs this is usually counterproductive,
and especially on A8 and later GPUs, because the compiler can't see
through this cleverness.
It's not going to figure out what you actually mean.
And really, this is really ugly.
You could just have written this.
And this is going to be faster, shorter, and it's actually going
to show what you mean.
Like before, being overly clever will often obfuscate what you're
trying to do and confuse the compiler.
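A sketch of the two versions side by side (the function names are illustrative):

```metal
#include <metal_stdlib>
using namespace metal;

// The arithmetic "trick": obfuscated, and the compiler can't see through it,
// so it generates real multiplies and adds.
float select_tricky(float a, float x, float y) {
    return x * step(0.0, a) + y * (1.0 - step(0.0, a));
}

// The direct ternary: shorter, clearer, and compiles to a fast select instruction.
float select_plain(float a, float x, float y) {
    return (a >= 0.0) ? x : y;
}
```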
Now, this is a potential major pitfall,
hopefully this won't come up too much.
On modern GPUs most of them do not have integer division
or modulus instructions, integer, not float.
So avoid division and modulus by denominators
that are not literals or function constants,
the new feature mentioned in some of the earlier talks.
So in this example, what we have over here, this first one
where the denominator is a variable,
that will be very, very slow.
Think hundreds of clock cycles.
But these other two examples, those will be very fast.
Those are fine.
So don't feel like you have to avoid that.
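A quick sketch of the three cases (the function constant here is hypothetical):

```metal
#include <metal_stdlib>
using namespace metal;

constant ushort bucket_count [[function_constant(0)]];  // set at pipeline creation

// Runtime divisor: must be emulated, hundreds of cycles.
ushort mod_slow(ushort x, ushort d) { return x % d; }

// Literal divisor: the compiler turns this into cheap arithmetic.
ushort mod_literal(ushort x)        { return x % 16; }

// Function-constant divisor: also known at compile time, also fast.
ushort mod_fc(ushort x)             { return x % bucket_count; }
```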
So, finally the topic of fast-math.
So in Metal, fast-math is on by default.
And this is because compiler fast-math optimizations are
critical to performance Metal shaders.
They can give off in 50% performance gain or more
over having fast-math off.
So it's no wonder it's on be default.
And so what exactly do we do in fast-math mode?
Well, the first is that some
of the Metal built-in functions have different precision
guarantees between fast-math and non fast-math.
And so in some of them they will have slightly lower precision
in fast-math mode to get better performance.
The compiler may increase the intermediate precision
of your operations, like by forming a fused multiply-add.
It will not decrease the intermediate precision.
So for example if you write a float operation you will get an
operation that is at least a float operation,
not a half operation.
So if you want to write half operations you better write
that, the compiler will not do that for you,
because it's not allowed to.
It can't change your precision like that.
We do ignore strict NaN, infinity,
and signed zero semantics, which is fairly important,
because without that you can't actually prove
that x times zero is equal to zero.
But we will not introduce new NaNs,
because in practice that's a really nice way
to annoy developers, and break their code
and we don't want to do that.
And the compiler will perform arithmetic re-association,
but it will not do arithmetic distribution.
And really this just comes down to what doesn't break code
and makes it faster versus what does break code.
And we don't want to break code.
So if you absolutely cannot use fast-math for whatever reason,
there are some ways to recover some of that performance.
Metal has a fused multiply-add built-in, which you can see here,
which allows you to directly request a fused
multiply-add.
the compiler is not even allowed to make those,
it cannot change one bit of your rounding, it is prohibited.
So if you want to use fused multiply-add
and fast-math is off, you're going
to have to use the built-in.
And that will regain some of the performance,
not all of it, but at least some.
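For example (an illustrative function):

```metal
#include <metal_stdlib>
using namespace metal;

// With fast-math off, the compiler may not fuse a * b + c on its own,
// but the fma built-in requests the fused operation explicitly:
float mad_explicit(float a, float b, float c) {
    return fma(a, b, c);  // one fused multiply-add, a single rounding step
}
```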
So, on our third topic, control flow.
Predicated GPU control flow is not a new topic, and some
of you may already be familiar with it.
But here's a quick review of what it means for you.
Control flow that is uniform across the SIMD,
that is every thread is doing the same thing,
is generally fast.
And this is true even if the compiler can't see that.
So if your program doesn't appear uniform, but just happens
to be uniform when it runs, that's still just as fast.
And similarly, the opposite of this divergence,
different lanes doing different things, well in that case,
it potentially may have to run all
of the different paths simultaneously unlike a CPU
which only takes one path at a time.
And as a result it does more work, which of course means
that inefficient control flow can affect any
of the bottlenecks, because it just outright means the GPU is
doing more stuff, whatever that stuff happens to be.
So, the one suggestion I'll make on the topic of control flow is
to avoid switch fall-throughs.
And these are fairly common in CPU code.
But on GPUs they can potentially be somewhat inefficient,
because the compiler has to do fairly nasty transformations
to make them fit within the control flow model of GPUs.
And often this will involve duplicating code and all sort
of nasty things you probably would rather not be happening.
So if you can find a nice way to avoid these switch fall-throughs
in your code, you'll probably be better off.
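One hedged sketch of restructuring a fall-through (the switch shown in the comment is hypothetical):

```metal
#include <metal_stdlib>
using namespace metal;

// Instead of a fall-through chain like
//   switch (taps) { case 3: v += c2;  case 2: v += c1;  case 1: v += c0; }
// which the compiler must restructure, often duplicating code,
// express the same accumulation with independent conditions:
half blend(ushort taps, half c0, half c1, half c2) {
    half v = 0.0h;
    if (taps >= 3) v += c2;
    if (taps >= 2) v += c1;
    if (taps >= 1) v += c0;
    return v;
}
```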
So now we're on to our final topic.
And we'll start with the biggest pitfall
that people most commonly run into
and that is dynamically indexed non-constant stack arrays.
Now that's quite a mouthful,
but a lot of you probably are familiar with code
that looks vaguely like this.
You have an array that consist of values that are defined
in runtime and vary between each thread or each function call.
And you index it to the array with another value
that is also a variable.
That is a dynamically indexed non-constant stack array.
Now before we go on, I'm not going to ask you to take
for granted the idea that stacks are slow on GPUs.
I'm going to explain why.
So, on CPUs typically you have like a couple threads,
maybe a dozen threads, and you have megabytes of cache split
between those threads.
So every thread can have hundreds of kilobytes
of stack space before they get really slow and have
to head off to main memory.
On a GPU you often have tens of thousands of threads running.
And they're all sharing a much smaller cache too.
So when it comes down to it each thread has very,
very little space for data for a stack.
It's just not meant for that, it's not efficient and so
as a general rule, for most GPU programs,
if you're using the stack, you've already lost.
It's so slow that almost anything else would have been faster.
And an example from a real-world app: at the start
of the program it needed to select one of two float4
vectors, so it used a 32-byte array,
an array of two float4s, and tried to select
between them using this stack array.
And that caused a 30% performance loss
in this program even though it's only done once at the start.
It can be pretty significant.
And of course, every time we improve the compiler, we try
harder and harder to do anything we can
to avoid generating these stack accesses, because it is that bad.
Now I'll show you two examples here that are okay.
This first one: you can see those are constants.
It's not a non-constant stack array, and that's fine,
because the values don't vary per thread, so they don't need
to be duplicated per thread.
So that's okay.
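A hedged sketch of the constant case (the weights here are hypothetical): the table contents are compile-time constants, identical for every thread, so nothing has to be duplicated per thread even though the index is dynamic:

```cpp
#include <cassert>

// Constant-data case: kWeights is the same for every thread, so a
// dynamic index into it does not require per-thread stack storage.
float blur_weight(int i) {
    const float kWeights[3] = { 0.25f, 0.5f, 0.25f };
    return kWeights[i];
}
```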
And this one is also okay.
It's still a dynamically indexed non-constant stack array,
but it's only dynamically indexed because of this loop.
And the compiler is going to unroll that loop.
In fact, our compiler aggressively unrolls any loop
that is accessing the stack, to try to make it stop doing that.
So in this case after it's unrolled it will no longer be
dynamically indexed, so it will be fast.
And this is worth mentioning,
because this is a fairly common pattern in a lot
of graphics code and I don't want to scare you into not doing
that when it's probably fine.
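To make the unrollable case concrete, here is a hedged sketch (hypothetical function, not the slide's exact code): the array index is only "dynamic" because it is the loop counter, and the trip count is a known constant, so after full unrolling every index is constant and no stack access remains:

```cpp
#include <cassert>

// The indices into `values` come from loops with a fixed trip count,
// so the compiler can fully unroll them and turn every access into a
// constant-index access -- fast, no per-thread stack needed.
float sum_scaled(const float* input) {
    float values[4];
    for (int i = 0; i < 4; ++i)   // fixed trip count: unrollable
        values[i] = input[i] * 2.0f;
    float total = 0.0f;
    for (int i = 0; i < 4; ++i)
        total += values[i];
    return total;
}
```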
So now that we've gone over the topic of how
to not do certain types of loads and stores,
let's go on to making the loads and stores
that we do actually fast.
Now while A8 and later GPUs use scalar arithmetic,
as I went over earlier, they do have vector memory units.
And one big vector load is of course faster
than multiple smaller ones that add up to the same size.
And this typically affects the memory issue rate bottleneck,
because if you're running
through a lot of loads, that's fewer loads to issue.
And, so as of iOS 10, one of our new compiler optimizations,
is we will try to vectorize some loads and stores that go
to neighboring memory locations wherever we can,
because again it can give good performance improvements.
But nevertheless, this is one of the cases where working
with the compiler can be very helpful,
and I'll give an example.
So as you can see here, here's a simple loop
that does some arithmetic and reads in an array of structures,
but on each iteration, it has to do two separate loads.
Now we would want that to be one if we could,
because one is better than two.
And the compiler wants that too.
It wants to try to vectorize this but it can't, because A
and C aren't next to each other in memory
so there's nothing it can do.
The compiler's not allowed to rearrange your structs,
so we've got two loads.
There's two solutions to this.
Number one, of course: just make it a float2;
now it's a vector load, you're done.
One load instead of two, we're all good.
And as of iOS 10, this should be equally fast,
because here, we've reordered our struct
to put the values next to each other,
so the compiler can now vectorize the loads
when it's doing it.
And this is an example again of working with the compiler,
you've allowed the compiler to do something it couldn't before,
because you understand what's going on.
You understand how the patterns need to be
to make the compiler happy
and make it able to do its job.
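Here is a hedged sketch of the layout fix (the struct and field names are hypothetical stand-ins for the slide's example): in the first layout, the two fields read each iteration have another field between them, forcing two loads; reordering makes them adjacent so the compiler can merge them into one wider load. In MSL, packing them as a single float2 is the equivalent fix:

```cpp
#include <cassert>

// Fields a and c are read together, but b sits between them, so each
// iteration needs two separate loads.
struct Separated { float a; float b; float c; };

// Reordered so a and c are adjacent: the compiler can now vectorize
// the pair into one wider load.
struct Adjacent { float a; float c; float b; };

float sum_products(const Adjacent* items, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += items[i].a * items[i].c;  // adjacent fields: one vector load
    return sum;
}
```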
So, another thing to keep in mind with loads and stores is
that A8 and later GPUs have dedicated hardware
for device memory addressing, but this hardware has limits.
The offset for accessing device memory must fit
within a signed integer.
Smaller types like short and ushort are also okay,
in fact they're highly encouraged,
because those do also fit within a signed integer.
However, of course uint does not because it can have values
out of range of signed integer.
And so if the compiler runs into a situation
where the offset is a uint and it cannot prove
that it will safely fit within a signed integer,
it has to manually calculate the address,
rather than letting the dedicated hardware do it.
And that can waste power,
it can waste ALU performance and so forth.
It's not good.
So, change your offset to int, now the problem's solved.
And of course taking advantage
of this will typically save you ALU bandwidth.
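A minimal hedged sketch of the guideline (hypothetical buffer walk): a signed int (or short/ushort) index provably fits the addressing hardware's signed range, whereas a uint index that the compiler cannot prove fits forces manual address arithmetic:

```cpp
#include <cassert>

// Using `int` (not `uint`) for the offset lets the dedicated device
// memory addressing hardware form the address directly; an unprovable
// uint offset would force the compiler to compute it in the ALU.
float sum_buffer(const float* buf, int count) {
    float total = 0.0f;
    for (int i = 0; i < count; ++i)   // signed offset: hardware-friendly
        total += buf[i];
    return total;
}
```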
So now on to our final topic that I sort of glossed
over earlier, latency and occupancy.
So one of the core design tenets
of modern GPUs is that they hide latency
by using large scale multithreading.
So when they're waiting for something slow to finish,
like a texture read, they just go
and run another thread instead
of sitting there doing nothing while waiting.
And this is fairly important
because texture reads typically take a couple hundred cycles
to complete on average.
And so the more latency you have in a shader,
the more threads you need to hide that latency,
and how many threads can you have?
Well it's limited by the fact that you have a fixed set
of resources that are shared
between threads in a thread group.
So clearly depending on how much each thread uses,
you have a limitation on the number of threads.
And the two things that are split are registers
and threadgroup memory.
So if you use more registers per thread,
now you can't have as many threads.
And if you use more threadgroup memory per thread,
again you run into the same problem:
more threadgroup memory per thread means fewer threads.
And you can actually check the occupancy of your shader
by querying MTLComputePipelineState,
which will tell you what the actual occupancy
of your shader is based on the register usage
and the threadgroup memory usage.
And so when we say a shader is latency limited,
it means you have too few threads
to hide the latency of a shader.
And there's two things you can do there:
you can either reduce the latency of your shader,
or save registers, or whatever else it is
that is preventing you from having more threads.
So, since it's kind of hard to go over latency
in a very large, complex shader,
I'll go over a little bit of a pseudocode example
that will hopefully give you a bit of an intuition of how
to think about latency and how to sort
of mentally model it in your shaders.
So, here's an example of a real dependency.
We have a texture sample, and then we use the result
of that texture sample in an if statement,
and then we do another texture sample inside that if statement.
We have to wait twice.
Because we have to wait once before doing the if statement.
And we have to wait again before using the value
from the second texture sample.
So that's two serial texture accesses
for a total of twice the latency.
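A hedged sketch of the real-dependency shape (sample_tex is a hypothetical stand-in for a texture read, not a real API): the branch consumes the first result and the return consumes the second, so the two latencies serialize:

```cpp
#include <cassert>

// Stand-in for a texture read; on real hardware each call would have
// a couple hundred cycles of latency.
float sample_tex(int coord) { return coord * 0.5f; }

float shade_serial(int coord) {
    float a = sample_tex(coord);       // wait #1: result needed for the branch
    if (a > 0.0f)
        return sample_tex(coord + 1);  // wait #2: result needed for the return
    return 0.0f;
}
```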
Now here's an example of a false dependency.
It looks a lot like the other,
except we're not using a in the if statement.
But typically, the GPU can't wait across control flow.
The if statement acts as an effective barrier in this case.
So, we automatically have
to wait here anyways even though there's no data dependency.
So we still get twice the latency.
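The false-dependency shape, sketched the same hedged way (read_tex is a hypothetical stand-in for a texture read): the branch never uses the first result, but because the second read sits inside control flow, the hardware still waits at the branch boundary:

```cpp
#include <cassert>

// Stand-in for a texture read.
float read_tex(int coord) { return coord * 0.5f; }

float shade_false_dep(int coord, bool condition) {
    float a = read_tex(coord);
    if (condition)                    // condition is independent of `a`,
        return read_tex(coord + 1);   // but the GPU waits here anyway
    return a;
}
```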
As you noticed, the GPU does not actually care
about your data dependencies.
It only cares about what the dependencies appear to be,
and so the second one will have just as long a latency
as the first one, even though there isn't a data dependency.
And then finally here's a simple one
where you just have two texture reads at the top,
and they can both be done in parallel
and then we can have a single wait.
So it's 1x instead of 2x for latency.
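And the parallel shape, sketched with the same hypothetical stand-in (fetch_tex is not a real API): both reads are issued before either result is consumed, so their latencies overlap and a single wait covers both:

```cpp
#include <cassert>

// Stand-in for a texture read.
float fetch_tex(int coord) { return coord * 0.5f; }

float shade_parallel(int coord) {
    float a = fetch_tex(coord);      // issue read #1
    float b = fetch_tex(coord + 1);  // issue read #2 without waiting on #1
    return a + b;                    // one wait covers both: 1x latency
}
```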
So, what are you going to do with this knowledge?
So in many real world shaders you have opportunities
to tradeoff between latency and throughput.
And a common example of this might be that you have some code
where based on one texture read you can decide, oh we don't need
to do anything in this shader, we're going to quit early.
And that can be very useful.
Because now all that work that's being done in the cases
where you don't need it to be done,
you're saving all that work.
But now you're increasing your throughput
by reducing the amount of work you need to do.
But you're also increasing your latency because now it has
to do the first texture read, then wait for that texture read,
then do your early termination check,
and then do whatever other texture reads you have.
And well is it faster?
Is it not?
Often you just have to test.
Because which is faster is really going to depend
on your shader, but it's a thing worth being aware
of that often is a real tradeoff and you often have
to experiment to see what's right.
Now, while there isn't a universal rule,
there is one particular guideline I can give for A8
and later GPUs and that is typically the hardware needs
at least two texture reads at a time
to get full ability to hide latency.
One is not enough.
If you have to do one, no problem.
But if you have some choice
in how you arrange your texture reads in your shader,
if you allow it to do at least two at a time,
you may get better performance.
So, in summary.
Make sure you pick the correct address spaces, data structures,
layouts and so forth, because getting this wrong is going
to hurt so much that often none of the other stuff
in the presentation will matter.
Work with the compiler.
Write what you mean.
Don't try to be too clever,
or the compiler won't know what you mean and will get lost,
and won't be able to do its job.
Plus, it's easier to write what you mean.
Keep an eye out for the big pitfalls,
not just the micro-optimizations.
They're often not as obvious, and they often don't come
up as often, but when they do, they hurt.
And they will hurt so much that no number
of micro-optimizations will save you.
And feel free to experiment.
There's a number of real tradeoffs that happen,
where there's simply no single rule.
Try them both, see what's faster.
So, if you want more information, go online.
The video of the talk will be up there.
Here are the other sessions if you missed them earlier; again,
the videos will be online.