Should I use a compute pipeline for this

I have just started to dive into exactly what a compute shader is and what kinds of things it can do. One of the examples given was converting an image to greyscale, which got me thinking. In my OpenGL ES pipeline there were a few passes that were essentially image processing: one texture going into a shader that just rendered a full-screen quad, for things like blurring an image and fading extremely dark colors.


It has occurred to me that these goals could potentially be better accomplished with a compute pipeline.


Can I generalize that a task is usually better suited to a compute pipeline if it does not technically involve rendering triangles, or are there cases, such as texture sampling, where a render pipeline still has an edge?

Answered by MikeAlpha

Accepted Answer

In my opinion, the biggest difference between render and compute pipelines is domain control.


I. Compute

With compute, you just have a one-, two-, or three-dimensional grid, and the compute kernel is invoked on every point in that grid. You also get to control how the domain is diced up into threadgroups, threads and so on. This lets you use some extra features like the "threadgroup" address space for fast communication between threads in the same threadgroup, by means of special local memory that is inaccessible from graphics functions. That in turn can make certain types of algorithms run faster (like parallel prefix scan, or the "stencil"-type computations often used in graphics filters and CFD simulations).
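For illustration only, here is a minimal sketch of such a "stencil"-type kernel in MSL, staging a tile of the input in threadgroup memory so neighbouring threads share texture reads. The kernel name, the 16x16 threadgroup size and the 3x3 box filter are my own assumptions, not anything from the post:

    #include <metal_stdlib>
    using namespace metal;

    // Assumes a 16x16 threadgroup; the tile carries a 1-texel halo per side.
    kernel void boxFilter3x3(texture2d<float, access::read>  src [[texture(0)]],
                             texture2d<float, access::write> dst [[texture(1)]],
                             uint2 gid  [[thread_position_in_grid]],
                             uint2 lid  [[thread_position_in_threadgroup]],
                             uint2 tgid [[threadgroup_position_in_grid]])
    {
        threadgroup float4 tile[18][18];   // 16 + 2 * 1 halo

        int2 maxCoord = int2(src.get_width() - 1, src.get_height() - 1);
        int2 base = int2(tgid) * 16 - 1;   // top-left of the haloed tile

        // Cooperatively load the tile (including halo) into threadgroup
        // memory, clamping reads at the image border.
        for (uint y = lid.y; y < 18; y += 16)
            for (uint x = lid.x; x < 18; x += 16) {
                int2 p = clamp(base + int2(x, y), int2(0), maxCoord);
                tile[y][x] = src.read(uint2(p));
            }

        // Make the shared tile visible to every thread in the group.
        threadgroup_barrier(mem_flags::mem_threadgroup);

        if (gid.x >= dst.get_width() || gid.y >= dst.get_height()) return;

        // 3x3 average taken from fast threadgroup memory instead of
        // nine separate texture reads per thread.
        float4 sum = float4(0.0);
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                sum += tile[lid.y + 1 + dy][lid.x + 1 + dx];
        dst.write(sum / 9.0, gid);
    }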


On the other hand, dividing up the domain is an additional task, and not a trivial one if you care about performance, especially since the "perfect" split depends on the device. So if you want something to run as fast as possible, you really should prepare several versions of the compute kernel (it is usually more efficient to assign more than one "cell" of the problem to one "thread" of a Metal compute kernel than to do a 1-to-1 mapping), and maybe write some "autotune" code that tries out several combinations of kernels and threadgroup sizes to find what works best on a given device.


Note that samplers CAN be used in compute functions.
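For example, here is a sketch of a compute kernel doing a bilinear-filtered half-size downscale with a sampler declared right in the shader (the kernel name is made up):

    #include <metal_stdlib>
    using namespace metal;

    kernel void downscaleHalf(texture2d<float, access::sample> src [[texture(0)]],
                              texture2d<float, access::write>  dst [[texture(1)]],
                              uint2 gid [[thread_position_in_grid]])
    {
        if (gid.x >= dst.get_width() || gid.y >= dst.get_height()) return;

        constexpr sampler s(coord::normalized,
                            address::clamp_to_edge,
                            filter::linear);

        // Sample at the centre of the destination texel; the linear
        // filter averages the four covered source texels for us.
        float2 uv = (float2(gid) + 0.5) /
                    float2(dst.get_width(), dst.get_height());
        dst.write(src.sample(s, uv), gid);
    }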


II. Render

Now for render, you just describe geometric primitives (points, triangles and so on) and you don't have to worry about threadgroups, threads and all that. This is better suited for drawing, because Metal and the driver will likely come up with a better approach than an ordinary programmer can. But the two-level structure of a typical render pipeline (I'll leave out tessellation shaders) also gives you some programmable, conditional control over what gets processed.


For example, I once had a compute kernel spanning a big texture (very close to the 16K texture size and memory limits of the device). But the processing wasn't really done on the whole texture; it was occurring only in some areas (depending on the input data, which was another texture). And I couldn't optimise this compute kernel below about 1/10th of a second. So I diced the whole area up into a triangle mesh, every two triangles forming a quad of 128 by 128 texels. In the vertex shader, a check was performed as to whether computation was needed in that area, and if it wasn't, the Z coordinate was altered to put the triangles in question outside the <near, far> clipping range. This gave a huge speedup, because computations were done only where needed, and not on the whole domain.
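A rough reconstruction of that trick (not the actual code; the per-region flag buffer, the attribute layout and the clip-space values are my simplified assumptions):

    #include <metal_stdlib>
    using namespace metal;

    struct VertexIn {
        float2 position [[attribute(0)]];  // clip-space corner of the quad
        uint   region   [[attribute(1)]];  // which 128x128 block this quad covers
    };

    struct VertexOut {
        float4 position [[position]];
    };

    vertex VertexOut regionVertex(VertexIn in [[stage_in]],
                                  const device uint *workNeeded [[buffer(1)]])
    {
        VertexOut out;
        // z = 0.5 keeps the quad inside the clip range; z = 2.0 (with
        // w = 1) is outside it, so the rasterizer never generates any
        // fragments for that quad.
        float z = workNeeded[in.region] ? 0.5 : 2.0;
        out.position = float4(in.position, z, 1.0);
        return out;
    }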


On the other hand, render setup is more tedious: you have to write at least two shaders, and render pipeline setup is more work than for a compute pipeline. And using render for compute can sometimes be super awkward, with floating-point coordinates being converted back and forth to what are really discrete memory indices.


Hope that helps a bit.

Yes, that does help quite a bit for sure! Thank you especially for that thoughtful example of regional compute on your large texture.



Could you elaborate a bit more on the samplers-in-compute part? I am not sure how that would work, since generally a thread is designated for one pixel. Would that not open up threading issues, with multiple threads wanting to read the same texel to perform linear sampling?



I was, for example, thinking about the MPS Gaussian blur shader, which I am assuming is compute-backed. The thing about that process is that each pixel is essentially the weighted sum of the pixels surrounding it, so you can't really break the image into neat chunks, because the border of each chunk would need to read from the adjacent chunk.

It's fine for multiple threads to _read_ from the same memory location at the same time. What you need to be careful with is _writing_ to the same memory location.
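To make that concrete, here is a toy horizontal blur in MSL: every thread reads texels that its neighbours are also reading, but each thread writes only its own output texel, so the weighted sum the question describes is race-free. The kernel name and the 3-tap weights are just for illustration:

    #include <metal_stdlib>
    using namespace metal;

    constant float w[3] = {0.25, 0.5, 0.25};  // toy 1-D blur weights

    kernel void blurHorizontal(texture2d<float, access::read>  src [[texture(0)]],
                               texture2d<float, access::write> dst [[texture(1)]],
                               uint2 gid [[thread_position_in_grid]])
    {
        if (gid.x >= dst.get_width() || gid.y >= dst.get_height()) return;

        float4 sum = float4(0.0);
        for (int dx = -1; dx <= 1; ++dx) {
            uint x = clamp(int(gid.x) + dx, 0, int(src.get_width()) - 1);
            sum += w[dx + 1] * src.read(uint2(x, gid.y));  // shared, read-only
        }
        dst.write(sum, gid);                               // exclusive write
    }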

"generally a thread is designated for one pixel"

Nope. A thread is more like (with caveats) "a single instance of the compute kernel running". And that's it. What it does is completely up to the code you write. Specifically, it can have nothing to do with graphics at all. For example, I once wrote a particle system implementation where particles had "lives". So I had a compute kernel for the "particle system update", which involved going through the buffer containing the particles and decrementing their lives. Then particles with nonzero lives were copied into a second buffer, "filling in" the gaps left by dead particles (this used a parallel-prefix-scan primitive), and the final number of particles was computed and written into a third buffer, for use with indirect draw calls.
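For illustration, the update step could look roughly like this in MSL (a sketch with made-up names, not the original code; the compaction and draw-indirect parts would be separate kernels):

    #include <metal_stdlib>
    using namespace metal;

    struct Particle {
        float2 position;
        float2 velocity;
        uint   life;      // remaining ticks; 0 means dead
    };

    kernel void updateParticles(device Particle *particles [[buffer(0)]],
                                constant uint &count [[buffer(1)]],
                                uint id [[thread_position_in_grid]])
    {
        if (id >= count) return;      // the grid may be padded up to a
                                      // multiple of the threadgroup size

        Particle p = particles[id];
        if (p.life > 0) {
            p.position += p.velocity; // advance the simulation
            p.life -= 1;              // decrement the particle's life
        }
        particles[id] = p;
    }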


The way I think of it, compute kernels have more to do with parallel computing models/paradigms such as MPI than with computer graphics. It's an abstraction. You have this grid, which can be 1-, 2- or 3-dimensional. You get to control how many dimensions you want, and how many threads there are in each dimension. Of course, a 2D grid maps naturally to 2D pixels (or texels, or fragments, or whatever), but it doesn't have to be that. It could be some matrix, or whatever two-dimensional "thingy" there is. A 1D grid is often used for mapping into memory; after all, the usual memory layout is kind of like a 1D grid, from the first address (grid cell 0) to the memory size (grid cell memory size - 1).
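As a tiny illustration of the 1D case, here is a kernel that treats the grid purely as a buffer index, with nothing pixel-related about it (names are mine):

    #include <metal_stdlib>
    using namespace metal;

    // y[i] = a * x[i] + y[i], one thread per buffer element.
    kernel void saxpy(device float *y       [[buffer(0)]],
                      const device float *x [[buffer(1)]],
                      constant float &a     [[buffer(2)]],
                      constant uint  &n     [[buffer(3)]],
                      uint i [[thread_position_in_grid]])
    {
        if (i < n)                // guard: the grid size may be rounded up
            y[i] = a * x[i] + y[i];
    }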


Now, in a kernel there are ways to get information about which particular thread you're executing right now. And it is up to your code to do with that information whatever you please. For example, if you're doing image manipulation, you could have a 2D grid and map coordinates i, j onto pixel i, j. But you could also (and this is often done, for performance reasons) create a grid sized (your image width / 4, your image height), and in thread i, j process pixels i * 4, i * 4 + 1, i * 4 + 2 and i * 4 + 3. The mapping FROM the grid onto your particular problem is your responsibility. The grid is just a way for the scheduler to "measure the size" of your problem, to know how to invoke the kernels, and that's it.
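Here is a sketch of that (width / 4, height) mapping, using the greyscale conversion from the original question as the per-pixel work (assuming, for simplicity, a width that is a multiple of 4):

    #include <metal_stdlib>
    using namespace metal;

    // Dispatched over a grid of (width / 4, height): thread (i, j)
    // converts pixels 4i .. 4i+3 on row j.
    kernel void greyscale4(texture2d<float, access::read>  src [[texture(0)]],
                           texture2d<float, access::write> dst [[texture(1)]],
                           uint2 tid [[thread_position_in_grid]])
    {
        for (uint k = 0; k < 4; ++k) {
            uint2 p = uint2(tid.x * 4 + k, tid.y);
            if (p.x >= src.get_width() || p.y >= src.get_height()) return;
            float4 c = src.read(p);
            float g = dot(c.rgb, float3(0.299, 0.587, 0.114)); // Rec. 601 luma
            dst.write(float4(g, g, g, c.a), p);
        }
    }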


Then there is the issue of threadgroups. This is more of an implementation issue. You see, the term "threads" here (as compared to CPU threads) is really misleading. What a GPU really has are quite wide SIMD units (usually 16 or 32 lanes), and most GPUs have several of them. So there are these simple processing units which fetch instructions from a single source, but execute them in parallel over several sets of registers, in effect maintaining several "state sets", so to speak.


A bit about this is here: https://developer.apple.com/library/content/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Compute-Ctx/Compute-Ctx.html#//apple_ref/doc/uid/TP40014221-CH6-SW1

Also here: https://developer.apple.com/documentation/metal/compute_processing/calculating_threadgroup_and_grid_sizes

You may also be interested in my answer here https://forums.developer.apple.com/thread/77958

Thanks! I am going to have to read through those links a BUNCH more times before I start to wrap my head around this. It's a great start though!

Also, with respect to compute, Metal is quite similar to OpenCL and CUDA. The terminology is different, and that can be super confusing, but the algorithms and all that are basically the same (it can be the same hardware, after all). So articles like, for example, the ones Mark Harris wrote on parallel programming primitives, or Vasily Volkov's performance-tuning material, can easily be applied to Metal compute as well.
