What is faster: computeEncoders vs renderEncoder

Let's assume you want to convert a simple RGB texture into a black-and-white texture. This is only an example, so please ignore the fact that there are of course other techniques for this simple task.


method 1: render encoder

You can render two triangles with texture mapping and implement the fragment shader so that it converts the pixel values to black and white and writes them to another render target. That works well.
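A minimal sketch of what such a fragment shader could look like in Metal Shading Language; the names (grayscaleFragment, VertexOut) and the Rec. 709 luma weights are my assumptions, not taken from the question:

```metal
#include <metal_stdlib>
using namespace metal;

struct VertexOut {
    float4 position [[position]];
    float2 texCoord;
};

// Samples the source texture and writes its luminance to the render target.
fragment float4 grayscaleFragment(VertexOut in [[stage_in]],
                                  texture2d<float> src [[texture(0)]],
                                  sampler s [[sampler(0)]])
{
    float4 c = src.sample(s, in.texCoord);
    // Rec. 709 luma weights
    float y = dot(c.rgb, float3(0.2126, 0.7152, 0.0722));
    return float4(y, y, y, c.a);
}
```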


method 2: compute encoder

Here I use the same function as in the fragment shader above as a kernel function and read/write the pixel values directly from one texture to another. This also works.
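The compute version might look roughly like this; again the names are illustrative, and the bounds check is only needed when the dispatched grid is rounded up past the texture size:

```metal
#include <metal_stdlib>
using namespace metal;

// Reads one pixel per thread, converts it to luminance, writes it out.
kernel void grayscaleKernel(texture2d<float, access::read>  src [[texture(0)]],
                            texture2d<float, access::write> dst [[texture(1)]],
                            uint2 gid [[thread_position_in_grid]])
{
    // Guard against threads outside the texture when the grid is
    // rounded up to a multiple of the threadgroup size.
    if (gid.x >= dst.get_width() || gid.y >= dst.get_height()) return;
    float4 c = src.read(gid);
    float y = dot(c.rgb, float3(0.2126, 0.7152, 0.0722));
    dst.write(float4(y, y, y, c.a), gid);
}
```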


I was surprised that in my tests the compute encoder is much slower than the render encoder. I thought that, because there is no vertex processing and rasterization involved, the compute encoder should give better performance.

For a simple image-manipulation task like the one above, which doesn't need information from surrounding pixels, is it better to use fragment shaders instead of kernel functions?

I have similar experiences with compute kernels. They seem to be generally slower on most (if not all) devices I use, namely both GPUs of a 2012 rMBP (NVIDIA and Intel) as well as an iPad Pro 12.9. But the devil's in the details:

1) Some tasks (for example scatter, or communication between workers) are awkward or impossible when done via shaders and the render pipeline. Compute is more natural here, and can be faster, too.

2) I am under the impression that the iPad Pro can execute compute tasks in parallel with render tasks. Therefore, by moving part of the work from an overloaded renderer into a compute kernel, one can still get better overall performance.

3) It is harder to write (and schedule) an optimal compute kernel than its shader counterpart. One is basically left with many hard decisions here, like threadgroup sizes. What is even more important is that these may vary between devices. I believe that serious performance work in CUDA or OpenCL frequently involves writing "autotuning" code.

4) A naive compute kernel (for example, a simple 1:1 mapping between data size and number of threads/threadgroups) is usually less efficient than a similar kernel that processes several items (like 2, 4 or 8) at once.

5) On the other hand, the GPU and/or driver understands rendering better, so it may help with rendering in ways it can't with compute. You then get device-specific optimization "for free". I guess this is even more important on TBDRs with their specific tiling requirements.

6) I'd say that buffers are sometimes abused in compute kernels. Rendering usually uses textures, and their caches help a lot.

7) The hierarchical structure of rendering (vertex, then fragment shaders) sometimes allows for huge optimizations by simply moving "common" calculations from fragment to vertex shaders. This depends on the particulars, but it is usually a big win.
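To make point 3 concrete, here is a host-side sketch of one of those sizing decisions: how many threadgroups are needed to cover a texture. The helper name (threadgroupCount) and the 16x16 group size are illustrative assumptions, not from the original post; the count is rounded up so edge pixels are still covered, which is why the kernel must bounds-check against the texture size:

```swift
// Sketch: ceil-division to cover the whole texture with threadgroups.
// The results would feed MTLSize values for
// dispatchThreadgroups(_:threadsPerThreadgroup:).
func threadgroupCount(textureSize: (w: Int, h: Int),
                      groupSize: (w: Int, h: Int)) -> (w: Int, h: Int) {
    return ((textureSize.w + groupSize.w - 1) / groupSize.w,
            (textureSize.h + groupSize.h - 1) / groupSize.h)
}

// For a 1920x1080 texture and 16x16 threadgroups this yields 120x68
// groups; the last row of groups hangs 8 pixels past the bottom edge.
```

Whether 16x16 is a good group size is exactly the kind of per-device question that autotuning tries to answer.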
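As a sketch of point 4, a kernel can convert a 2x2 block per thread instead of a single pixel, so the dispatched grid is a quarter of the pixel count. The kernel name and block size here are my assumptions:

```metal
#include <metal_stdlib>
using namespace metal;

// Each thread converts a 2x2 block of pixels to luminance.
kernel void grayscale2x2(texture2d<float, access::read>  src [[texture(0)]],
                         texture2d<float, access::write> dst [[texture(1)]],
                         uint2 gid [[thread_position_in_grid]])
{
    uint2 base = gid * 2;
    for (uint dy = 0; dy < 2; ++dy) {
        for (uint dx = 0; dx < 2; ++dx) {
            uint2 p = base + uint2(dx, dy);
            if (p.x >= dst.get_width() || p.y >= dst.get_height()) continue;
            float4 c = src.read(p);
            float y = dot(c.rgb, float3(0.2126, 0.7152, 0.0722));
            dst.write(float4(y, y, y, c.a), p);
        }
    }
}
```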
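A small illustration of point 7, with names of my own invention: a fog factor that varies roughly linearly across a triangle can be computed once per vertex and handed to the rasterizer, instead of being recomputed for every fragment:

```metal
#include <metal_stdlib>
using namespace metal;

struct VIn  { float3 position; float2 uv; };
struct VOut {
    float4 position [[position]];
    float2 uv;
    float  fog;   // computed once per vertex, interpolated across the triangle
};

vertex VOut fogVertex(const device VIn *verts [[buffer(0)]],
                      constant float4x4 &mvp [[buffer(1)]],
                      uint vid [[vertex_id]])
{
    VOut out;
    out.position = mvp * float4(verts[vid].position, 1.0);
    out.uv = verts[vid].uv;
    // Hoisted out of the fragment shader; the rasterizer
    // interpolates the result for free.
    out.fog = saturate(out.position.z / 100.0);
    return out;
}
```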


And in your particular example above, the number of vertex shader invocations is very small, so it is unlikely to have a performance impact.

Best regards

Michal


PS. What I wrote above is based on Metal experience. I also have experience with OpenCL/OpenGL on various other devices (older MBPs, standalone PC GPUs), and most of what I wrote above holds there, too. I remember writing CL kernels that were faster than GL shaders, but the kernels were very painfully optimized, AND GL had performance penalties when switching render targets, which hampered the performance of GL shaders doing compute work.
