sparse compute kernel

I have a compute kernel which only modifies a small number of points in a buffer (points within some spherical region). The transformation within the conditional is somewhat expensive.


Would it, generally speaking, be better to first generate the affected indices in an initial pass (using atomic operations to append to a buffer), and then pass them to the second pass which computes teh transformation via dispatchThreadgroups indirectBuffer?


thanks!

Hi


I just so happens, that I had similar problem a while ago. In my case it was sparce texture transformation. Big (>8k) texture, subdiviced into 100ks of rectangular "regions" and some of these had to undergo transformation, while others did not. I have tested three configurations:

- Running everything regardless of necessity. This was slowest (duh)

- Generating triangular mesh on fly using vertex_id in vertex shader (two triangles for each region) and culling away not needed regions (moved them out of screen, if I remember correctly). This was much faster, but still included significant vertex shader overhead (as vertex shader had to be run for each two triangles of each region)

- Final solution was to generate indices and process only regions that have changed. This I did not by atomic functions, but by parallel-prefix-scan algorithm (Hillis & Steele). This final solution was fastest, but generating indices had its own cost, too (in the order of 1ms for 100k of regions).


Now, I don't know if I understand your problem completely. But if all points are in some sperical region, why won't you try and execute your compute kernel over some rectangular region that is work-group aligned and encloses spherical region? That should be trivial to do, you'd just need to add "initial offset" to your compute kernel and work on thread_id + initial_offset instead of thread_id.


It that fails to bring enough improvements, well, "indices" solution brought good speedup for me. But I don't know if using atomic operations will be efficient enough, and parallel-prefix-scan is a bit of programming work in itself.


Good luck!

Michal

Thanks for the reply! I'm not sure how to apply the prefix scan to this problem. What would the binary operator for the scan be?


To answer your question: I can't simply execute over a rectangular region because the points are scattered data (just an array of vertices). I'm currently working on binning them in a regular grid so that I can execute over a region.

I do it following way: first, I scan over all regions and assign binary weights to all of them. 0s to ones that don't have to be processed, and 1s to ones that should. Then I do parallel prefix scan with ordinary integer plus operator over weights, into separate table, lets say 'indices'. After this step if weight[ i ] == 1 then indices[ i ] is offset at which I can put 'i', and indices[ last + 1 ] is number of nonzero weights. Like this:

offsets:

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

weights:

| 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |

indices

| 0 | 0 | 1 | 1 | 1 | 1 | 2 | 3 | 3

so there are 3 nonzero weights and final "work" set of index is

| 1 | 5 | 6 |


Hope that helps

Michal

Got it! Really appreciate the explanation! That's quite clever.

I forgot to add that one has to generate indirect draw (for render pipeline) or indirect compute parameters of of that, so as to fire just enough vertex shaders/compute kernels.


I had this OpenGL particle system using geometry shaders for removing "dead" or adding more (splitting) particles. When I ported that to Metal, I came up with this way of emulating geometry shader. Weight == 0 for things you want to remove, == 1 for things you want to keep, and > 1 when you want to add something in the middle of the buffer.


If you're investigating vertex shader/fragment shader alternative as well check out neighbouring thread "Poor points rendering performance" by Marek Simonik. I now believe that in certain cases grouping of objects drawn can impact performance on iOS.

sparse compute kernel
 
 
Q