Hi
It just so happens that I had a similar problem a while ago. In my case it was a sparse texture transformation: a big (>8k) texture, subdivided into hundreds of thousands of rectangular "regions", some of which had to undergo the transformation while others did not. I tested three configurations:
- Running everything regardless of necessity. This was the slowest (duh).
- Generating a triangular mesh on the fly using vertex_id in the vertex shader (two triangles per region) and culling the regions that were not needed (I moved them off screen, if I remember correctly). This was much faster, but still carried significant vertex shader overhead, as the vertex shader had to run for both triangles of every region, including the culled ones.
- The final solution was to generate indices and process only the regions that had changed. I did this not with atomic functions, but with a parallel prefix scan (Hillis & Steele). This was the fastest, but generating the indices had its own cost, too (on the order of 1 ms for 100k regions).
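To make the third option more concrete, here is a minimal CPU-side sketch in Python of how a Hillis & Steele scan can turn per-region "changed" flags into a compact list of indices. It is not my original shader code; the names hillis_steele_inclusive_scan and compact_indices are made up for illustration, and on a GPU each pass of the while loop would be one parallel step over double-buffered data.

```python
def hillis_steele_inclusive_scan(flags):
    """Inclusive prefix sum in the Hillis & Steele style.

    Each pass of the while loop corresponds to one parallel step on a GPU:
    every element i adds the value that sits `offset` places to its left.
    """
    data = list(flags)
    n = len(data)
    offset = 1
    while offset < n:
        prev = data  # read only from the previous pass (double buffering)
        data = [prev[i] + prev[i - offset] if i >= offset else prev[i]
                for i in range(n)]
        offset *= 2
    return data


def compact_indices(changed):
    """Return the indices of the regions whose flag is set, in order."""
    flags = [1 if c else 0 for c in changed]
    scan = hillis_steele_inclusive_scan(flags)
    total = scan[-1] if scan else 0
    out = [0] * total
    for i, c in enumerate(changed):
        if c:
            # Inclusive scan value minus one is this region's output slot.
            out[scan[i] - 1] = i
    return out
```

The scatter loop in compact_indices is also independent per element, so it maps to one thread per region. The resulting list is what you would then feed to the kernel or draw call that processes only the changed regions.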
Now, I don't know if I understand your problem completely. But if all the points are in some spherical region, why not execute your compute kernel over a rectangular region that is work-group aligned and encloses the spherical region? That should be trivial to do: you'd just need to pass an "initial offset" to your compute kernel and work on thread_id + initial_offset instead of thread_id.
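A rough sketch of the host-side arithmetic, in Python, assuming a 2D grid and a circle given by a center and radius (the same idea extends to 3D by passing three components). The function name aligned_dispatch and its parameters are hypothetical, not part of any API.

```python
import math


def aligned_dispatch(center, radius, group_size, grid_size):
    """Compute a work-group-aligned box that encloses a circle/sphere.

    Returns (initial_offset, group_count) per axis. The kernel would then
    work on thread_id + initial_offset instead of thread_id.
    """
    offset = []
    groups = []
    for c, g, n in zip(center, group_size, grid_size):
        lo = max(0, math.floor(c - radius))       # first cell touched
        hi = min(n, math.ceil(c + radius) + 1)    # one past the last cell
        lo_aligned = (lo // g) * g                # round down to a group boundary
        count = max(0, -(-(hi - lo_aligned) // g))  # ceiling division
        offset.append(lo_aligned)
        groups.append(count)
    return tuple(offset), tuple(groups)


# Example: circle at (100, 50) with radius 10, 16x16 work groups, 1024x1024 grid.
initial_offset, group_count = aligned_dispatch((100, 50), 10, (16, 16), (1024, 1024))
```

Since the last work group on each axis can reach past the edge of the grid, the kernel should still bounds-check thread_id + initial_offset before touching memory.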
If that fails to bring enough improvement, well, the "indices" solution gave me a good speedup. But I don't know if using atomic operations will be efficient enough, and a parallel prefix scan is a bit of programming work in itself.
Good luck!
Michal