Hi - I'm trying to work out whether it is better to use a single Metal function in a compute pipeline, or to split the function into multiple parts.
For context, my code traces NRAYS rays for each of NSTARS star locations through a reflecting telescope and calculates the resulting star shapes as the rays hit the detector.
I can trace all of the rays for each star in a single Metal compute function (one thread per ray, so NRAYS * NSTARS threads), but then for each star I need to work out the average location of the rays as they hit the detector. I could make NRAYS equal to maxTotalThreadsPerThreadgroup, so that each star maps to one threadgroup, and then use threadgroup_barrier to ensure that all rays for a star have been traced before averaging.
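To make this first option concrete, here is roughly what I have in mind. It is an untested sketch: traceRay, traceAndAverage, centroids and the buffer index are placeholder names of my own, traceRay stands in for the real optics, and I'm assuming one threadgroup per star with NRAYS a power of two equal to the threadgroup size. I gather the pipeline's maxTotalThreadsPerThreadgroup can come out below 1024 depending on what the kernel uses, so the constant might have to shrink.

```metal
#include <metal_stdlib>
using namespace metal;

// Assumption: equals threadsPerThreadgroup.width on the host side,
// and is a power of two so the halving loop below covers every element.
constant uint NRAYS = 1024;

// Hypothetical placeholder for the real ray trace.
// Returns the (x, y) position where the ray lands on the detector.
static float2 traceRay(uint star, uint ray)
{
    return float2(float(star), float(ray)); // stand-in values only
}

// Dispatched as NSTARS threadgroups of NRAYS threads each (1D).
kernel void traceAndAverage(device float2 *centroids [[buffer(0)]],
                            uint star [[threadgroup_position_in_grid]],
                            uint ray  [[thread_index_in_threadgroup]])
{
    // One slot per ray: 1024 * sizeof(float2) = 8 KB of threadgroup memory.
    threadgroup float2 hits[NRAYS];

    hits[ray] = traceRay(star, ray);

    // Wait until every ray of this star has written its hit.
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Tree reduction: halve the number of active threads each pass.
    for (uint stride = NRAYS / 2; stride > 0; stride >>= 1) {
        if (ray < stride) {
            hits[ray] += hits[ray + stride];
        }
        // The barrier sits outside the if so that all threads reach it.
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    // Thread 0 now holds the sum of all hits for this star.
    if (ray == 0) {
        centroids[star] = hits[0] / float(NRAYS);
    }
}
```

The averaging here never leaves threadgroup memory, which is the attraction, but the number of rays per star is locked to the threadgroup size.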
Alternatively, I could break the code into several functions, dispatched one after another but still all in one command encoder. That way I can vary the number of rays as I wish (I may need more than 1024 rays to get a good star shape).
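For comparison, this is a rough sketch of the split version, again untested and with placeholder names (traceRay, traceRays, averageHits, and the buffer indices are my own invention). The first function writes every hit to a device buffer, and the second, dispatched afterwards, uses one thread per star to average that star's hits. I've kept the averaging as a plain serial loop for clarity; I assume a parallel reduction would be the next refinement if it turns out to be slow.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical placeholder for the real ray trace.
static float2 traceRay(uint star, uint ray)
{
    return float2(float(star), float(ray)); // stand-in values only
}

// Pass 1: 2D grid, x = ray index, y = star index.
kernel void traceRays(device float2 *hits    [[buffer(0)]],
                      constant uint &nRays   [[buffer(1)]],
                      constant uint &nStars  [[buffer(2)]],
                      uint2 gid [[thread_position_in_grid]])
{
    // Guard against threads past the end if the grid was rounded up.
    if (gid.x >= nRays || gid.y >= nStars) {
        return;
    }
    hits[gid.y * nRays + gid.x] = traceRay(gid.y, gid.x);
}

// Pass 2: 1D grid, one thread per star.
kernel void averageHits(device const float2 *hits [[buffer(0)]],
                        constant uint &nRays      [[buffer(1)]],
                        constant uint &nStars     [[buffer(2)]],
                        device float2 *centroids  [[buffer(3)]],
                        uint star [[thread_position_in_grid]])
{
    if (star >= nStars) {
        return;
    }
    float2 sum = float2(0.0f);
    for (uint r = 0; r < nRays; ++r) {
        sum += hits[star * nRays + r];
    }
    centroids[star] = sum / float(nRays);
}
```

Here nRays is just a value passed in a buffer, so it can exceed 1024, at the cost of a round trip through device memory for the hits.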
However, this is my first experience of programming GPGPU code and I don't yet have any feel for the relative timings of the two methods. If the first method were going to be a lot faster, I could accept the constraint of not being able to vary the number of rays.
Any advice gratefully received.
Thank you!
Colin