This is an interesting thread, thanks for the original post. I was under the impression that Metal was almost always faster than OpenCL, because OpenCL is an old and clunky higher-level API while Metal goes right to the metal, as they say. I guess not?

I would suggest looking into the global memory access times if you can, i.e. the reads and writes to the arrays declared in the device address space. I am unsure of the latest tools for this, but you can probe the problem by removing the device arrays from the picture and just doing fake calculations. For example, try setting up local arrays in thread or threadgroup storage instead of device storage for the arrays xyin, xyout and spherical_params (see https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf section 4.3), and don't even bother to initialize them. Just let the array dereferences and calculations stay the same in your kernel and let it run on bogus float data in those threadgroup arrays. See if it is faster. It should be a lot faster. If it is fa
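
Here is a rough sketch of the experiment described above. It is not your actual kernel: the kernel name, `TG_SIZE`, the `sink` buffer and the arithmetic in the middle are all placeholders I made up, and only the array names xyin, xyout and spherical_params come from the original post. The single write to `sink` per threadgroup is my own addition, on the assumption that the compiler may otherwise throw away calculations whose results are never stored anywhere visible.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical threadgroup width. It must match the threadsPerThreadgroup
// value the host uses for a 1D dispatch.
#define TG_SIZE 256

// Timing experiment only: the real device arrays are replaced by
// threadgroup arrays that are deliberately left uninitialized.
kernel void remap_no_device_reads(
    device float *sink [[buffer(0)]],            // hypothetical: one float per threadgroup
    uint lid  [[thread_position_in_threadgroup]],
    uint tgid [[threadgroup_position_in_grid]])
{
    // Stand-ins for the original device arrays. Their contents are garbage,
    // so the results are meaningless; only the run time is of interest.
    threadgroup float2 xyin[TG_SIZE];
    threadgroup float2 xyout[TG_SIZE];
    threadgroup float  spherical_params[4];

    // Placeholder arithmetic. Paste the real kernel body here and index
    // with lid instead of the grid position.
    float2 p     = xyin[lid];
    float  r     = length(p) * spherical_params[0] + spherical_params[1];
    float  theta = atan2(p.y, p.x);
    xyout[lid]   = r * float2(cos(theta), sin(theta));

    // One device write per threadgroup, so the computation has an
    // observable result without reintroducing per-thread device traffic.
    if (lid == 0) {
        sink[tgid] = xyout[0].x + xyout[0].y;
    }
}
```

Time this against the original kernel with the same grid and threadgroup sizes. If the run time collapses, the device memory traffic is the likely bottleneck rather than the math. If it barely changes, the time is going somewhere else.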