Understanding GPU architecture to write better compute programs

Hi, I have a CUDA program that I want to port to Metal compute so that we can support Apple hardware. When I wrote the CUDA version, I was able to write efficient code because I first learned about the CUDA core architecture. For instance, knowing how the cores access memory was very important for writing code that accesses memory efficiently. Now I want to do the same for Metal compute, but I cannot find any information about the low-level architecture, and especially about the things you need to know to write efficient code. Am I missing something? Is there a guide with hints on, for instance, the most efficient way to access memory?