Calculate the optimum sizes for threadgroups and grids when dispatching compute-processing workloads.
In iOS 11 and macOS 10.13 and later, when you're ready to execute your compute kernel code, you need to specify the size of the grid and the number of threads per threadgroup. Metal then calculates the number of threadgroups and provides nonuniform threadgroups if the grid size isn't a multiple of the threadgroup size. This ensures that you have no underutilized threads.
In earlier versions of iOS and macOS, you need to specify the size and number of the threadgroups. Because the grid is composed of uniform threadgroups, it may not match the size of your data, in which case you need to add defensive code to your compute kernel to ensure that it's not executing outside the bounds of the data.
Calculate Threads per Threadgroup
You calculate the number of threads per threadgroup based on two MTLComputePipelineState properties. One property is maxTotalThreadsPerThreadgroup (the maximum number of threads that can be in a single threadgroup). The other is threadExecutionWidth (the number of threads scheduled to execute in parallel on the GPU).
The maxTotalThreadsPerThreadgroup property depends on the device, the register usage of your compute kernel, and its threadgroup memory usage. After a compute pipeline state has been created, its maxTotalThreadsPerThreadgroup value doesn't change, but two pipeline states on the same device may return different values.
The number of threads per threadgroup can't exceed maxTotalThreadsPerThreadgroup. On a device where maxTotalThreadsPerThreadgroup is 512 and threadExecutionWidth is 32, a suitable number of threads per threadgroup is 32 (the thread execution width) x 16 (the maximum threads per threadgroup divided by the thread execution width). Listing 1 shows an example of defining a threadgroup's dimensions based on the thread execution width and maximum threads per threadgroup.
On devices that support nonuniform threadgroup sizes, Metal can calculate how the grid (in this case, an image or texture) is optimally divided into nonuniform, arbitrarily sized threadgroups. The compute command encoder's dispatchThreads(_:threadsPerThreadgroup:) method requires the total number of threads (each thread is responsible for a single pixel) and the threadsPerThreadgroup value calculated in Listing 1, as shown in the following example:
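A sketch of that dispatch, assuming computeEncoder is an existing MTLComputeCommandEncoder, texture is the MTLTexture being processed, and threadsPerThreadgroup holds the size calculated in Listing 1:

```swift
// One thread per pixel: the grid size is the texture's size.
let threadsPerGrid = MTLSize(width: texture.width,
                             height: texture.height,
                             depth: 1)
computeEncoder.dispatchThreads(threadsPerGrid,
                               threadsPerThreadgroup: threadsPerThreadgroup)
```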
When Metal performs this calculation, it can generate smaller threadgroups along the edges of your grid, as shown below. When compared to uniform threadgroups, this technique simplifies kernel code and improves GPU performance.
To determine whether a device supports nonuniform threadgroups, see the Metal Feature Set Tables.
Calculate Threadgroups per Grid
If you need fine control over the size and number of threadgroups, you can manually calculate how the grid is divided. In your code, ensure that there are sufficient threadgroups to cover the entire image. Here's an example:
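A sketch of that calculation, assuming pipelineState and texture exist as before; the integer rounding-up division ensures full coverage:

```swift
let w = pipelineState.threadExecutionWidth
let h = pipelineState.maxTotalThreadsPerThreadgroup / w
let threadsPerThreadgroup = MTLSize(width: w, height: h, depth: 1)
// Round up so the threadgroups cover the entire texture, even when
// the texture dimensions aren't multiples of the threadgroup size.
let threadgroupsPerGrid = MTLSize(width: (texture.width + w - 1) / w,
                                  height: (texture.height + h - 1) / h,
                                  depth: 1)
```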
Given a texture of 1024 x 768, the code above returns an MTLSize object of 32 x 48 x 1, meaning the texture is divided into 1536 threadgroups, each containing 512 threads, for a total of 786,432 threads. In this case, that number matches the number of pixels in the image, and the entire image is processed with no underutilization of threads. However, this may not always be the case (for example, for an image size of 1920 x 1080). This code, by rounding up, ensures that there are sufficient threads to process the entire image.
With this approach, the grid generated by the threadgroups could be larger than your data. Therefore, your code should exit early if the thread position in the grid is outside the bounds of the data. The following illustration shows how a set of 4 x 4 threadgroups extends over the bounds of a grid, leading to underutilization of threads:
Listing 4 shows a simple kernel that writes opaque white to each pixel in the output texture. It begins by comparing the thread position to the bounds of the texture and returns if the position is outside the texture's extent.
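A kernel along those lines can be sketched in the Metal Shading Language as follows (the function name fillWithWhite, the parameter name output, and the texture binding index are illustrative assumptions):

```metal
#include <metal_stdlib>
using namespace metal;

kernel void fillWithWhite(texture2d<half, access::write> output [[texture(0)]],
                          uint2 gid [[thread_position_in_grid]])
{
    // Exit early if this thread's position lies outside the texture's bounds.
    if (gid.x >= output.get_width() || gid.y >= output.get_height()) {
        return;
    }
    // Write opaque white to this thread's pixel.
    output.write(half4(1.0), gid);
}
```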
Note that this check isn't required with the previous dispatchThreads(_:threadsPerThreadgroup:) approach, because nonuniform threadgroups ensure that no threads execute outside the bounds of the grid.
Using the code in Listing 4, the final dispatch would be:
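A sketch of that dispatch, assuming threadgroupsPerGrid and threadsPerThreadgroup were computed as described above:

```swift
// Dispatch a uniform grid of threadgroups; the kernel's bounds check
// discards any threads that fall outside the texture.
computeEncoder.dispatchThreadgroups(threadgroupsPerGrid,
                                    threadsPerThreadgroup: threadsPerThreadgroup)
```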