Article

Calculating Threadgroup and Grid Sizes

Calculate the optimum sizes for threadgroups and grids when dispatching compute-processing workloads.

Overview

In iOS 11 and macOS 10.13 and later, when you're ready to execute your compute kernel code, you need to specify the size of the grid and the number of threads per threadgroup. Metal then calculates the number of threadgroups and provides nonuniform threadgroups if the grid size isn't a multiple of the threadgroup size. This ensures that you have no underutilized threads.

In earlier versions of iOS and macOS, you need to specify the size and number of the threadgroups. Because the grid is composed of uniform threadgroups, it may not match the size of your data, in which case you need to add defensive code to your compute kernel to ensure that it's not executing outside the bounds of the data.

Calculate Threads per Threadgroup

You calculate the number of threads per threadgroup based on two MTLComputePipelineState properties. One property is maxTotalThreadsPerThreadgroup (the maximum number of threads that can be in a single threadgroup). The other is threadExecutionWidth (the number of threads scheduled to execute in parallel on the GPU).

The maxTotalThreadsPerThreadgroup property is dependent on the device, the register usage of your compute kernel, and threadgroup memory usage. After a compute pipeline state has been created, its maxTotalThreadsPerThreadgroup value doesn't change, but two pipeline states on the same device may return different values.

The number of threads per threadgroup can't exceed maxTotalThreadsPerThreadgroup. On a device with a maxTotalThreadsPerThreadgroup value of 512 and a threadExecutionWidth of 32, a suitable number of threads per threadgroup is 32 (the thread execution width) x 16 (the total threads per threadgroup divided by the thread execution width). Listing 1 shows an example of defining a threadgroup's dimensions based on the thread execution width and maximum threads per threadgroup.

Listing 1

Calculating threads per threadgroup.

NSUInteger w = pipelineState.threadExecutionWidth;
NSUInteger h = pipelineState.maxTotalThreadsPerThreadgroup / w;
MTLSize threadsPerThreadgroup = MTLSizeMake(w, h, 1);

On devices that support non-uniform threadgroup sizes, Metal is able to calculate how the grid (in this case, an image or texture) can be optimally divided into nonuniform, arbitrarily sized threadgroups. The compute command encoder's dispatchThreads:threadsPerThreadgroup: method requires the total number of threads—each thread being responsible for a single pixel—and the threadsPerThreadgroup value calculated in Listing 1, as shown in the following example:

MTLSize threadsPerGrid = MTLSizeMake(texture.width, texture.height, 1);
    
[computeCommandEncoder dispatchThreads: threadsPerGrid
                       threadsPerThreadgroup: threadsPerThreadgroup];

When Metal performs this calculation, it can generate smaller threadgroups along the edges of your grid, as shown below. When compared to uniform threadgroups, this technique simplifies kernel code and improves GPU performance.

To determine if a device supports non-uniform threadgroups, see Metal Feature Set Tables.

Nonuniform threadgroups.

Calculate Threadgroups per Grid

If you need fine control over the size and number of threadgroups, you can manually calculate how the grid is divided. In your code, ensure that there are sufficient threadgroups to cover the entire image. Here's an example:

MTLSize threadgroupsPerGrid = MTLSizeMake((texture.width + w - 1) / w,
                                          (texture.height + h - 1) / h,
                                          1);

Given a texture of 1024 x 768, the code above returns a MTLSize object of 32 x 48 x 1, meaning the texture is divided into 1536 threadgroups, each containing 512 threads, for a total of 786,432 threads. In this case, that number matches the number of pixels in the image, and the entire image is processed with no underutilization of threads. However, this may not always be the case (for example, for an image size of 1920 x 1080). This code, by rounding up, ensures that there are sufficient threads to process the entire image.

With this approach, the grid generated by the threadgroups could be larger than your data. Therefore, your code should exit early if the thread position in the grid is outside the bounds of the data. The following illustration shows how a set of 4 x 4 threadgroups extends over the bounds of a grid, leading to underutilization of threads:

Uniform threadgroups leading to underutilized threads.

Listing 4 shows a simple kernel that writes opaque white to each pixel in outputTexture. It begins by comparing the thread position to the bounds of the texture and returns if the position is outside of the texture's extent.

Listing 4

Exiting early when out of bounds.

kernel void
simpleKernelFunction(texture2d<float, access::write> outputTexture [[texture(0)]],
                     uint2 position [[thread_position_in_grid]]) {
    
    if (position.x >= outputTexture.get_width() || position.y >= outputTexture.get_height()) {
        return;
    }
    
    outputTexture.write(float4(1.0), position);
}

Note that this check isn't required with the previous dispatchThreads:threadsPerThreadgroup: technique.

Using the code in Listing 4, the final dispatch would be:

[computeCommandEncoder dispatchThreadgroups: threadgroupsPerGrid
                       threadsPerThreadgroup: threadsPerThreadgroup];

See Also

Parallel Computation

Hello Compute

Demonstrates how to perform data-parallel computations using the GPU.

About Threads and Threadgroups

Learn how Metal organizes compute-processing workloads.

MTLComputePipelineDescriptor

An object used to customize how a new compute pipeline state object is compiled.

MTLComputePipelineState

An object that contains a compiled compute pipeline.

MTLComputeCommandEncoder

An object used to encode commands in a compute pass.