About Threadgroup Sharing

Learn about the enhanced memory model that allows for flexible and efficient sharing of data between threads.

Overview

Metal 2 on A11 introduces a new memory model that adopts and extends the C++11 consistency model. This model has new capabilities that allow both threadgroups and the threads within a threadgroup to communicate with each other using atomic operations or a memory fence rather than expensive barriers.

An example of where threadgroups need to communicate is a kernel that sums an array of floats, as shown in Figure 1. Traditionally, you might implement this summation with a kernel that computes the sum of values per threadgroup and writes those values to an intermediate buffer. Because those threadgroups can’t communicate, you’d need to dispatch a second kernel that computes a final sum of the values of the intermediate buffer.

Because there’s a cost to launch each kernel, this approach may not be efficient. Also, because the second kernel uses a single threadgroup, it may not fully utilize the graphics processing unit (GPU).
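The two-kernel structure described above can be sketched on the CPU in plain C++. This is only an analogy, not Metal Shading Language: each function stands in for one kernel dispatch, and the names `partial_sums`, `final_sum`, and `two_pass_sum` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Pass 1 (stand-in for the first kernel): each "threadgroup" sums its own
// slice of the input and writes one value to an intermediate buffer.
std::vector<float> partial_sums(const std::vector<float>& input,
                                std::size_t group_size) {
    assert(group_size > 0);
    std::vector<float> intermediate;
    for (std::size_t begin = 0; begin < input.size(); begin += group_size) {
        std::size_t end = std::min(begin + group_size, input.size());
        intermediate.push_back(
            std::accumulate(input.begin() + begin, input.begin() + end, 0.0f));
    }
    return intermediate;
}

// Pass 2 (stand-in for the second kernel): a single group sums the
// intermediate buffer to produce the final result.
float final_sum(const std::vector<float>& intermediate) {
    return std::accumulate(intermediate.begin(), intermediate.end(), 0.0f);
}

// Two separate passes, mirroring the two dispatches.
float two_pass_sum(const std::vector<float>& input, std::size_t group_size) {
    return final_sum(partial_sums(input, group_size));
}
```

The second pass cannot start until the first has written the whole intermediate buffer, which is why the GPU version needs two dispatches.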

Figure 1

Using two kernels to sum an array of floats

With threadgroup sharing, one kernel and one dispatch can sum every element in the input array. You can use an atomic operation to count the threadgroups that have completed. When every threadgroup has written its partial sum, the last threadgroup to finish computes the final sum of sums, as shown in Figure 2.
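The single-dispatch pattern can be illustrated with a CPU-side C++ analogy, where each `std::thread` stands in for one threadgroup and a `std::atomic` counter tracks how many have finished. This is a sketch under those assumptions rather than Metal Shading Language, and `single_pass_sum` is a hypothetical name.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

float single_pass_sum(const std::vector<float>& input, std::size_t group_count) {
    assert(group_count > 0);
    std::vector<float> partial(group_count, 0.0f);  // one slot per "threadgroup"
    std::atomic<std::size_t> completed{0};          // number of finished groups
    float total = 0.0f;                             // written by the last group only

    auto group = [&](std::size_t g) {
        // Each group sums its own contiguous slice of the input.
        std::size_t chunk = (input.size() + group_count - 1) / group_count;
        std::size_t begin = std::min(g * chunk, input.size());
        std::size_t end = std::min(begin + chunk, input.size());
        partial[g] =
            std::accumulate(input.begin() + begin, input.begin() + end, 0.0f);

        // The release half publishes this group's partial sum; the acquire
        // half lets the last group see every partial sum published earlier.
        std::size_t finished_before =
            completed.fetch_add(1, std::memory_order_acq_rel);

        // Exactly one group observes the final count and sums the partials.
        if (finished_before + 1 == group_count) {
            total = std::accumulate(partial.begin(), partial.end(), 0.0f);
        }
    };

    std::vector<std::thread> workers;
    for (std::size_t g = 0; g < group_count; ++g) {
        workers.emplace_back(group, g);
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return total;
}
```

Because the counter increment is atomic, exactly one group sees the final count, so the sum of sums runs once without a second dispatch.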

Figure 2

Using one kernel to sum an array of floats

Metal 2 on A11 introduces atomic functions that allow mutually exclusive access to a memory location and allow you to specify how memory is synchronized between threads within or across threadgroups. You specify memory order and memory scope for each atomic operation.

Memory order specifies how other memory operations are ordered around a synchronization operation. Relaxed memory order is the fastest mode: it guarantees exclusive access for the atomic operation itself, but it doesn't order any surrounding reads or writes. If you need to synchronize data between threads, use acquire-release memory order. In this mode, a thread that writes data performs a release after the write, and a thread that reads the data performs an acquire before the read, so the reader sees the latest data.
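The difference between the two memory orders can be shown with the C++11 atomics that this model is based on. The following is a CPU-side sketch, not Metal Shading Language, and the function names `publish_and_read` and `relaxed_count` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Acquire-release: the writer publishes ordinary (non-atomic) data, then
// releases a flag; the reader acquires the flag before reading the data.
int publish_and_read(int value) {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};   // synchronization flag
    int observed = -1;

    std::thread writer([&] {
        payload = value;                               // write the data
        ready.store(true, std::memory_order_release);  // then release it
    });
    std::thread reader([&] {
        while (!ready.load(std::memory_order_acquire)) {  // acquire the flag
            std::this_thread::yield();
        }
        observed = payload;  // guaranteed to see the writer's value
    });

    writer.join();
    reader.join();
    return observed;
}

// Relaxed: sufficient when only the atomic value itself matters, as with a
// plain counter. No other memory is ordered around these increments.
int relaxed_count(int threads, int increments) {
    assert(threads >= 0 && increments >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, increments] {
            for (int i = 0; i < increments; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return counter.load();
}
```

If the flag in `publish_and_read` used relaxed order instead, the reader would have no guarantee of seeing the write to `payload`, even after it observed the flag as set.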

Memory scope is based on the GPU memory hierarchy and specifies whether the atomic operation needs to be synchronized between the threads in a SIMD group, in a threadgroup, or across the whole device.

See Also

GPU Family 4 Features

About Imageblocks

Learn how imageblocks allow you to define and manipulate custom per-pixel data structures in high-bandwidth tile memory.

About Tile Shading

Learn about combining rendering and compute operations into a single render pass while sharing local memory.

About Raster Order Groups

Learn about precisely controlling the order of parallel fragment shader threads accessing the same pixel coordinates.

About Enhanced MSAA and Imageblock Sample Coverage Control

Learn about accessing multisample tracking data within a tile shader, enabling development of custom MSAA resolve algorithms, and more.