Learn about the enhanced memory model that allows for flexible and efficient sharing of data between threads.
Metal 2 on A11 introduces a new memory model that adopts and extends the C++11 consistency model. This model has new capabilities that allow both threadgroups and the threads within a threadgroup to communicate with each other using atomic operations or a memory fence rather than expensive barriers.
An example of where threadgroups need to communicate is a kernel that sums an array of floats, as shown in Figure 1. Traditionally, you might implement this summation with a kernel that computes the sum of values per threadgroup and writes those values to an intermediate buffer. Because those threadgroups can’t communicate, you’d need to dispatch a second kernel that computes a final sum of the values of the intermediate buffer.
Because there’s a cost to launch each kernel, this approach may not be efficient. Also, because the second kernel uses a single threadgroup, it may not fully utilize the graphics processing unit (GPU).
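For reference, the first pass of that traditional approach might look like the following kernel sketch, which reduces each threadgroup's slice of the input into one partial sum. The buffer layout, kernel name, and the 256-thread group size are assumptions for illustration, not part of any Metal sample:

```metal
#include <metal_stdlib>
using namespace metal;

// First pass of the traditional two-kernel reduction (sketch).
// Each threadgroup writes one partial sum; a second dispatch
// must then sum the partialSums buffer.
kernel void sumPerThreadgroup(device const float *input       [[buffer(0)]],
                              device float       *partialSums [[buffer(1)]],
                              uint tid    [[thread_position_in_threadgroup]],
                              uint gid    [[thread_position_in_grid]],
                              uint tgid   [[threadgroup_position_in_grid]],
                              uint tgSize [[threads_per_threadgroup]])
{
    threadgroup float scratch[256];   // assumes <= 256 threads per group
    scratch[tid] = input[gid];        // assumes the grid exactly covers input
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Tree reduction within the threadgroup.
    for (uint stride = tgSize / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            scratch[tid] += scratch[tid + stride];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    if (tid == 0) {
        partialSums[tgid] = scratch[0];
    }
}
```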
With threadgroup sharing, one kernel and one dispatch can sum every element in the input array. You can use an atomic operation to calculate the number of completed threadgroups. When all the threadgroups have completed, the last executing threadgroup can compute the final sum of sums, as shown in Figure 2.
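One way to sketch that single-dispatch pattern is shown below. Each threadgroup publishes its partial sum, then atomically increments a completion counter; the threadgroup that observes the final count performs the sum of sums. This is an illustrative sketch, not Apple's sample code: buffer bindings and the group size are assumptions, and it uses relaxed atomics plus a device-scope barrier for visibility rather than acquire-release atomics.

```metal
#include <metal_stdlib>
using namespace metal;

// Single-dispatch reduction (sketch). The finishedCount buffer must be
// zeroed before the dispatch.
kernel void sumSingleDispatch(device const float *input         [[buffer(0)]],
                              device float       *partialSums   [[buffer(1)]],
                              device float       *result        [[buffer(2)]],
                              device atomic_uint *finishedCount [[buffer(3)]],
                              uint tid     [[thread_position_in_threadgroup]],
                              uint gid     [[thread_position_in_grid]],
                              uint tgid    [[threadgroup_position_in_grid]],
                              uint tgSize  [[threads_per_threadgroup]],
                              uint tgCount [[threadgroups_per_grid]])
{
    threadgroup float scratch[256];   // assumes <= 256 threads per group
    threadgroup bool  isLastGroup;

    scratch[tid] = input[gid];        // assumes the grid exactly covers input
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Tree reduction within the threadgroup.
    for (uint stride = tgSize / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            scratch[tid] += scratch[tid + stride];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    if (tid == 0) {
        partialSums[tgid] = scratch[0];
    }
    // Make this group's partial sum visible device-wide before it
    // counts itself as finished.
    threadgroup_barrier(mem_flags::mem_device);

    if (tid == 0) {
        uint finished = atomic_fetch_add_explicit(finishedCount, 1u,
                                                  memory_order_relaxed);
        isLastGroup = (finished == tgCount - 1);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // The last threadgroup to finish computes the final sum of sums.
    if (isLastGroup && tid == 0) {
        float total = 0.0f;
        for (uint i = 0; i < tgCount; ++i) {
            total += partialSums[i];
        }
        *result = total;
    }
}
```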
Metal 2 on A11 introduces atomic functions that allow mutually exclusive access to a memory location and allow you to specify how memory is synchronized between threads within or across threadgroups. You specify memory order and memory scope for each atomic operation.
Memory order specifies how memory operations are ordered around a synchronization operation. Relaxed order is the fastest mode: it guarantees exclusive access for the atomic operation itself, but it doesn’t order the surrounding memory operations. If you need to synchronize data between threads, use acquire-release memory order. In this mode, a thread that writes to memory performs a release so that other threads can perform a matching acquire on the same memory and read the latest data.
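Because the model adopts the C++11 explicit-order form, a release/acquire handoff between two threads would look roughly like the following sketch. The kernel name and buffer layout are assumptions, and support for orders beyond `memory_order_relaxed` depends on the MSL version and device, so check the Metal Shading Language specification before relying on these names:

```metal
#include <metal_stdlib>
using namespace metal;

// Producer/consumer handoff (sketch). A spin loop like this assumes both
// threads make forward progress concurrently; it's shown only to
// illustrate the ordering semantics.
kernel void handoff(device float       *data [[buffer(0)]],
                    device atomic_uint *flag [[buffer(1)]],
                    uint gid [[thread_position_in_grid]])
{
    if (gid == 0) {
        data[0] = 42.0f;   // write the payload
        // Release: publishes the payload together with the flag.
        atomic_store_explicit(flag, 1u, memory_order_release);
    } else if (gid == 1) {
        // Acquire: once the flag reads 1, the payload write is visible.
        while (atomic_load_explicit(flag, memory_order_acquire) == 0) { }
        data[1] = data[0]; // observes the released value
    }
}
```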
Memory scope follows the GPU memory hierarchy and specifies the set of threads that an atomic operation synchronizes with: the threads in a SIMD group, the threads in a threadgroup, or all threads on the device.
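The same scope hierarchy appears in MSL's barrier functions, which synchronize progressively wider sets of threads. The following sketch (kernel name and buffers are illustrative) places the three levels side by side:

```metal
#include <metal_stdlib>
using namespace metal;

kernel void scopeLevels(device atomic_uint *counter [[buffer(0)]],
                        uint tid [[thread_position_in_threadgroup]])
{
    threadgroup uint shared[32];
    shared[tid % 32] = tid;

    // SIMD-group scope: orders threadgroup memory among the threads
    // of one SIMD group only.
    simdgroup_barrier(mem_flags::mem_threadgroup);

    // Threadgroup scope: orders threadgroup memory across all threads
    // in the threadgroup.
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Device scope: makes device-memory writes visible to other
    // threadgroups.
    atomic_fetch_add_explicit(counter, 1u, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_device);
}
```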