Learn about precisely controlling the order of parallel fragment shader threads accessing the same pixel coordinates.
Metal 2 introduces raster order groups that give ordered memory access from fragment shaders and simplify rendering techniques, such as order-independent transparency, dual-layer G-buffers, and voxelization.
Given a scene containing two overlapping triangles, Metal guarantees that blending happens in draw call order, giving the illusion that the triangles are rendered serially. Figure 1 shows a blue triangle partially occluded by a green triangle.
However, behind the scenes, the process is highly parallel; multiple threads are running concurrently, and there’s no guarantee that the fragment shader for the rear triangle has executed before the fragment shader for the front triangle. Figure 1 shows that although the two threads execute concurrently, the blending happens in draw call order.
A custom blend function in your fragment shader may need to read the rear triangle’s result before blending in the front triangle’s fragment. Because of concurrency, this read–modify–write sequence can create a race condition. Figure 2 shows thread 2 attempting to read the same memory that thread 1 is simultaneously writing.
Raster Order Groups for Overcoming Access Conflict
Raster order groups overcome this access conflict by synchronizing threads that target the same pixel coordinates and sample (when per-sample shading is active). You implement raster order groups by annotating pointers to memory with an attribute qualifier. Accesses through those pointers then occur in per-pixel submission order: the hardware waits for any older fragment shader threads that overlap the current thread to finish before the current thread proceeds.
Figure 3 shows how raster order groups synchronize both threads so that thread 2 waits until the write is complete before attempting to read that piece of memory.
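A minimal sketch of this annotation in the Metal Shading Language follows. The shader, function, and attachment names are illustrative, and the blend operation is a simple “over” blend standing in for any custom blend function; the key element is the `raster_order_group(0)` attribute on the read-write texture, which makes the hardware serialize overlapping fragments’ accesses in submission order.

```metal
#include <metal_stdlib>
using namespace metal;

struct FragmentIn {
    float4 position [[position]];
    float4 color;
};

// Hypothetical custom-blend fragment shader. Without raster_order_group(0),
// the read below could race with an older overlapping fragment's write.
fragment void customBlend(
    FragmentIn in [[stage_in]],
    texture2d<float, access::read_write> framebuffer
        [[texture(0), raster_order_group(0)]])
{
    uint2 coord = uint2(in.position.xy);

    // This read waits until all older fragments covering the same
    // pixel have finished their writes to the texture.
    float4 dst = framebuffer.read(coord);

    // Any custom blend operation; a premultiplied "over" blend shown here.
    float4 blended = in.color + dst * (1.0f - in.color.a);

    framebuffer.write(blended, coord);
}
```

Accesses to memory that is *not* annotated with the attribute remain unordered, so only the pointers that actually participate in the read–modify–write sequence pay the synchronization cost.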
Extended Raster Order Groups with Metal 2 on A11
Metal 2 on A11 extends raster order groups with additional capabilities. First, it allows synchronization of individual channels of an imageblock and threadgroup memory. Second, it allows for the creation of multiple order groups, giving you finer-grained synchronization and minimizing how often your threads wait for access.
Deferred shading is one example of where the additional capabilities of raster order groups on the A11 graphics processing unit (GPU) improve performance. Traditionally, deferred shading requires two phases. The first phase fills a G-buffer and produces multiple textures. The second phase consumes those textures and calculates the shading results to render the light volumes, as shown in Figure 4.
Because the intermediate textures are written to and read from device memory, deferred shading is bandwidth intensive. The A11 GPU is able to leverage multiple order groups to coalesce both render phases into one, eliminating the need for the intermediate textures. Furthermore, it can keep the G-buffer in tile-sized chunks that remain in local imageblock memory.
To demonstrate how the A11 GPU’s multiple order groups can improve the performance of deferred shading, Figure 5 shows how a traditional GPU schedules threads for the lighting phase. The thread responsible for the second light must wait for the prior threads’ accesses to complete before it can begin. This wait forces the two threads to run serially, even when their accesses don’t conflict with each other.
Figure 6 shows how multiple order groups allow you to run the nonconflicting reads concurrently, with the two threads synchronizing at the end of execution to accumulate the lights. You achieve this by declaring the three G-buffer fields—albedo, normal, and depth—to be in the first group, and the accumulated lighting result to be in the second group. The A11 GPU is able to order the two groups separately, and outstanding writes into the second group don’t require reads in the first group to wait.
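This two-group layout can be sketched as the following imageblock structure and light-accumulation fragment shader. The structure, field, and function names are illustrative, and the diffuse calculation is a deliberately simplified stand-in for real light-volume shading (for example, depth would normally be used to reconstruct position; that step is omitted here).

```metal
#include <metal_stdlib>
using namespace metal;

// G-buffer fields (albedo, normal, depth) live in order group 0;
// the accumulated lighting result lives in order group 1, so
// nonconflicting G-buffer reads from overlapping light volumes
// can proceed concurrently.
struct GBufferData {
    half4 lighting [[color(0), raster_order_group(1)]];
    half4 albedo   [[color(1), raster_order_group(0)]];
    half4 normal   [[color(2), raster_order_group(0)]];
    float depth    [[color(3), raster_order_group(0)]];
};

struct LightVolumeIn {
    float4 position [[position]];
    float3 lightColor;       // hypothetical per-light data
    float3 lightDirection;
};

fragment GBufferData accumulateLight(LightVolumeIn in [[stage_in]],
                                     GBufferData gBuffer)
{
    // Reads of group 0 don't wait on outstanding group 1 writes
    // from older overlapping light-volume fragments.
    half3 n = normalize(gBuffer.normal.xyz);
    half diffuse = max(dot(n, half3(in.lightDirection)), 0.0h);

    GBufferData out = gBuffer;

    // Only this accumulation into group 1 is ordered between
    // overlapping fragments.
    out.lighting = gBuffer.lighting +
        half4(half3(in.lightColor) * gBuffer.albedo.xyz * diffuse, 0.0h);
    return out;
}
```

Had every field been placed in a single group, each light’s G-buffer reads would have stalled on the previous light’s pending write to the accumulation field; splitting the fields into two groups confines the ordering to the one access that genuinely conflicts.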
With multiple order groups, more threads are eligible to run concurrently, allowing for more parallelism and improved performance.