About Raster Order Groups

Learn about precisely controlling the order of parallel fragment shader threads accessing the same pixel coordinates.


Metal 2 introduces raster order groups, which give fragment shaders ordered memory access and simplify rendering techniques such as order-independent transparency, dual-layer G-buffers, and voxelization.

Given a scene containing two overlapping triangles, Metal guarantees that blending happens in draw call order, giving the illusion that the triangles are rendered serially. Figure 1 shows a blue triangle partially occluded by a green triangle.

However, behind the scenes, the process is highly parallel; multiple threads are running concurrently, and there’s no guarantee that the fragment shader for the rear triangle has executed before the fragment shader for the front triangle. Figure 1 shows that although the two threads execute concurrently, the blending happens in draw call order.

Figure 1

Blending of two triangles in draw call order

A custom blend function in your fragment shader may need to read the results of the rear triangle’s fragment shader before applying that function based on the front triangle’s fragment. Because of concurrency, this read–modify–write sequence can create a race condition. Figure 2 shows thread 2 attempting to simultaneously read the same memory that thread 1 is writing.

Figure 2

Attempting to simultaneously read and write the same memory

Raster Order Groups for Overcoming Access Conflict

Raster order groups overcome this access conflict by synchronizing threads that target the same pixel coordinates (and the same sample, if per-sample shading is active). You implement raster order groups by annotating pointers to memory with an attribute qualifier. Access through those pointers then happens in per-pixel submission order: the hardware waits for any older fragment shader threads that overlap the current thread to finish before the current thread proceeds.
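As a minimal sketch of that annotation in the Metal Shading Language, the fragment function below marks a read-write texture with `raster_order_group(0)` and then performs a read-modify-write on it. The names `blendFragment` and `customBlend`, the texture index, the constant source color, and the 50/50 mix are all illustrative assumptions, not part of any Apple sample. The sketch also assumes the device and pixel format support read-write texture access.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical custom blend, chosen only for illustration: an even mix
// of the incoming color and the color already stored at the pixel.
static float4 customBlend(float4 newColor, float4 oldColor)
{
    return mix(oldColor, newColor, 0.5f);
}

// The raster_order_group(0) attribute asks the GPU to order accesses to
// `target` per pixel, in submission order, across overlapping fragments.
fragment void blendFragment(
    float4 position [[position]],
    texture2d<float, access::read_write> target [[texture(0), raster_order_group(0)]])
{
    // Work that doesn't touch the annotated texture comes first.
    float4 newColor = float4(0.0f, 1.0f, 0.0f, 0.5f);
    uint2 coord = uint2(position.xy);

    // Read-modify-write through the annotated texture. Older overlapping
    // fragment threads finish their accesses before this one proceeds.
    float4 oldColor = target.read(coord);
    target.write(customBlend(newColor, oldColor), coord);
}
```

Without the attribute, the same read and write would be subject to the race described earlier, because nothing would order the two overlapping threads.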

Figure 3 shows how raster order groups synchronize both threads so that thread 2 waits until the write is complete before attempting to read that piece of memory.

Figure 3

Synchronized threads serially reading and writing the same memory

Extended Raster Order Groups with Metal 2 on A11

Metal 2 on A11 extends raster order groups with additional capabilities. First, it allows synchronization of individual channels of an imageblock and threadgroup memory. Second, it allows for the creation of multiple order groups, giving you finer-grained synchronization and minimizing how often your threads wait for access.

An example of where the additional capabilities of raster order groups on the A11 graphics processing unit (GPU) improve performance is deferred shading. Traditionally, deferred shading requires two phases. The first phase fills a G-buffer and produces multiple textures. The second phase consumes those textures and calculates the shading results to render the light volumes, as shown in Figure 4.

Figure 4

Deferred shading implemented in two phases

Because the intermediate textures are written to and read from device memory, deferred shading is bandwidth intensive. The A11 GPU is able to leverage multiple order groups to coalesce both render phases into one, eliminating the need for the intermediate textures. Furthermore, it can keep the G-buffer in tile-sized chunks that remain in local imageblock memory.

To demonstrate how the A11 GPU’s multiple order groups can improve the performance of deferred shading, Figure 5 shows how a traditional GPU schedules threads for the lighting phase. The thread responsible for the second light must wait for all accesses by prior threads to complete before it can begin. This wait forces the two threads to run serially, even if their accesses don’t conflict with each other.

Figure 5

Scheduling threads for a deferred shading lighting phase

Figure 6 shows how multiple order groups allow you to run the nonconflicting reads concurrently, with the two threads synchronizing at the end of execution to accumulate the lights. You achieve this by declaring the three G-buffer fields—albedo, normal, and depth—to be in the first group, and the accumulated lighting result to be in the second group. The A11 GPU is able to order the two groups separately, and outstanding writes into the second group don’t require reads in the first group to wait.

Figure 6

Scheduling threads with raster order groups

With multiple order groups, more threads are eligible to run concurrently, allowing for more parallelism and improved performance.
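The grouping described above, with the G-buffer fields in one group and the accumulated lighting in another, might be declared along the following lines in the Metal Shading Language. This is a sketch under stated assumptions: the struct and function names, the color attachment indices, the `LightParams` layout, and the simple diffuse term are all hypothetical and chosen only to show where the `raster_order_group` attributes go.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical per-light parameters supplied by the app.
struct LightParams {
    float3 direction;   // direction the light travels, assumed normalized
    float3 color;
};

// G-buffer fields share group 0; accumulated lighting is in group 1.
// Attachment indices are illustrative assumptions.
struct GBufferData {
    half4 albedo   [[color(0), raster_order_group(0)]];
    half4 normal   [[color(1), raster_order_group(0)]];
    float depth    [[color(2), raster_order_group(0)]];
    half4 lighting [[color(3), raster_order_group(1)]];
};

// Only the lighting attachment is written, and it stays in group 1.
struct LightingOutput {
    half4 lighting [[color(3), raster_order_group(1)]];
};

fragment LightingOutput accumulateLight(
    GBufferData gBuffer,
    constant LightParams &light [[buffer(0)]])
{
    // Reads of group 0 fields: these don't conflict between lights.
    // Depth is declared for completeness but unused by this simple term.
    half3 n = normalize(gBuffer.normal.xyz);
    half nDotL = saturate(dot(n, half3(-light.direction)));
    half3 contribution = gBuffer.albedo.rgb * half3(light.color) * nDotL;

    // Read-modify-write of the group 1 field: the only access that must
    // be ordered against other lights covering the same pixel.
    LightingOutput out;
    out.lighting = gBuffer.lighting + half4(contribution, 0.0h);
    return out;
}
```

Placing every field in a single group would still be correct, but each light's thread would then wait on all earlier accesses, which is the serial schedule shown in Figure 5.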

See Also

GPU Family 4 Features

About Imageblocks

Learn how imageblocks allow you to define and manipulate custom per-pixel data structures in high-bandwidth tile memory.

About Tile Shading

Learn about combining rendering and compute operations into a single render pass while sharing local memory.

About Enhanced MSAA and Imageblock Sample Coverage Control

Learn about accessing multisample tracking data within a tile shader, enabling development of custom MSAA resolve algorithms, and more.

About Threadgroup Sharing

Learn about the enhanced memory model that allows for flexible and efficient sharing of data between threads.