Data-Parallel Compute Processing: Compute Command Encoder

This chapter explains how to create and use a MTLComputeCommandEncoder object to encode data-parallel compute processing state and commands and submit them for execution on a device.

To perform a data-parallel computation, follow these main steps:

  1. Use a MTLDevice method to create a compute state (MTLComputePipelineState) that contains compiled code from a MTLFunction object, as discussed in Creating a Compute Pipeline State. The MTLFunction object represents a compute function written in the Metal shading language, as described in Functions and Libraries.

  2. Specify the MTLComputePipelineState object to be used by the compute command encoder, as discussed in Specifying a Compute State and Resources for a Compute Command Encoder.

  3. Specify the resources and related objects (MTLBuffer, MTLTexture, and possibly MTLSamplerState) that contain the data to be processed and returned by the compute state, as discussed in Specifying a Compute State and Resources for a Compute Command Encoder. Also set their argument table indices, so that Metal can match each resource to the corresponding argument in the shader code. At any given moment, the MTLComputeCommandEncoder can be associated with a number of resource objects.

  4. Dispatch the compute function a specified number of times, as explained in Executing a Compute Command.

Creating a Compute Pipeline State

A MTLFunction object represents data-parallel code that can be executed by a MTLComputePipelineState object. The MTLComputeCommandEncoder object encodes commands that set arguments and execute the compute function. Because creating a compute pipeline state can require an expensive compilation of Metal shading language code, you can use either a blocking or an asynchronous method to schedule such work in a way that best fits the design of your app.
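The two scheduling options can be sketched as follows. This is a minimal illustration, assuming ARC is enabled; it wraps the blocking newComputePipelineStateWithFunction:error: method and the asynchronous newComputePipelineStateWithFunction:completionHandler: method of MTLDevice. The helper names MakePipelineBlocking and MakePipelineAsync and the done callback are hypothetical and exist only for this example.

```objc
#import <Metal/Metal.h>

// Hypothetical helper: blocks the calling thread until compilation
// finishes, then returns the pipeline state (or nil on failure).
static id<MTLComputePipelineState> MakePipelineBlocking(id<MTLDevice> device,
                                                        id<MTLFunction> function)
{
    NSError *error = nil;
    id<MTLComputePipelineState> state =
        [device newComputePipelineStateWithFunction:function error:&error];
    if (state == nil) {
        NSLog(@"Compute pipeline creation failed: %@", error);
    }
    return state;
}

// Hypothetical helper: returns immediately. Metal invokes the handler
// later, possibly on another thread, with the state or an error.
static void MakePipelineAsync(id<MTLDevice> device,
                              id<MTLFunction> function,
                              void (^done)(id<MTLComputePipelineState> state))
{
    [device newComputePipelineStateWithFunction:function
                              completionHandler:^(id<MTLComputePipelineState> state,
                                                  NSError *error) {
        if (state == nil) {
            NSLog(@"Compute pipeline creation failed: %@", error);
        }
        done(state);  // state is nil if compilation failed
    }];
}
```

The blocking form is usually simpler when pipelines are built during loading. The asynchronous form may suit an app that must keep a thread responsive while shaders compile.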

When you create a MTLComputePipelineState object, you can also choose to create reflection data that reveals details of the compute function and its arguments. The newComputePipelineStateWithFunction:options:reflection:error: and newComputePipelineStateWithFunction:options:completionHandler: methods provide this data. Avoid requesting reflection data if it will not be used. For more information on how to analyze reflection data, see Determining Function Details at Runtime.
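As a hedged sketch of requesting reflection data, the helper below passes MTLPipelineOptionArgumentInfo to newComputePipelineStateWithFunction:options:reflection:error: and logs the name and argument table index of each function argument. The helper name MakePipelineWithReflection is hypothetical, and ARC is assumed. Recent SDKs may flag this option and the arguments property as deprecated, so treat the code as an illustration of the idea rather than the preferred spelling on every OS version.

```objc
#import <Metal/Metal.h>

// Hypothetical helper: builds a pipeline state and logs the arguments of
// the compute function. Skip the reflection request when it is not needed.
static id<MTLComputePipelineState> MakePipelineWithReflection(id<MTLDevice> device,
                                                              id<MTLFunction> function)
{
    NSError *error = nil;
    MTLComputePipelineReflection *reflection = nil;
    id<MTLComputePipelineState> state =
        [device newComputePipelineStateWithFunction:function
                                            options:MTLPipelineOptionArgumentInfo
                                         reflection:&reflection
                                              error:&error];
    if (state == nil) {
        NSLog(@"Compute pipeline creation failed: %@", error);
        return nil;
    }
    // Each MTLArgument describes one argument of the compute function.
    for (MTLArgument *argument in reflection.arguments) {
        NSLog(@"argument %@ uses argument table index %lu",
              argument.name, (unsigned long)argument.index);
    }
    return state;
}
```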

Specifying a Compute State and Resources for a Compute Command Encoder

The setComputePipelineState: method of a MTLComputeCommandEncoder object specifies the state, including a compiled compute shader function, to use for a data-parallel compute pass. At any given moment, a compute command encoder can be associated with only one compute function.

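MTLComputeCommandEncoder methods such as setBuffer:offset:atIndex:, setBuffers:offsets:withRange:, setTexture:atIndex:, setTextures:withRange:, setSamplerState:atIndex:, setSamplerStates:withRange:, and setThreadgroupMemoryLength:atIndex: specify a resource (that is, a buffer, texture, sampler state, or threadgroup memory) that is used as an argument to the compute function represented by the MTLComputePipelineState object.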

Each method assigns one or more resources to the corresponding argument(s), as illustrated in Figure 6-1.

Figure 6-1  Argument Tables for the Compute Command Encoder

The limits for the maximum number of entries in a buffer, texture, or sampler state argument table are listed in the Implementation Limits table.

The limit for the maximum total threadgroup memory allocation is also listed in the Implementation Limits table.
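To make the argument tables concrete, the sketch below assigns one entry in each kind of table: texture, buffer, sampler state, and threadgroup memory. The helper name BindFilterArguments, the chosen indices, and the 16 x 16 float scratch size are assumptions made for illustration; a real compute function determines which indices and sizes are needed.

```objc
#import <Metal/Metal.h>

// Hypothetical helper: fills one slot in each argument table. Every index
// must match the attribute used by the compute function's argument.
static void BindFilterArguments(id<MTLComputeCommandEncoder> encoder,
                                id<MTLTexture> input,
                                id<MTLTexture> output,
                                id<MTLBuffer> params,
                                id<MTLSamplerState> sampler)
{
    // Texture argument table: [[ texture(0) ]] and [[ texture(1) ]]
    [encoder setTexture:input atIndex:0];
    [encoder setTexture:output atIndex:1];

    // Buffer argument table: [[ buffer(0) ]]
    [encoder setBuffer:params offset:0 atIndex:0];

    // Sampler state argument table: [[ sampler(0) ]]
    [encoder setSamplerState:sampler atIndex:0];

    // Threadgroup memory argument table: [[ threadgroup(0) ]].
    // Assumed scratch size: one float per thread of a 16 x 16 threadgroup
    // (1024 bytes, a multiple of 16).
    [encoder setThreadgroupMemoryLength:16 * 16 * sizeof(float) atIndex:0];
}
```

The tables are independent of each other, so texture index 0 and buffer index 0 refer to different arguments.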

Executing a Compute Command

To encode a command to execute a compute function, call the dispatchThreadgroups:threadsPerThreadgroup: method of MTLComputeCommandEncoder and specify the threadgroup dimensions and the number of threadgroups. You can query the threadExecutionWidth and maxTotalThreadsPerThreadgroup properties of MTLComputePipelineState to optimize the execution of the compute function on this device.

The total number of threads in a threadgroup is the product of the components of threadsPerThreadgroup: threadsPerThreadgroup.width * threadsPerThreadgroup.height * threadsPerThreadgroup.depth. The maxTotalThreadsPerThreadgroup property specifies the maximum number of threads that can be in a single threadgroup to execute this compute function on the device.
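The sizing arithmetic above can be sketched as two small helpers. The names ThreadsPerThreadgroup2D and ThreadgroupsToCover are hypothetical. The first picks a 2D threadgroup whose width equals threadExecutionWidth and whose thread count stays within maxTotalThreadsPerThreadgroup; the second rounds up so that the threadgroups cover the whole grid. This is one common choice, not the only valid one.

```objc
#import <Metal/Metal.h>

// Hypothetical helper: width * height * depth never exceeds
// state.maxTotalThreadsPerThreadgroup.
static MTLSize ThreadsPerThreadgroup2D(id<MTLComputePipelineState> state)
{
    NSUInteger width  = state.threadExecutionWidth;
    NSUInteger height = state.maxTotalThreadsPerThreadgroup / width;
    return MTLSizeMake(width, height, 1);
}

// Hypothetical helper: number of threadgroups needed to cover `grid`,
// rounding up in each dimension.
static MTLSize ThreadgroupsToCover(MTLSize grid, MTLSize threadsPerGroup)
{
    return MTLSizeMake(
        (grid.width  + threadsPerGroup.width  - 1) / threadsPerGroup.width,
        (grid.height + threadsPerGroup.height - 1) / threadsPerGroup.height,
        (grid.depth  + threadsPerGroup.depth  - 1) / threadsPerGroup.depth);
}
```

For example, a 100 x 50 grid with 16 x 16 threadgroups needs 7 x 4 threadgroups. Because rounding up launches some threads outside the grid, the compute function would need to check its position against the grid bounds before reading or writing.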

Compute commands are executed in the order in which they are encoded into the command buffer. A compute command finishes execution when all threadgroups associated with the command finish execution and all results are written to memory. Because of this sequencing, the results of a compute command are available to any commands encoded after it in the command buffer.

To end encoding commands for a compute command encoder, call the endEncoding method of MTLComputeCommandEncoder. After ending the previous command encoder, you can create a new command encoder of any type to encode additional commands into the command buffer.
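The ordering and encoder hand-off described above can be sketched as follows. The helper name EncodeTwoPasses and its parameters are hypothetical. It encodes two compute commands that use the same buffer, so the second sees the results of the first, then ends the compute encoder and creates a blit command encoder on the same command buffer to copy the results.

```objc
#import <Metal/Metal.h>

// Hypothetical helper: two dependent compute commands followed by a copy.
static void EncodeTwoPasses(id<MTLCommandBuffer> commandBuffer,
                            id<MTLComputePipelineState> firstPass,
                            id<MTLComputePipelineState> secondPass,
                            id<MTLBuffer> data,
                            id<MTLBuffer> copy,
                            MTLSize numThreadgroups,
                            MTLSize threadsPerGroup)
{
    id<MTLComputeCommandEncoder> computeCE = [commandBuffer computeCommandEncoder];
    [computeCE setBuffer:data offset:0 atIndex:0];

    // First command: writes to `data`.
    [computeCE setComputePipelineState:firstPass];
    [computeCE dispatchThreadgroups:numThreadgroups
              threadsPerThreadgroup:threadsPerGroup];

    // Second command: encoded later, so it can read what the first wrote.
    [computeCE setComputePipelineState:secondPass];
    [computeCE dispatchThreadgroups:numThreadgroups
              threadsPerThreadgroup:threadsPerGroup];

    // The compute encoder must end before another encoder is created.
    [computeCE endEncoding];

    // A new encoder of a different type on the same command buffer.
    id<MTLBlitCommandEncoder> blitCE = [commandBuffer blitCommandEncoder];
    [blitCE copyFromBuffer:data sourceOffset:0
                  toBuffer:copy destinationOffset:0
                      size:data.length];
    [blitCE endEncoding];
}
```

The caller is still responsible for committing the command buffer; this sketch assumes `copy` is at least as long as `data`.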

Code Example: Executing Data-Parallel Functions

Listing 6-1 shows an example that creates and uses a MTLComputeCommandEncoder object to perform the parallel computations of an image transformation on specified data. (This example does not show how the device, library, command queue, and resource objects are created and initialized.) The example creates a command buffer and then uses it to create the MTLComputeCommandEncoder object. Next a MTLFunction object is created that represents the entry point filter_main loaded from the MTLLibrary object, shown in Listing 6-2. Then the function object is used to create a MTLComputePipelineState object called filterState.

The compute function performs an image transformation and filtering operation on the image inputImage, with the results returned in outputImage. First, the setTexture:atIndex: and setBuffer:offset:atIndex: methods assign texture and buffer objects to indices in the specified argument tables. paramsBuffer specifies values used to perform the image transformation, and inputTableData specifies filter weights. The compute function is executed in 2D threadgroups of 16 x 16 threads, with one thread per pixel. The dispatchThreadgroups:threadsPerThreadgroup: method encodes the command to dispatch the threads executing the compute function, and the endEncoding method terminates the MTLComputeCommandEncoder. Finally, the commit method of MTLCommandBuffer causes the commands to be executed as soon as possible.

Listing 6-1  Specifying and Running a Function in a Compute State

id <MTLDevice> device;
id <MTLLibrary> library;
id <MTLCommandQueue> commandQueue;
id <MTLTexture> inputImage;
id <MTLTexture> outputImage;
id <MTLTexture> inputTableData;
id <MTLBuffer> paramsBuffer;
// ... Create and initialize device, library, queue, resources
// Obtain a new command buffer
id <MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
// Create a compute command encoder
id <MTLComputeCommandEncoder> computeCE = [commandBuffer computeCommandEncoder];
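// Load the compute function and build a pipeline state from it
NSError *errors = nil;
id <MTLFunction> func = [library newFunctionWithName:@"filter_main"];
id <MTLComputePipelineState> filterState
              = [device newComputePipelineStateWithFunction:func error:&errors];
if (filterState == nil) {
    NSLog(@"Failed to create compute pipeline state: %@", errors);
}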
[computeCE setComputePipelineState:filterState];
[computeCE setTexture:inputImage atIndex:0];
[computeCE setTexture:outputImage atIndex:1];
[computeCE setTexture:inputTableData atIndex:2];
[computeCE setBuffer:paramsBuffer offset:0 atIndex:0];
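// Use 16 x 16 threads per threadgroup. The integer division below assumes
// the image dimensions are multiples of 16; otherwise the remaining edge
// pixels are not processed.
MTLSize threadsPerGroup = {16, 16, 1};
MTLSize numThreadgroups = {inputImage.width/threadsPerGroup.width,
                           inputImage.height/threadsPerGroup.height, 1};
[computeCE dispatchThreadgroups:numThreadgroups
          threadsPerThreadgroup:threadsPerGroup];
[computeCE endEncoding];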
// Commit the command buffer
[commandBuffer commit];

Listing 6-2 shows the corresponding shader code for the preceding example. (The functions read_and_transform and filter_table are placeholders for user-defined code.)

Listing 6-2  Shading Language Compute Function Declaration

kernel void filter_main(
  texture2d<float,access::read>   inputImage   [[ texture(0) ]],
  texture2d<float,access::write>  outputImage  [[ texture(1) ]],
  uint2 gid                                    [[ thread_position_in_grid ]],
  texture2d<float,access::sample> table        [[ texture(2) ]],
  constant Parameters* params                  [[ buffer(0) ]]
  )
{
  float2 p0          = static_cast<float2>(gid);
  float3x3 transform = params->transform;
  float4   dims      = params->dims;
  float4 v0 = read_and_transform(inputImage, p0, transform);
  float4 v1 = filter_table(v0,table, dims);
  // Store the filtered value at this thread's position in the output image
  outputImage.write(v1, gid);
}