High Kernel Dispatch Overhead in Metal from Swift

I'm implementing a bitonic sort in Metal from a Swift app. This requires hundreds of kernel dispatch calls, one for each of the swap stages. Each stage touches the whole array, but the work required from the GPU is small. I haven't been able to get this to run fast enough in Swift, and it seems to be due to a high overhead for each dispatch command. I rewrote the test program in Objective-C with a super-simple kernel function, and it runs 25x faster from Objective-C!

The kernel function is:

kernel void fill(device uint8_t *array [[buffer(0)]],
                 const device uint32_t &N [[buffer(1)]],
                 const device uint8_t &value [[buffer(2)]],
                 uint i [[thread_position_in_grid]])
{
   if (i < N) {
      array[i] = value;
   }
}

The Swift code is:

func fill(pso:MTLComputePipelineState, buffer:MTLBuffer, N: Int, passes: Int) {
   guard let commandBuffer = commandQueue.makeCommandBuffer() else { return }
   let gridSize = MTLSizeMake(N, 1, 1)
   var threadGroupSize = pso.maxTotalThreadsPerThreadgroup
   if (threadGroupSize > N) {
      threadGroupSize = N;
   }
   let threadgroupSize = MTLSizeMake(threadGroupSize, 1, 1);
   for pass in 0..<passes {
      guard let computeEncoder = commandBuffer.makeComputeCommandEncoder() else { return }
      var value:UInt8 = UInt8(pass);
      var NN:UInt32 = UInt32(N);
      computeEncoder.setComputePipelineState(pso)
      computeEncoder.setBuffer(buffer, offset: 0, index: 0)
      computeEncoder.setBytes(&NN, length: MemoryLayout<UInt32>.size, index: 1)
      computeEncoder.setBytes(&value, length: MemoryLayout<UInt8>.size, index: 2)
      computeEncoder.dispatchThreadgroups(gridSize, threadsPerThreadgroup: threadgroupSize)
      computeEncoder.endEncoding()
   }
   commandBuffer.commit()
   commandBuffer.waitUntilCompleted()
}

let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let commandQueue = device.makeCommandQueue()!
let funcFill = library.makeFunction(name: "fill")!
let pso = try? device.makeComputePipelineState(function: funcFill)

var N = 16384
let passes = 100
let buffer = device.makeBuffer(length:N, options: [.storageModePrivate])!

for _ in 1...10 {
   let startTime = DispatchTime.now()
   fill(pso:pso!, buffer:buffer, N:N, passes:passes)
   let endTime = DispatchTime.now()
   let elapsedTime = endTime.uptimeNanoseconds - startTime.uptimeNanoseconds
   print("Elapsed time:", Float(elapsedTime)/1_000_000, "ms");
}

and the Objective-C code (which should be almost identical) is:

void fill(id<MTLCommandQueue> commandQueue,
          id<MTLComputePipelineState> funcPSO,
          id<MTLBuffer> A,
          uint32_t N,
          int passes) {
  id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
  MTLSize gridSize = MTLSizeMake(N, 1, 1);
  NSUInteger threadGroupSize = funcPSO.maxTotalThreadsPerThreadgroup;
  if (threadGroupSize > N) {
      threadGroupSize = N;
  }
  MTLSize threadgroupSize = MTLSizeMake(threadGroupSize, 1, 1);
  for(uint8_t pass=0; pass<passes; pass++)
  {
      id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
      [computeEncoder setComputePipelineState:funcPSO];
      [computeEncoder setBuffer:A offset:0 atIndex:0];
      [computeEncoder setBytes:&N length:sizeof(uint32_t) atIndex:1];
      [computeEncoder setBytes:&pass length:sizeof(uint8_t) atIndex:2];
      [computeEncoder dispatchThreads:gridSize threadsPerThreadgroup:threadgroupSize];
      [computeEncoder endEncoding];
  }
  [commandBuffer commit];
  [commandBuffer waitUntilCompleted];
}


int main() {
  NSError *error;
  id<MTLDevice> device = MTLCreateSystemDefaultDevice();
  id<MTLLibrary> library = [device newDefaultLibrary];
  id<MTLCommandQueue> commandQueue = [device newCommandQueue];

  id<MTLFunction> funcFill = [library newFunctionWithName:@"fill"];
  id<MTLComputePipelineState> pso = [device newComputePipelineStateWithFunction:funcFill error:&error];

  // Prepare data
  int N = 16384;
  int passes = 100;
  id<MTLBuffer> bufferA = [device newBufferWithLength:N options:MTLResourceStorageModePrivate];

  for(int it=1; it<=10; it++)
  {
    CFTimeInterval startTime = CFAbsoluteTimeGetCurrent();
    fill(commandQueue, pso, bufferA, N, passes);
    CFTimeInterval duration = CFAbsoluteTimeGetCurrent() - startTime;
    NSLog(@"Elapsed time: %.1f ms", 1000*duration);
  }
}

The Swift output is:

Elapsed time: 89.35556 ms
Elapsed time: 63.243744 ms
Elapsed time: 62.39568 ms
Elapsed time: 62.183224 ms
Elapsed time: 63.741913 ms
Elapsed time: 63.59463 ms
Elapsed time: 62.378654 ms
Elapsed time: 61.746098 ms
Elapsed time: 61.530384 ms
Elapsed time: 60.88774 ms

The Objective-C output is:

2024-04-18 19:27:45.704 compute_test[3489:92754] Elapsed time: 3.6 ms
2024-04-18 19:27:45.706 compute_test[3489:92754] Elapsed time: 2.6 ms
2024-04-18 19:27:45.709 compute_test[3489:92754] Elapsed time: 2.6 ms
2024-04-18 19:27:45.712 compute_test[3489:92754] Elapsed time: 2.6 ms
2024-04-18 19:27:45.714 compute_test[3489:92754] Elapsed time: 2.7 ms
2024-04-18 19:27:45.717 compute_test[3489:92754] Elapsed time: 2.8 ms
2024-04-18 19:27:45.720 compute_test[3489:92754] Elapsed time: 2.8 ms
2024-04-18 19:27:45.723 compute_test[3489:92754] Elapsed time: 2.7 ms
2024-04-18 19:27:45.726 compute_test[3489:92754] Elapsed time: 2.5 ms
2024-04-18 19:27:45.728 compute_test[3489:92754] Elapsed time: 2.5 ms

I compile the Swift code for Release, optimised for speed. I can't believe there should be a difference here, so what could be different, and what might I be doing wrong? Thanks, Adrian

Just out of curiosity: why do you pass commandQueue as an argument to fill in the Objective-C version but not in Swift (even though that shouldn't make a big difference)?

Well spotted. That was a silly difference I meant to fix: the Swift version uses a global. It shouldn't matter, but sorry for the distraction.

Ok, I solved it myself. I knew posting it would help somehow :-) Here is some help for future developers who might get confused in the same way. There is a difference in the code:

computeEncoder.dispatchThreadgroups(gridSize, threadsPerThreadgroup: threadgroupSize)

vs

[computeEncoder dispatchThreads:gridSize threadsPerThreadgroup:threadgroupSize];

I hadn't taken care to fully understand the difference between these functions. They look similar, and the arguments are all MTLSize, but they have very different meanings. I wrongly thought the only difference was whether the threadgroups must align neatly with the grid, or may extend beyond the grid boundary when the grid size isn't an exact multiple of the threadgroup size, but that isn't the only difference. (You should use dispatchThreads only if your device supports non-uniform threadgroup sizes.)
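For devices without non-uniform threadgroup support, the usual pattern with dispatchThreadgroups is to pass the number of threadgroups, rounded up so the groups cover the whole grid, and let the kernel's `i < N` guard skip the padding threads in the last group. A minimal sketch of that rounding (`threadgroupCount` is a hypothetical helper; the commented-out encoder call assumes the pipeline state and encoder from the Swift code above):

```swift
// Hypothetical helper: how many threadgroups are needed to cover a 1-D grid
// of `threads` threads when each threadgroup holds `threadsPerGroup` threads.
// dispatchThreadgroups takes this COUNT, not the thread total; the kernel's
// `i < N` guard handles the padding threads in the final, partial group.
func threadgroupCount(threads: Int, threadsPerGroup: Int) -> Int {
    return (threads + threadsPerGroup - 1) / threadsPerGroup  // ceiling division
}

// The encoder call would then look like (needs a Metal device to actually run):
// let groups = MTLSizeMake(threadgroupCount(threads: N, threadsPerGroup: threadGroupSize), 1, 1)
// computeEncoder.dispatchThreadgroups(groups, threadsPerThreadgroup: threadgroupSize)

print(threadgroupCount(threads: 16384, threadsPerGroup: 1024))  // 16
print(threadgroupCount(threads: 16385, threadsPerGroup: 1024))  // 17
```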

The crucial difference is that what I called gridSize doesn't mean the same thing in both calls. In dispatchThreadgroups it is the number of threadgroups in the grid, while in dispatchThreads it is the number of threads in the grid. In the former case, where on my GPU maxTotalThreadsPerThreadgroup = 1024, I effectively had

 gridSize = MTLSize(16384, 1, 1)
 threadsPerThreadgroup = MTLSize(1024, 1, 1)

so dispatchThreadgroups launched 16384 threadgroups of 1024 threads each, about 16.7 million threads, instead of the 16384 threads I intended. The kernel's i < N guard means the extra threads do no useful work, but they still have to be scheduled, which explains the slowdown. I checked the results in the array with some extra code, and I get the same results in all cases; the difference is just efficiency, not what work is done.
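To put numbers on it: with N = 16384 and 1024 threads per threadgroup, as in the Swift program above, passing the thread count to dispatchThreadgroups multiplies the launched work by the threadgroup size. A quick sanity check of the arithmetic (plain Swift, no Metal device needed):

```swift
let N = 16384
let threadsPerGroup = 1024

// dispatchThreads: the first MTLSize is the number of THREADS in the grid.
let threadsLaunchedByDispatchThreads = N                          // 16_384

// dispatchThreadgroups: the first MTLSize is the number of THREADGROUPS,
// so passing the same value launches N whole threadgroups of 1024 threads.
let threadsLaunchedByDispatchThreadgroups = N * threadsPerGroup   // 16_777_216

// Ratio of scheduled threads between the two calls.
print(threadsLaunchedByDispatchThreadgroups / threadsLaunchedByDispatchThreads)  // 1024
```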

Happy coding!

See https://developer.apple.com/documentation/metal/mtlcomputecommandencoder/2866532-dispatchthreads?language=objc and https://developer.apple.com/documentation/metal/mtlcomputecommandencoder/1443138-dispatchthreadgroups?language=objc
