I'm implementing a bitonic sort in Metal with a Swift app. This requires 100's kernel dispatch calls for each of the swap stages which touch the whole array, the work required by the GPU is small. I haven't been able to get this to run fast enough in Swift and it seems its due to a high overhead for each dispatchThread command. I rewrote the test program in Objective C with a super-simple kernel function and its runs 25x faster from Objective C!
Kernel function
kernel void fill(device uint8_t *array [[buffer(0)]],
                 const device uint32_t &N [[buffer(1)]],
                 const device uint8_t &value [[buffer(2)]],
                 uint i [[thread_position_in_grid]])
{
   if (i < N) {
      array[i] = value;
   }
}
The Swift code is:
func fill(pso:MTLComputePipelineState, buffer:MTLBuffer, N: Int, passes: Int) {
   guard let commandBuffer = commandQueue.makeCommandBuffer() else { return }
   let gridSize = MTLSizeMake(N, 1, 1)
   var threadGroupSize = pso.maxTotalThreadsPerThreadgroup
   if (threadGroupSize > N) {
      threadGroupSize = N;
   }
   let threadgroupSize = MTLSizeMake(threadGroupSize, 1, 1);
   for pass in 0..<passes {
      guard let computeEncoder = commandBuffer.makeComputeCommandEncoder() else { return }
      var value:UInt8 = UInt8(pass);
      var NN:UInt32 = UInt32(N);
      computeEncoder.setComputePipelineState(pso)
      computeEncoder.setBuffer(buffer, offset: 0, index: 0)
      computeEncoder.setBytes(&NN, length: MemoryLayout<UInt32>.size, index: 1)
      computeEncoder.setBytes(&value, length: MemoryLayout<UInt8>.size, index: 2)
      computeEncoder.dispatchThreadgroups(gridSize, threadsPerThreadgroup: threadgroupSize)
      computeEncoder.endEncoding()
   }
   commandBuffer.commit()
   commandBuffer.waitUntilCompleted()
}
let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let commandQueue = device.makeCommandQueue()!
let funcFill = library.makeFunction(name: "fill")!
let pso = try? device.makeComputePipelineState(function: funcFill)
var N = 16384
let passes = 100
let buffer = device.makeBuffer(length:N, options: [.storageModePrivate])!
for _ in 1...10 {
   let startTime = DispatchTime.now()
   fill(pso:pso!, buffer:buffer, N:N, passes:passes)
   let endTime = DispatchTime.now()
   let elapsedTime = endTime.uptimeNanoseconds - startTime.uptimeNanoseconds
   print("Elapsed time:", Float(elapsedTime)/1_000_000, "ms");
}
and the Objective C code (which should be almost identical) is
void fill(id<MTLCommandQueue> commandQueue,
          id<MTLComputePipelineState> funcPSO,
          id<MTLBuffer> A,
          uint32_t N,
          int passes) {
  id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
  MTLSize gridSize = MTLSizeMake(N, 1, 1);
  NSUInteger threadGroupSize = funcPSO.maxTotalThreadsPerThreadgroup;
  if (threadGroupSize > N) {
      threadGroupSize = N;
  }
  MTLSize threadgroupSize = MTLSizeMake(threadGroupSize, 1, 1);
  for(uint8_t pass=0; pass<passes; pass++)
  {
      id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
      [computeEncoder setComputePipelineState:funcPSO];
      [computeEncoder setBuffer:A offset:0 atIndex:0];
      [computeEncoder setBytes:&N length:sizeof(uint32_t) atIndex:1];
      [computeEncoder setBytes:&pass length:sizeof(uint8_t) atIndex:2];
      [computeEncoder dispatchThreads:gridSize threadsPerThreadgroup:threadgroupSize];
      [computeEncoder endEncoding];
  }
  [commandBuffer commit];
  [commandBuffer waitUntilCompleted];
}
int main() {
  NSError *error;
  id<MTLDevice> device = MTLCreateSystemDefaultDevice();
  id<MTLLibrary> library = [device newDefaultLibrary];
  id<MTLCommandQueue> commandQueue = [device newCommandQueue];
  id<MTLFunction> funcFill = [library newFunctionWithName:@"fill"];
  id<MTLComputePipelineState> pso = [device newComputePipelineStateWithFunction:funcFill error:&error];
  // Prepare data
  int N = 16384;
  int passes = 100;
  id<MTLBuffer> bufferA = [device newBufferWithLength:N options:MTLResourceStorageModePrivate];
  for(int it=1; it<=10; it++)
  {
    CFTimeInterval startTime = CFAbsoluteTimeGetCurrent();
    fill(commandQueue, pso, bufferA, N, passes);
    CFTimeInterval duration = CFAbsoluteTimeGetCurrent() - startTime;
    NSLog(@"Elapsed time: %.1f ms", 1000*duration);
  }
}
The Swift output is:
Elapsed time: 89.35556 ms
Elapsed time: 63.243744 ms
Elapsed time: 62.39568 ms
Elapsed time: 62.183224 ms
Elapsed time: 63.741913 ms
Elapsed time: 63.59463 ms
Elapsed time: 62.378654 ms
Elapsed time: 61.746098 ms
Elapsed time: 61.530384 ms
Elapsed time: 60.88774 ms
The objective C output is
2024-04-18 19:27:45.704 compute_test[3489:92754] Elapsed time: 3.6 ms
2024-04-18 19:27:45.706 compute_test[3489:92754] Elapsed time: 2.6 ms
2024-04-18 19:27:45.709 compute_test[3489:92754] Elapsed time: 2.6 ms
2024-04-18 19:27:45.712 compute_test[3489:92754] Elapsed time: 2.6 ms
2024-04-18 19:27:45.714 compute_test[3489:92754] Elapsed time: 2.7 ms
2024-04-18 19:27:45.717 compute_test[3489:92754] Elapsed time: 2.8 ms
2024-04-18 19:27:45.720 compute_test[3489:92754] Elapsed time: 2.8 ms
2024-04-18 19:27:45.723 compute_test[3489:92754] Elapsed time: 2.7 ms
2024-04-18 19:27:45.726 compute_test[3489:92754] Elapsed time: 2.5 ms
2024-04-18 19:27:45.728 compute_test[3489:92754] Elapsed time: 2.5 ms
I compile the Swift code for Release, optimised for speed. I can't believe there should be a difference here, so what could be different, and what might I be doing wrong? thanks Adrian
