Metal and low performance with parallel execution of kernels (MTLComputeCommandEncoder)

Hello All,

I have CUDA code in which I can create several CUDA streams and run my kernels in parallel, which gives a performance boost for my task. I then rewrote the code for Metal and tried to parallelize the task in the same way.

[Screenshot: CUDA Streams]

Metal device: Mac Studio with M1 Ultra (the code is written with metal-cpp).

I create several MTLCommandBuffers in one MTLCommandQueue, or several MTLCommandQueues each with multiple MTLCommandBuffers.

Regarding Metal resources, there are two options:

  1. Buffers (MTLBuffer) created with the option MTLResourceStorageModeShared. In the profiler, all command buffers execute sequentially on the Compute timeline.

  2. Buffers (MTLBuffer) created with the options MTLResourceStorageModeShared | MTLResourceHazardTrackingModeUntracked. In the profiler I do see parallelism, but the maximum number of concurrent tracks on the Compute timeline is never more than 2 (see pictures). That also seems strange.
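For reference, here is a minimal metal-cpp sketch of the second variant (untracked buffers, one command buffer per piece of work). The kernel pipeline, buffer size, and grid dimensions are placeholders for my real ones:

```cpp
#include <Metal/Metal.hpp>  // metal-cpp
#include <vector>

// Sketch of variant 2: shared + untracked buffers, several command buffers
// committed to one queue. Buffer size and grid size are placeholders.
void runVariant2(MTL::Device* device, MTL::ComputePipelineState* pipeline)
{
    const NS::UInteger kNumBuffers = 4;
    const NS::UInteger kLength     = 1 << 20;  // placeholder size in bytes

    MTL::CommandQueue* queue = device->newCommandQueue();

    std::vector<MTL::CommandBuffer*> cmdBufs;
    for (NS::UInteger i = 0; i < kNumBuffers; ++i) {
        // Untracked hazard tracking: Metal no longer serializes command
        // buffers to protect this resource; correctness is on me.
        MTL::Buffer* buf = device->newBuffer(
            kLength,
            MTL::ResourceStorageModeShared | MTL::ResourceHazardTrackingModeUntracked);

        MTL::CommandBuffer* cmd = queue->commandBuffer();
        MTL::ComputeCommandEncoder* enc =
            cmd->computeCommandEncoder(MTL::DispatchTypeConcurrent);
        enc->setComputePipelineState(pipeline);
        enc->setBuffer(buf, 0, 0);
        enc->dispatchThreads(
            MTL::Size(kLength / 4, 1, 1),  // placeholder: one thread per float
            MTL::Size(pipeline->maxTotalThreadsPerThreadgroup(), 1, 1));
        enc->endEncoding();
        cmdBufs.push_back(cmd);
    }

    // Commit everything first, then wait, so the GPU is free to overlap.
    for (auto* cmd : cmdBufs) cmd->commit();
    for (auto* cmd : cmdBufs) cmd->waitUntilCompleted();
}
```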

The compute commands do not depend on each other.

[Screenshot: METAL Compute timeline]

About performance:

[1] In the first variant, performance is the same regardless of the number of MTLCommandQueues and MTLCommandBuffers.

[2] In the second variant, performance with one MTLCommandBuffer is better than with two or more.

Question: why is this happening? How can I parallelize the work of the compute kernels to increase performance?

Additional information: I have also rewritten the CUDA code in OpenCL, and it parallelizes perfectly on Windows (NVIDIA/AMD/Intel) when several OpenCL queues are running. The same code on the M1 Ultra performs identically with one queue or with many. Since Metal is faster than OpenCL on this hardware, I am focusing on Metal and trying to make the kernels run in parallel there.

Replies

Are you creating the compute encoder with the computeCommandEncoder(MTL::DispatchType dispatchType) method with DispatchTypeConcurrent?

Yes, I tried both MTL::DispatchType::DispatchTypeSerial and MTL::DispatchType::DispatchTypeConcurrent; it makes no difference, the behavior does not change. I currently use MTL::DispatchType::DispatchTypeConcurrent.

If you are looking for better performance, I highly recommend this session: https://developer.apple.com/videos/play/wwdc2020/10603. It describes how to analyze the bottlenecks in your shaders and what to do about them.

In general, the motivation for having multiple dispatches run in parallel is that you believe there is some latency in one dispatch that can be hidden by work in another. Do you have that expectation here? For example, if you are bandwidth-bound, it can be beneficial that while one dispatch waits on a read to come back from memory, another dispatch is performing math.
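If the goal is only to let independent dispatches overlap, it may be enough to encode them into a single concurrent encoder in one command buffer, rather than juggling multiple command buffers and untracked resources. A sketch, assuming the two pipelines and buffers are created elsewhere:

```cpp
#include <Metal/Metal.hpp>  // metal-cpp

// Two independent dispatches in ONE concurrent compute encoder: the GPU is
// allowed to overlap them, since DispatchTypeConcurrent inserts no implicit
// barriers between dispatches. Pipelines and buffers are assumed given.
void encodeConcurrent(MTL::CommandQueue* queue,
                      MTL::ComputePipelineState* pipelineA,
                      MTL::ComputePipelineState* pipelineB,
                      MTL::Buffer* bufA, MTL::Buffer* bufB)
{
    MTL::CommandBuffer* cmd = queue->commandBuffer();
    MTL::ComputeCommandEncoder* enc =
        cmd->computeCommandEncoder(MTL::DispatchTypeConcurrent);

    enc->setComputePipelineState(pipelineA);
    enc->setBuffer(bufA, 0, 0);
    enc->dispatchThreads(MTL::Size(bufA->length() / 4, 1, 1),  // placeholder grid
                         MTL::Size(256, 1, 1));

    // No barrier between the two dispatches; they must not touch
    // overlapping memory.
    enc->setComputePipelineState(pipelineB);
    enc->setBuffer(bufB, 0, 0);
    enc->dispatchThreads(MTL::Size(bufB->length() / 4, 1, 1),
                         MTL::Size(256, 1, 1));

    enc->endEncoding();
    cmd->commit();
    cmd->waitUntilCompleted();
}
```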

In my experience on Apple GPUs, the cases where concurrent dispatches improve overall compute performance are rare. The other techniques mentioned in the WWDC talk will help you understand where the opportunities are to improve your performance on our hardware.

Thanks, I will investigate and profile.

Do you perhaps know why the Compute timeline shows a maximum of two tracks of running command buffers? Is that by design, or is there another reason?