Max M1 silicon compute threadgroup size

I'm playing with a library that generates OpenCL code (coriander). I'm trying to launch a compute kernel, but I can't seem to get anything bigger than 256 threads per workgroup.

Can anyone confirm this is a hardware limitation? I can't find any info in the Metal feature table for M1.

I'd like to know if this is actually the max threadgroup size, or if there is an issue with the OpenCL drivers or the library doing the translation.

Thanks in advance.

Replies

The maximum on M1 is 1024. You can query a given pipeline with [MTLComputePipelineState maxTotalThreadsPerThreadgroup], or you can override it by setting maxTotalThreadsPerThreadgroup on MTLComputePipelineDescriptor when you make your pipeline. But if Metal is enforcing 256 by default for a given kernel, you aren't likely to do better than it, because it's calculating how much shared memory and how many registers the kernel uses, and picking the threadgroup size that achieves maximum occupancy.

Here for more info:
https://developer.apple.com/documentation/metal/mtlcomputepipelinedescriptor/2966560-maxtotalthreadsperthreadgroup

So bottom line, if Metal is saying 256 for a given kernel, you will likely just want to use 256. If you must go higher, then you have to use a pipeline descriptor and set the property there when you instantiate the pipeline.
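
To make the "just use 256" advice concrete, here is a minimal C sketch of the launch arithmetic only. It does not call Metal or OpenCL: plan_launch and launch_config are hypothetical names, and kernel_max is assumed to be whatever limit the pipeline or kernel reported (for example the value read from maxTotalThreadsPerThreadgroup).

```c
#include <stddef.h>

/* Hypothetical result type: how many threads go in each group,
 * and how many groups are needed to cover the whole problem. */
typedef struct {
    size_t threads_per_group;
    size_t group_count;
} launch_config;

/* Clamp the requested group size to the limit reported for the kernel,
 * then round the number of groups up so every thread is covered.
 * Returns {0, 0} if any input is zero. */
static launch_config plan_launch(size_t total_threads,
                                 size_t requested,
                                 size_t kernel_max)
{
    launch_config cfg = {0, 0};
    if (total_threads == 0 || requested == 0 || kernel_max == 0)
        return cfg;

    cfg.threads_per_group = requested < kernel_max ? requested : kernel_max;
    /* Ceiling division: the last group may be only partially used,
     * so the kernel still needs its own bounds check. */
    cfg.group_count = (total_threads + cfg.threads_per_group - 1)
                      / cfg.threads_per_group;
    return cfg;
}
```

With a reported limit of 256, asking for 1024 threads per group over 1000 threads comes out as 4 groups of 256.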

Good luck!
p.s. I was assuming you were converting the OpenCL into Metal because it's a Metal forum. If you want to do it in OpenCL, it is a different answer, obviously.

So you are correct, I am technically using OpenCL. I shouldn't have posted here. From a little bit more investigation, it seems that OpenCL exposes 256 as the max group size. Must be hard-coded in the driver.

That would be strange. Are you saying that CL_DEVICE_MAX_WORK_GROUP_SIZE comes back as 256? That's got to be a bug. Or is that coming back from CL_KERNEL_WORK_GROUP_SIZE?

If it's coming back as just the kernel limit (meaning device max > kernel max), I vaguely remember that you had to set an attribute on the kernel at compile time, so something like __attribute__((reqd_work_group_size(1024, 1, 1))) would override the default (or fail).


I checked myself:

  CL_DEVICE_MAX_WORK_ITEM_SIZES: 256 / 256 / 256 
  CL_DEVICE_MAX_WORK_GROUP_SIZE: 256

You're totally right.
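
For anyone hitting the same numbers, here is a minimal C sketch of what those two limits mean for a launch: each dimension of the local size has to fit within CL_DEVICE_MAX_WORK_ITEM_SIZES, and the product of all three has to fit within CL_DEVICE_MAX_WORK_GROUP_SIZE. The function name local_size_fits is hypothetical, and the limits are passed in as plain values rather than queried from the driver.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if a 3D local work size respects both reported limits:
 * - local[i] <= max_item_sizes[i] for every dimension, and
 * - local[0] * local[1] * local[2] <= max_group_size. */
static bool local_size_fits(const size_t local[3],
                            const size_t max_item_sizes[3],
                            size_t max_group_size)
{
    size_t total = 1;
    for (int i = 0; i < 3; ++i) {
        if (local[i] == 0 || local[i] > max_item_sizes[i])
            return false;
        /* Compare by division so the running product cannot overflow. */
        if (local[i] > max_group_size / total)
            return false;
        total *= local[i];
    }
    return true;
}
```

With the values reported above (256 / 256 / 256 and 256), a local size of 256x1x1 or 16x16x1 fits, while 32x32x1 or 1024x1x1 does not.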
Thanks for checking. Good to know I'm not crazy!
This is not a hardware limitation. This does indeed look like a bug.

Please create a request via Feedback Assistant to fix this. Thanks.
FB9017493

thank you!
Thanks everyone, much appreciated. :)