I have set up a simple mesh render pipeline, where I dispatch [N, 1, 1] object threadgroups, where N is the number of instances of a mesh to render. There is only one thread per object threadgroup, and each object threadgroup dispatches a single mesh threadgroup. Each mesh threadgroup passes an entire mesh instance to the rasterizer.
struct ObjectPayload {
float3 position;
};
[[object, max_total_threads_per_threadgroup(1), max_total_threadgroups_per_mesh_grid(1)]]
void object_shader(
object_data ObjectPayload& payload [[payload]],
const device float3* positions [[buffer(0)]],
uint16_t threadgroup_index [[threadgroup_position_in_grid]],
mesh_grid_properties mgp
) {
payload.position = positions[threadgroup_index];
mgp.set_threadgroups_per_grid(uint3(1, 1, 1));
}
[[mesh]]
void mesh_shader(
metal::mesh<...> output,
const object_data ObjectPayload& payload [[payload]],
uint16_t thread_index [[thread_index_in_threadgroup]],
...,
) {
// set primitives, vertices, indices, etc based on payload.position
}
The issue I am having is every 2048 object threadgroups appear to be rendering the exact same position. I.e. even though the positions buffer contains unique positions for indices 0 and 2048, 1 and 2049, 2 and 2050, etc, the payload received by the mesh shader is identical for object threadgroup positions 0 and 2048, 1 and 2049, etc.
Logically, it's as if threadgroup_index in the object shader is being modulated by 2048. However, I have used xcode metal frame capture to confirm that positions is being read correctly in the object shader: the positions in the buffer are unique, threadgroup_index is the expected value, and the full range of positions is accessible.
This leads me to believe that somehow the payload is being shared/overwritten by every N % 2048 object threadgroups? Though this would be a Metal bug and unexpected.
I have confirmed that dispatching a drawMeshThreadgroups draw call of size [2048, 1, 1] in a loop (and binding the positions buffer with the appropriate offset) until all instances have been drawn achieves the correct result, but it is my understanding that there should be no threadgroup size limit when dispatching object shaders.
I am using a MacBook Pro 2020 14" M1 Pro.