Metal mesh shader payload aliasing?

I have set up a simple mesh render pipeline, where I dispatch [N, 1, 1] object threadgroups, where N is the number of instances of a mesh to render. There is only one thread per object threadgroup, and each object threadgroup dispatches a single mesh threadgroup. Each mesh threadgroup passes an entire mesh instance to the rasterizer.

struct ObjectPayload {
    float3 position;
};

[[object, max_total_threads_per_threadgroup(1), max_total_threadgroups_per_mesh_grid(1)]]
void object_shader(
    object_data ObjectPayload& payload [[payload]],
    const device float3* positions [[buffer(0)]],
    uint16_t threadgroup_index [[threadgroup_position_in_grid]],
    mesh_grid_properties mgp
) {
    payload.position = positions[threadgroup_index];
    mgp.set_threadgroups_per_grid(uint3(1, 1, 1));
}

[[mesh]]
void mesh_shader(
    metal::mesh<...> output,
    const object_data ObjectPayload& payload [[payload]],
    uint16_t thread_index [[thread_index_in_threadgroup]],
    ...,
) {
    // set primitives, vertices, indices, etc based on payload.position
}

The issue I am having is every 2048 object threadgroups appear to be rendering the exact same position. I.e. even though the positions buffer contains unique positions for indices 0 and 2048, 1 and 2049, 2 and 2050, etc, the payload received by the mesh shader is identical for object threadgroup positions 0 and 2048, 1 and 2049, etc.

Logically, it's as if threadgroup_index in the object shader is being modulated by 2048. However, I have used xcode metal frame capture to confirm that positions is being read correctly in the object shader: the positions in the buffer are unique, threadgroup_index is the expected value, and the full range of positions is accessible.

This leads me to believe that somehow the payload is being shared/overwritten by every N % 2048 object threadgroups? Though this would be a Metal bug and unexpected.

I have confirmed that dispatching a drawMeshThreadgroups draw call of size [2048, 1, 1] in a loop (and binding the positions buffer with the appropriate offset) until all instances have been drawn achieves the correct result, but it is my understanding that there should be no threadgroup size limit when dispatching object shaders.

I am using a MacBook Pro 2020 14" M1 Pro.

I've further simplified the setup by removing the positions buffer and generating the position directly from the threadgroup index:

[[object, max_total_threads_per_threadgroup(1), max_total_threadgroups_per_mesh_grid(1)]]
void object_shader(
    object_data ObjectPayload& payload [[payload]],
    uint16_t threadgroup_index [[threadgroup_position_in_grid]],
    mesh_grid_properties mgp
) {
    payload.position = float3(threadgroup_index, 0, 0);
    mgp.set_threadgroups_per_grid(uint3(1, 1, 1));
}

With this setup, only the first 2048 instances appear to be rendered, because the positions begin to alias after 2048 instances.

@somenet, I am probably having the same issue as you. I am trying to test the limits of how many quads I can render in a single drawMeshThreadgroups call in a MacBook M2 Max with 32GB of RAM.

[render_command_encoder_metal drawMeshThreadgroups:threadgroups_per_grid
                       threadsPerObjectThreadgroup:threads_per_object_threadgroup
                         threadsPerMeshThreadgroup:threads_per_mesh_threadgroup];

If I set a grid of 64 x 32 (2048) object thread groups to draw a quad each I get the perfect expected chess pattern (see left side of image below). If I set anything that goes over 2048 thread groups, like for example a grid of 64 x 64 thread groups to draw a quad each, I get only the first 2048 thread groups to do anything.

I can push much more quads by using each object thread group to generate a bunch of mesh thread groups and get each mesh thread group to draw a quad. See image below for an example of a 64 x 32 grid of object thread groups generating a 2 x 2 grid of mesh thread groups and a 4 x 2 grid of mesh thread groups (each mesh thread group generates a quad).

I even pushed it all the way to the 32 x 32 mesh thread groups to test the documented limit of "1024 mesh thread groups / object thread group" on the grid of 64 x 32 object thread groups to get 2097152 quads (4194304 triangle primitives) and it works fine.

Any pointers on more Metal mesh shader examples or discussions and possibly limits for grid sizes of object thread groups per drawMeshThreadgroups call would be appreciated.

Metal mesh shader payload aliasing?
 
 
Q