In a compute ("kernel") function, is there an implicit ordering to threadgroup execution?
Say I have a 32xN buffer. Take the following trivial kernel (ignoring boundary conditions):
kernel void example(device uint *buffer [[ buffer(0) ]],
                    const uint2 pos [[ thread_position_in_grid ]]) {
     uint index = pos.y * N + pos.x;
     buffer[index - 1] = buffer[index];
     buffer[index + 1] = 0;
}
If we dispatch that kernel in N 32x1 threadgroups, is there any way to guarantee that the threadgroups are executed left to right, i.e. that the resulting buffer is all 0 (once again, ignoring the boundary conditions)?
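To illustrate why the order matters for this kernel, here is a rough CPU-side sketch in plain C++ (not Metal). It is only an assumption-laden model: it runs one emulated thread at a time over a flat buffer, so it says nothing about what the GPU actually guarantees, and it ignores concurrency inside a threadgroup. The names `run_thread` and `simulate` are hypothetical helpers made up for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Emulates the kernel body for a single linear index i.
// Out-of-range accesses are skipped; these are the boundary
// conditions the question ignores.
void run_thread(std::vector<unsigned>& buf, std::size_t i) {
    if (i >= 1) buf[i - 1] = buf[i];
    if (i + 1 < buf.size()) buf[i + 1] = 0;
}

// Runs one emulated thread per index, serially, in the given order.
// Every element starts at 1; the final buffer is returned.
std::vector<unsigned> simulate(const std::vector<std::size_t>& order) {
    std::vector<unsigned> buf(order.size(), 1u);
    for (std::size_t i : order) run_thread(buf, i);
    return buf;
}
```

In this model, running indices in the order 0, 1, 2, 3 leaves the buffer all 0, while the order 0, 2, 1, 3 leaves 1, 1, 0, 0. So the all-zero result depends on the threads running in increasing index order.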
Thanks