When is a `simdgroup_barrier()` required?

Metal offers both `threadgroup_barrier()` and `simdgroup_barrier()`. I understand the need for threadgroup barriers: it would not be possible to rely on correct cooperation between threads in a threadgroup without them, since different threads can execute on different SIMD partitions at different times. But I don't really get `simdgroup_barrier()`. It was my impression that all threads in a simdgroup execute in lockstep, and thus if one thread in a simdgroup makes progress, all other active threads in the simdgroup are also guaranteed to make progress. If this were not the case, we'd need to insert a simdgroup barrier pretty much any time we read or write any storage or perform SIMD-scoped operations. Apple doesn't seem to use `simdgroup_barrier()` in any of their sample code; in fact, it appears to be a no-op on current Apple Silicon hardware.
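
For concreteness, here is a minimal sketch of the kind of pattern I have in mind (the kernel and buffer names are hypothetical, not taken from Apple's samples): lanes of one simdgroup exchange values through threadgroup memory, and the question is whether the barrier between the write and the read is ever actually needed.

```metal
#include <metal_stdlib>
using namespace metal;

// Each lane writes its partial value to threadgroup memory and then reads a
// neighbouring lane's value. If lockstep execution were NOT guaranteed, the
// simdgroup_barrier() between the write and the read would be mandatory;
// with lockstep it looks redundant.
kernel void partial_sum(device const float *src     [[buffer(0)]],
                        device float       *dst     [[buffer(1)]],
                        threadgroup float  *scratch [[threadgroup(0)]],
                        uint lane [[thread_index_in_simdgroup]],
                        uint gid  [[thread_position_in_grid]])
{
    scratch[lane] = src[gid];

    // Is this barrier required, or is it a no-op on current hardware?
    simdgroup_barrier(mem_flags::mem_threadgroup);

    float neighbour = scratch[lane ^ 1];  // value written by another lane
    dst[gid] = scratch[lane] + neighbour;
}
```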

Is there a situation where I need to use simdgroup barriers, or is this a superfluous operation?

P.S. It seems that Apple engineers are as confused by this as I am; see https://github.com/ml-explore/mlx/blame/1f6ab6a556045961c639735efceebbee7cce814d/mlx/backend/metal/kernels/scan.metal#L355

I found some related resources from WWDC sessions.

According to https://developer.apple.com/wwdc16/606?time=869 and https://developer.apple.com/wwdc20/10631?time=1481, they use `simdgroup_barrier()` when the threadgroup fits in a single SIMD group, because `threadgroup_barrier()` is more expensive than `simdgroup_barrier()`.

The simdgroup size is normally 32 (as far as I know, 64 on some AMD GPUs), so I think `simdgroup_barrier()` can be used whenever the threadgroup size is no larger than 32 (or 64, respectively).
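
A hedged sketch of what I understand the WWDC advice to mean, assuming a 1D dispatch with a threadgroup of exactly 32 threads (the kernel name and the constant are mine, not from the sessions):

```metal
#include <metal_stdlib>
using namespace metal;

// Assumes the kernel is dispatched with threadgroups of exactly 32 threads,
// i.e. the whole threadgroup fits inside a single simdgroup.
constant uint kGroupSize = 32;

kernel void small_group_reduce(device const float *src [[buffer(0)]],
                               device float       *dst [[buffer(1)]],
                               threadgroup float  *tmp [[threadgroup(0)]],
                               uint tid [[thread_index_in_threadgroup]],
                               uint gid [[thread_position_in_grid]])
{
    tmp[tid] = src[gid];

    // Tree reduction over threadgroup memory. Because the threadgroup is a
    // single simdgroup, the WWDC sessions suggest simdgroup_barrier() is a
    // cheaper replacement for threadgroup_barrier() here.
    for (uint stride = kGroupSize / 2; stride > 0; stride /= 2) {
        simdgroup_barrier(mem_flags::mem_threadgroup);
        if (tid < stride) {
            tmp[tid] += tmp[tid + stride];
        }
    }

    if (tid == 0) {
        dst[gid / kGroupSize] = tmp[0];
    }
}
```

For threadgroups larger than one simdgroup, `threadgroup_barrier(mem_flags::mem_threadgroup)` would still be required.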
