Metal offers both threadgroup_barrier()
and simdgroup_barrier()
. I understand the need for threadroup barriers — it would not be possible to rely on well cooperation between threads in a threadgroup without them, as different threads can execute on different SIMD partitions at different times. But I don't really get the simdgroup_barrier()
— it was my impression that all threads in a simdgroup execute in lockstep and this if one thread in a simdgroup makes progress, all other active threads in the simdgroup are also guaranteed to make progress. If this were not the case we'd need to insert simdgroup barrier pretty much any time we read or write any storage or perform SIMD-scoped operations. It doesn't seem like Apple uses simdgroup_barrier()
in any of their sample code. In fact, it seems like it's a no-op on current Apple Silicon hardware.
Is there a situation when I need to use simdgroup barriers or is this a superfluous operation?
P.S. It seems that Apple engineers are as confused by this as I am, see https://github.com/ml-explore/mlx/blame/1f6ab6a556045961c639735efceebbee7cce814d/mlx/backend/metal/kernels/scan.metal#L355