Question
I came across something interesting: I had assumed that all-reduce and reduce-scatter both take the same (k - 1) steps.
However, reduce-scatter only uses send and recv, while all-reduce uses directSend and directRecv.
https://github.com/NVIDIA/nccl/blob/master/src/device/all_reduce.h#L48
https://github.com/NVIDIA/nccl/blob/master/src/device/reduce_scatter.h#L42
What is the difference between the two, and why does it exist? Is it related to performance?
PS: GPT-5 told me that reduce-scatter has fewer steps, so it saves the time spent on pointer conversion. Is that right?
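
To make the step counts concrete, here is a minimal Python sketch of the textbook ring algorithms on k ranks. It is a toy simulation, not NCCL's code: the function names `ring_reduce_scatter` and `ring_all_reduce` and the chunk indexing are assumptions made for illustration. In this textbook form, reduce-scatter takes k - 1 steps, and all-reduce is that same reduce phase followed by a further k - 1 all-gather steps.

```python
def ring_reduce_scatter(data):
    """Toy ring reduce-scatter (illustrative, not NCCL's implementation).

    data[r][c] is rank r's contribution to chunk c (k ranks, k chunks).
    Returns (reduced, steps): reduced[r] is the fully summed chunk r,
    which ends up on rank r.
    """
    k = len(data)
    buf = [row[:] for row in data]  # each rank's working buffer
    steps = 0
    for s in range(k - 1):
        # Snapshot all sends first so every rank acts "simultaneously".
        # In step s, rank r sends chunk (r - s - 1) mod k to rank r + 1.
        sends = [(r, (r - s - 1) % k, buf[r][(r - s - 1) % k]) for r in range(k)]
        for r, c, val in sends:
            buf[(r + 1) % k][c] += val  # the receiver reduces into its copy
        steps += 1
    return [buf[r][r] for r in range(k)], steps


def ring_all_reduce(data):
    """Toy ring all-reduce: reduce-scatter followed by a ring all-gather."""
    k = len(data)
    reduced, steps = ring_reduce_scatter(data)
    out = [[None] * k for _ in range(k)]
    for r in range(k):
        out[r][r] = reduced[r]
    for s in range(k - 1):
        # In step s, rank r forwards chunk (r - s) mod k to rank r + 1.
        sends = [(r, (r - s) % k, out[r][(r - s) % k]) for r in range(k)]
        for r, c, val in sends:
            out[(r + 1) % k][c] = val  # plain copy, no reduction
        steps += 1
    return out, steps


if __name__ == "__main__":
    k = 4
    data = [[10 * r + c for c in range(k)] for r in range(k)]
    print(ring_reduce_scatter(data))  # reduced chunks and k - 1 steps
    print(ring_all_reduce(data)[1])   # 2 * (k - 1) steps
```

The sketch only counts steps; it does not model how NCCL moves the data, so it says nothing about why the direct variants appear in all_reduce.h.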