torch.distributed NCCL backend does not support bitwise reduction ops #41362
Closed
Labels
- better-engineering: Relatively self-contained tasks for better engineering contributors
- module: bootcamp: We plan to do a full writeup on the issue, and then get someone to do it for onboarding
- module: nccl: Problems related to nccl support
- oncall: distributed: Add this issue/PR to distributed oncall triage queue
- triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
🐛 Bug
The documentation at https://pytorch.org/docs/stable/distributed.html specifies that BAND, BOR, and BXOR are supported reduction operators; however, they do not work with all_reduce when using the NCCL backend.

We can see in the code that there is no mapping for the bitwise operators in the map used to look up which NCCL operation to run. When the mapping is missing, the map default-constructs a ncclRedOp_t (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#ncclredop-t), which ends up incorrectly mapping these reduction types to ncclSum. This means that requesting any of these bitwise reduction ops silently performs a sum instead.

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski
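To make the discrepancy concrete, here is a minimal pure-Python sketch (no GPUs or NCCL required) of what ReduceOp.BOR should compute across ranks versus the sum that the defaulted mapping actually produces. The rank values are made up for illustration:

```python
from functools import reduce
import operator

# Hypothetical per-rank integer values (one value per rank, for simplicity).
rank_values = [0b0011, 0b0101, 0b1000]

# What ReduceOp.BOR should compute across ranks: a bitwise OR.
expected_bor = reduce(operator.or_, rank_values)  # 0b1111 == 15

# What the NCCL backend does when the op is missing from the op map:
# the default-constructed ncclRedOp_t is value 0, which is ncclSum,
# so every rank instead receives the sum of the inputs.
actual_sum = sum(rank_values)  # 16

print(f"BOR expected: {expected_bor}, NCCL returns: {actual_sum}")
```

Because 15 != 16 only by coincidence of these inputs being nearly disjoint bit patterns, real workloads can produce results that look plausible, which makes the silent fallback to sum especially hard to notice.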