
torch.distributed NCCL backend does not support bitwise reduction ops #41362

@rohan-varma

🐛 Bug

The documentation at https://pytorch.org/docs/stable/distributed.html specifies that BAND, BOR, and BXOR are supported reduction operators; however, they do not work with all_reduce when using the NCCL backend.
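A minimal repro sketch (not taken from the issue itself; it assumes a single host with at least two CUDA GPUs, and the rendezvous address/port and tensor values are arbitrary):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # Illustrative rendezvous settings, not from the issue.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # BAND is listed as supported in the docs, but misbehaves on NCCL.
    t = torch.tensor([6], dtype=torch.int32, device=f"cuda:{rank}")  # 0b110 on every rank
    dist.all_reduce(t, op=dist.ReduceOp.BAND)
    # Expected: 6 (6 & 6); with this bug we instead get 12 (6 + 6).
    print(f"rank {rank}: {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```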

Looking at the code, there is no entry for the bitwise operators in the map used to look up which NCCL operation to run. When the key is missing, the map default-constructs a ncclRedOp_t (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#ncclredop-t), which ends up incorrectly mapping these reduction types to ncclSum. As a result, requesting any of these bitwise reduction ops silently performs a sum instead.
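One way to observe the silent fallback described above is to compare the all-reduced result against what ncclSum would produce. The sketch below assumes an already-initialized NCCL process group; the tensor values and helper name are illustrative, not from the issue:

```python
import torch
import torch.distributed as dist

def check_bitwise_fallback(rank, world_size):
    """Sketch: with the missing mapping, every bitwise op comes back as a sum."""
    device = f"cuda:{rank}"
    for op in (dist.ReduceOp.BAND, dist.ReduceOp.BOR, dist.ReduceOp.BXOR):
        # Same value on every rank, so the expected results are easy to reason about.
        t = torch.full((4,), 6, dtype=torch.int32, device=device)
        what_sum_would_give = torch.full_like(t, 6 * world_size)
        dist.all_reduce(t, op=op)
        # With the bug, this holds for all three bitwise ops, even though
        # BAND/BOR should give 6 and BXOR should give 0 or 6 here.
        assert torch.equal(t, what_sum_would_give), f"{op} did not fall back to ncclSum"
```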

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski

Labels

    better-engineering (Relatively self-contained tasks for better engineering contributors)
    module: bootcamp (We plan to do a full writeup on the issue, and then get someone to do it for onboarding)
    module: nccl (Problems related to NCCL support)
    oncall: distributed (Add this issue/PR to distributed oncall triage queue)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
