[distributed] [nccl] should check whether reduceop is supported #39708

@ssnl

Description

Currently the input ReduceOp goes through a map to become an ncclRedOp_t, e.g.,

`ncclOp[opts.reduceOp]`

However, this doesn't check whether opts.reduceOp is supported, i.e., whether it is actually in the map. When opts.reduceOp is not in the map, the `[]` operator inserts a new entry instead, value-initializing the ncclRedOp_t — which yields ncclSum (value 0)!

This silently causes correctness issues!

The maps should be marked const, and a check should be added so that unsupported ops raise an error instead of silently falling back to ncclSum.

Repro (run each snippet in a separate process):

```python
# rank 0
import torch
torch.distributed.init_process_group('nccl', init_method='tcp://localhost:10402', world_size=2, rank=0)
x = torch.zeros(3, device=0).fill_(2.4)
torch.distributed.reduce(x, 0, torch.distributed.ReduceOp.BXOR)
print(x)
# tensor([4.8000, 4.8000, 4.8000], device='cuda:0')  <- BXOR silently became SUM
```

```python
# rank 1
import torch
torch.distributed.init_process_group('nccl', init_method='tcp://localhost:10402', world_size=2, rank=1)
x = torch.zeros(3, device=1).fill_(2.4)
torch.distributed.reduce(x, 0, torch.distributed.ReduceOp.BXOR)
```

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar

    Labels

    - module: correctness (silent) — issue that returns an incorrect result silently
    - module: nccl — problems related to NCCL support
    - oncall: distributed — add this issue/PR to distributed oncall triage queue
    - triaged — this issue has been looked at by a team member, and triaged and prioritized into an appropriate module
