-
Notifications
You must be signed in to change notification settings - Fork 27.7k
[distributed] [nccl] should check whether reduceop is supported #39708
Copy link
Copy link
Closed
Labels
module: correctness (silent)issue that returns an incorrect result silentlyissue that returns an incorrect result silentlymodule: ncclProblems related to nccl supportProblems related to nccl supportoncall: distributedAdd this issue/PR to distributed oncall triage queueAdd this issue/PR to distributed oncall triage queuetriagedThis issue has been looked at a team member, and triaged and prioritized into an appropriate moduleThis issue has been looked at a team member, and triaged and prioritized into an appropriate module
Metadata
Metadata
Assignees
Labels
module: correctness (silent)issue that returns an incorrect result silentlyissue that returns an incorrect result silentlymodule: ncclProblems related to nccl supportProblems related to nccl supportoncall: distributedAdd this issue/PR to distributed oncall triage queueAdd this issue/PR to distributed oncall triage queuetriagedThis issue has been looked at a team member, and triaged and prioritized into an appropriate moduleThis issue has been looked at a team member, and triaged and prioritized into an appropriate module
Currently the input
ReduceOpgoes through a map to becomencclRedOp_t, e.g.,pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp
Line 718 in 0251ba6
However, this doesn't check whether
opts.reduceOpis supported or not, i.e., whether it is in the map or not. Whenopts.reduceOpis not in the map, the[]syntax will create an entry instead, using the default ctor ofncclRedOp_t, i.e., gettingncclSum!This causes correctness issues silently!!!!
The maps should be marked const, and checking code should be added.
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar