Skip to content

BAND, BOR and BXOR for NCCL (all_)reduce should throw runtime errors#42669

Closed
thinking-tower wants to merge 2 commits intopytorch:masterfrom
thinking-tower:bugfix/nccl-bitwise-operators
Closed

BAND, BOR and BXOR for NCCL (all_)reduce should throw runtime errors#42669
thinking-tower wants to merge 2 commits intopytorch:masterfrom
thinking-tower:bugfix/nccl-bitwise-operators

Conversation

@thinking-tower
Copy link
Copy Markdown
Contributor

@thinking-tower thinking-tower commented Aug 6, 2020

cc @rohan-varma
Fixes #41362 #39708

Description

NCCL doesn't support BAND, BOR, BXOR. Since the current mapping doesn't contain any of the mentioned bitwise operator, a default value of ncclSum is used instead.

This PR should provide the expected behaviour where a runtime exception is thrown.

Notes

@dr-ci
Copy link
Copy Markdown

dr-ci Bot commented Aug 6, 2020

💊 CI failures summary and remediations

As of commit 1130ac4 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

ci.pytorch.org: 2 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 7 times.

@rohan-varma rohan-varma self-requested a review August 6, 2020 17:36
Copy link
Copy Markdown
Contributor

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution, it looks great! Requesting changes for the assertRaisesRegex, and there are also a couple of python lint issues. You can view those by clicking on the failed CI job (we should have inline annotations as well, but those seem to not be working right now).

Comment thread torch/lib/c10d/ProcessGroupNCCL.cpp Outdated
Comment thread torch/lib/c10d/ProcessGroupNCCL.cpp Outdated
Comment thread test/distributed/test_c10d.py Outdated
@thinking-tower thinking-tower changed the title Use BAND, BOR and BXOR for NCCL (all_)reduce throws a runtime error. BAND, BOR and BXOR for NCCL (all_)reduce should throw runtime errors Aug 7, 2020
@ailzhang ailzhang added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Aug 7, 2020
Copy link
Copy Markdown
Contributor

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @thinking-tower!

Copy link
Copy Markdown
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@rohan-varma merged this pull request in 6ebc050.

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…ytorch#42669)

Summary:
cc rohan-varma
Fixes pytorch#41362 pytorch#39708

# Description
NCCL doesn't support `BAND, BOR, BXOR`. Since the [current mapping](https://github.com/pytorch/pytorch/blob/fc0cbf7dc77596dde642dd242bbb9c2bfab61bb7/torch/lib/c10d/ProcessGroupNCCL.cpp#L39) doesn't contain any of the mentioned bitwise operator, a default value of `ncclSum` is used instead.

This PR should provide the expected behaviour where a runtime exception is thrown.

# Notes
- The way I'm throwing exceptions is derived from [ProcessGroupGloo.cpp](https://github.com/pytorch/pytorch/blob/fc0cbf7dc77596dde642dd242bbb9c2bfab61bb7/torch/lib/c10d/ProcessGroupGloo.cpp#L101)

Pull Request resolved: pytorch#42669

Reviewed By: ezyang

Differential Revision: D22996295

Pulled By: rohan-varma

fbshipit-source-id: 83a9fedf11050d2890f9f05ebcedf53be0fc3516
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

torch.distributed NCCL backend does not support bitwise reduction ops

6 participants