Skip to content

[distributed] Handle object collectives and NCCL.#79034

Closed
kumpera wants to merge 2 commits intopytorch:masterfrom
kumpera:obj_collectives_fix
Closed

[distributed] Handle object collectives and NCCL.#79034
kumpera wants to merge 2 commits intopytorch:masterfrom
kumpera:obj_collectives_fix

Conversation

@kumpera
Copy link
Copy Markdown
Contributor

@kumpera kumpera commented Jun 7, 2022

This fixes all object collectives under NCCL and adds some automated tests for them.

This PR does not fix sending tensors using object collectives.

It simplifies device handling by computing the appropriate one earlier and then ensuring all tensor ops happen on it.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

facebook-github-bot commented Jun 7, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 655d136 (more details on the Dr. CI page):

Expand to see more

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jun 7, 2022
Comment thread test/distributed/test_c10d_object_collectives.py Outdated
@rohan-varma rohan-varma self-requested a review June 7, 2022 20:01
Copy link
Copy Markdown
Contributor

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome refactoring! Looks good overall - only suggestion is around merging all of the tests.

Comment thread test/distributed/test_c10d_object_collectives.py Outdated
Comment thread test/distributed/test_c10d_object_collectives.py Outdated
Comment thread torch/distributed/distributed_c10d.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reminds me, we should probably figure out the root cause issue of #65696, because we really don't want to be using ByteTensor and LongTensor style APIs anymore.

@kumpera kumpera force-pushed the obj_collectives_fix branch from 658c7f9 to 655d136 Compare June 9, 2022 14:25
@rohan-varma rohan-varma self-requested a review June 13, 2022 19:15
Copy link
Copy Markdown
Contributor

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kumpera
Copy link
Copy Markdown
Contributor Author

kumpera commented Jun 13, 2022

@pytorchmergebot merge please

@pytorch-bot
Copy link
Copy Markdown

pytorch-bot Bot commented Jun 13, 2022

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: please

usage: @pytorchbot {merge,revert,rebase,help} ...

Try @pytorchbot help for more info.

@kumpera
Copy link
Copy Markdown
Contributor Author

kumpera commented Jun 13, 2022

@pytorchmergebot merge

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@github-actions
Copy link
Copy Markdown
Contributor

Hey @kumpera.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Jun 15, 2022
Summary:
This fixes all object collectives under NCCL and adds some automated tests for them.

This PR *does not* fix sending tensors using object collectives.

It simplifies device handling by computing the appropriate one earlier and then ensuring all tensor ops happen on it.

Pull Request resolved: #79034
Approved by: https://github.com/rohan-varma

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/4ebb326b75024af600a966308965803cdc3437d9

Reviewed By: malfet

Differential Revision: D37153126

Pulled By: kumpera

fbshipit-source-id: caf01d830545b40d2827c336c350da3cd8ab552c
@facebook-github-bot
Copy link
Copy Markdown
Contributor

@pytorchbot revert -m="Diff reverted internally" -c="ghfirst"

This Pull Request has been reverted by a revert inside Meta. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).)

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

@pytorchbot successfully started a revert job. Check the current status here

pytorchmergebot added a commit that referenced this pull request Jun 15, 2022
This reverts commit 4ebb326.

Reverted #79034 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
malfet added a commit that referenced this pull request Jun 15, 2022
facebook-github-bot pushed a commit that referenced this pull request Jun 15, 2022
…79034)

Test Plan: revert-hammer

Differential Revision:
D37153126

Original commit changeset: caf01d830545

Original Phabricator Diff: D37153126

fbshipit-source-id: fe023b2eaa028f0677997c63a0472d70df381253
facebook-github-bot pushed a commit that referenced this pull request Jun 15, 2022
…ctives and NCCL. (#79034)

Test Plan: revert-hammer

Differential Revision:
D37170190

Original commit changeset: fe023b2eaa02

Original Phabricator Diff: D37153126

fbshipit-source-id: 3f28eefe13ed0867f76c39d5866430393af2c153
pytorchmergebot pushed a commit that referenced this pull request Sep 7, 2022
The docstring for scatter_object_list mentions is doesn't work with NCCL, but this was fixed in #79034

Pull Request resolved: #84596
Approved by: https://github.com/H-Huang
facebook-github-bot pushed a commit that referenced this pull request Sep 8, 2022
Summary:
The docstring for scatter_object_list mentions is doesn't work with NCCL, but this was fixed in #79034

Pull Request resolved: #84596
Approved by: https://github.com/H-Huang

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/e96fb5d58c2accd717f0859b510ae7facb6d6aac

Reviewed By: izaitsevfb

Differential Revision: D39312639

Pulled By: kumpera

fbshipit-source-id: dc1b57b7ad464cf00b44ac6dbfca5349e9fd41b1
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
This fixes all object collectives under NCCL and adds some automated tests for them.

This PR *does not* fix sending tensors using object collectives.

It simplifies device handling by computing the appropriate one earlier and then ensuring all tensor ops happen on it.
Pull Request resolved: pytorch#79034
Approved by: https://github.com/rohan-varma
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
The docstring for scatter_object_list mentions is doesn't work with NCCL, but this was fixed in pytorch#79034

Pull Request resolved: pytorch#84596
Approved by: https://github.com/H-Huang
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue Reverted

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants