[distributed] Handle object collectives and NCCL. #79034
kumpera wants to merge 2 commits into pytorch:master
✅ Dr. CI: No failures (0 pending) as of commit 655d136.
rohan-varma left a comment:
Awesome refactoring! Looks good overall - only suggestion is around merging all of the tests.
This reminds me, we should probably figure out the root cause of #65696, because we really don't want to be using ByteTensor- and LongTensor-style APIs anymore.
Force-pushed from 658c7f9 to 655d136 (Compare)
@pytorchmergebot merge please

❌ 🤖 pytorchbot command failed: Try

@pytorchmergebot merge

@pytorchbot successfully started a merge job. Check the current status here

Hey @kumpera.
Summary: This fixes all object collectives under NCCL and adds some automated tests for them. This PR *does not* fix sending tensors using object collectives. It simplifies device handling by computing the appropriate device earlier and then ensuring all tensor ops happen on it.
Pull Request resolved: #79034
Approved by: https://github.com/rohan-varma
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/4ebb326b75024af600a966308965803cdc3437d9
Reviewed By: malfet
Differential Revision: D37153126
Pulled By: kumpera
fbshipit-source-id: caf01d830545b40d2827c336c350da3cd8ab552c
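For context on what the object collectives do: they pickle each Python object into a byte buffer, ship that buffer as a byte tensor through a regular collective, and unpickle it on the receiving rank. A minimal, torch-free sketch of that serialization round trip (the helper names here are illustrative, not the actual torch.distributed internals):

```python
import io
import pickle

def object_to_bytes(obj):
    # Serialize an arbitrary picklable object into raw bytes; in the real
    # implementation these bytes back a byte tensor that the collective ships.
    buf = io.BytesIO()
    pickle.dump(obj, buf)
    return buf.getvalue()

def bytes_to_object(data):
    # Inverse step on the receiving rank: turn the shipped bytes back into
    # the original Python object.
    return pickle.loads(data)

payload = {"epoch": 3, "tags": ["a", "b"]}
roundtripped = bytes_to_object(object_to_bytes(payload))
```

Because the payload travels as bytes, anything picklable can be broadcast or scattered, at the cost of a serialization step on each rank.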
@pytorchbot revert -m="Diff reverted internally" -c="ghfirst"

This Pull Request has been reverted by a revert inside Meta. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).

@pytorchbot successfully started a revert job. Check the current status here
This reverts commit 4ebb326. Reverted #79034 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
…79034)
Test Plan: revert-hammer
Differential Revision: D37153126
Original commit changeset: caf01d830545
Original Phabricator Diff: D37153126
fbshipit-source-id: fe023b2eaa028f0677997c63a0472d70df381253

…ctives and NCCL. (#79034)
Test Plan: revert-hammer
Differential Revision: D37170190
Original commit changeset: fe023b2eaa02
Original Phabricator Diff: D37153126
fbshipit-source-id: 3f28eefe13ed0867f76c39d5866430393af2c153
The docstring for scatter_object_list says it doesn't work with NCCL, but this was fixed in #79034.
Pull Request resolved: #84596
Approved by: https://github.com/H-Huang
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/e96fb5d58c2accd717f0859b510ae7facb6d6aac
Reviewed By: izaitsevfb
Differential Revision: D39312639
Pulled By: kumpera
fbshipit-source-id: dc1b57b7ad464cf00b44ac6dbfca5349e9fd41b1
This reverts commit 9c6d2d8. Reverted pytorch#79034 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
…torch#79034)"" This reverts commit f2cac96.
This fixes all object collectives under NCCL and adds some automated tests for them.
This PR does not fix sending tensors using object collectives.
It simplifies device handling by computing the appropriate one earlier and then ensuring all tensor ops happen on it.
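The device-handling simplification described above can be sketched without torch. This toy stand-in (`pick_device`, `FakeTensor`, and `broadcast_object_sketch` are hypothetical names, not the real torch.distributed code) illustrates the idea of resolving the device once up front and routing every subsequent tensor op through it:

```python
def pick_device(backend: str, local_rank: int) -> str:
    # NCCL collectives only operate on CUDA tensors, so under NCCL the
    # serialized object tensors must live on this rank's GPU; other
    # backends (e.g. gloo) can keep them on CPU.
    return f"cuda:{local_rank}" if backend == "nccl" else "cpu"

class FakeTensor:
    # Minimal stand-in for a tensor that remembers which device it lives on.
    def __init__(self, data, device="cpu"):
        self.data = data
        self.device = device

    def to(self, device):
        return FakeTensor(self.data, device)

def broadcast_object_sketch(obj_bytes: bytes, backend: str, local_rank: int):
    device = pick_device(backend, local_rank)        # computed once, early
    tensor = FakeTensor(list(obj_bytes)).to(device)  # every op uses `device`
    # ... the real code would now run the byte-size and byte-content
    # collectives on `tensor`, which is already on the right device ...
    return tensor

t = broadcast_object_sketch(b"abc", "nccl", 0)
```

Centralizing the device decision means no individual tensor op has to guess where it should run, which is what previously broke the object collectives under NCCL.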