[dist] handle discontiguous allgather/reducescatter inputs#163712
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163712

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 unrelated failure)

As of commit 28e7772 with merge base 8e6b0c7:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
I wonder if we have to check the other collectives

Other collectives that are not doing a local copy are fine. Collectives that have channels-last inputs on some ranks and contiguous inputs on others will produce silently wrong results, but we have no way to check for that.
@pytorchbot merge

Merge started
Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed
Reason: 1 job has failed, first few of them are: trunk / inductor-build / build
    return at::empty(sizes, t.options());
  } else {
    // memory-dense, but not necessarily contiguous tensor
    std::vector<int64_t> strides{t.numel()};
This always allocates the vector with the wrong length; you might want to explicitly reserve() based on the final length, emplace_back the first element, then do the insertion. That should prevent the std::vector from being reallocated.
  }
  auto& t = tensors[0];
  at::DeviceGuard gpuGuard(t.device());
  std::vector<int64_t> sizes{static_cast<int64_t>(tensors.size())};
Same here: reserve first, emplace_back the first value, then do the insertion.

pre-existing condition
@pytorchbot merge -i

Merge started
Your change will be merged while ignoring the following 2 checks: trunk / inductor-build / build, trunk / linux-jammy-rocm-py3.10 / test (default, 1, 2, linux.rocm.gpu.gfx942.1). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot cherry-pick --onto release/2.9 -c critical --fixes #163483

Fixes #163483 Pull Request resolved: #163712 Approved by: https://github.com/ezyang, https://github.com/kwen2501 (cherry picked from commit 71eec6a)

Cherry picking #163712
The cherry pick PR is at #163987 and it is linked with issue #163483. The following tracker issues are updated:
[dist] handle discontiguous allgather/reducescatter inputs (#163712) Fixes #163483 Pull Request resolved: #163712 Approved by: https://github.com/ezyang, https://github.com/kwen2501 (cherry picked from commit 71eec6a) Co-authored-by: Natalia Gimelshein <ngimel@meta.com>
Fixes #163483 Pull Request resolved: #163712 Approved by: https://github.com/ezyang, https://github.com/kwen2501
…g results (#166181) Summary: Per title cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta msaroufim dcci Reviewed By: kwen2501 Differential Revision: D85457960 Pulled By: ngimel
…tiguous (#166181) Per title Pull Request resolved: #166181 Approved by: https://github.com/kwen2501
…tiguous (#166181) Per title Pull Request resolved: #166181 Approved by: https://github.com/kwen2501 (cherry picked from commit 2efcf3c)
Fixes #163483
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci