use allgatherv for sparse all reduce #23917
zhaojuanmao wants to merge 1 commit into pytorch:master
Conversation
pietern
left a comment
Nice work! Mostly style/naming nits.
Also, I think that setAllGatherVOutput can be named just setOutput since the counts setter will be applicable to e.g. gatherv, scatterv as well.
Whitespace -- you can run clang-format to fix up style before committing.
This can be set as const outside the loop (and even at the beginning of the function).
This is not the number of dimensions but the number of elements in the dense dimensions, so wouldn't denseNumel (or similar) be more descriptive here?
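For readers less familiar with sparse COO layout, here is a small Python sketch of the distinction being made (illustrative only; the tensor shape and variable names are made up, not taken from this PR):

```python
import torch

# A 3-D tensor converted to sparse COO with 1 sparse dimension and 2 dense dimensions.
t = torch.randn(4, 3, 5).to_sparse(1).coalesce()
values = t.values()              # shape: [nnz, 3, 5]

dense_dims = t.dense_dim()       # 2  -- the number of dense dimensions
dense_numel = values[0].numel()  # 15 -- the number of elements in the dense dimensions
print(dense_dims, dense_numel)   # the quantity used here is the latter, hence "denseNumel"
```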
The tests show something is wrong in the implementation.
@zhaojuanmao It needs a rebase now because there seems to be a conflict with master. Do you have time to continue this?
The CUDA test failed and needs some time to debug; I can possibly work on it next week.
Force-pushed from 70f8d17 to 2c04cc3
Force-pushed from 7806acd to c345402
Force-pushed from c345402 to 2837625
pietern
left a comment
Can you clarify the need for the contiguous() call?
Don't worry about the const-ness in the loop body if this doesn't require further changing.
The inputs are always coalesced before running the algorithm and I was thinking that that implies they'll be contiguous.
Is this not the case?
Same as above -- I don't think this is needed.
@pietern Some tensors copied from CUDA are not contiguous; this is what I printed out locally while debugging. If I directly print out the values through the tensor data pointer, some of them are garbage values.
This was against tensors when running tests? Do you have a repro for this? I'm surprised this was not an issue before, because the
This is the same for the existing code; I printed it out locally and some tensors copied from CUDA are not contiguous. But the existing code calls copy_ into contiguous buffer tensors before calling allgather, and that copy_ is similar to a contiguous() call.
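To illustrate the equivalence described above, here is a minimal sketch (not the ProcessGroupGloo code itself) showing that copy_ into a freshly allocated buffer produces contiguous data, much like calling .contiguous():

```python
import torch

x = torch.arange(12).reshape(3, 4).t()      # transposed view, non-contiguous
buf = torch.empty(x.size(), dtype=x.dtype)  # fresh allocation, contiguous
buf.copy_(x)                                # copy_ fills the contiguous buffer

print(x.is_contiguous(), buf.is_contiguous())  # False True
print(torch.equal(x.contiguous(), buf))        # True
```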
Yes, it is for the tensors in the unit tests. I printed the values and the is_contiguous() flag: for the tensors whose data pointer shows garbage values, is_contiguous() is false.
Ah yes, of course! Can you create an issue for this? We call coalesce, which gives the underlying code the opportunity to coalesce as well. I don't think there is a good reason for not creating indices/values in a non-coalesced tensor if you're creating it anyway, so perhaps this is a legit bug. Let's merge this now and then update if this is indeed a bug and there is a resolution.
Force-pushed from 2837625 to 04960df
I asked whether it is expected to have a non-contiguous indices tensor after calling coalesce() on a sparse tensor here: https://discuss.pytorch.org/t/indices-tensor-is-not-contiguous-after-calling-sparse-tensor-coalesce/56198. Based on the code at https://our.internmc.facebook.com/intern/diffusion/FBS/browse/master/fbcode/caffe2/aten/src/ATen/native/sparse/SparseTensor.cpp?lines=350, it looks like the indices tensor should be contiguous. If the answer is that this is not expected, I will create a bug issue.
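A minimal check along the lines of that forum question (assuming a CUDA device is available; this is an illustration, not a confirmed repro of the failure seen in the unit tests):

```python
import torch

# Build a sparse COO tensor on CUDA, coalesce it, and inspect whether the
# indices/values tensors report themselves as contiguous.
i = torch.tensor([[0, 2, 1], [1, 0, 2]])
v = torch.tensor([3.0, 4.0, 5.0])
s = torch.sparse_coo_tensor(i, v, (3, 3), device="cuda").coalesce()

print(s.indices().is_contiguous(), s.values().is_contiguous())
# Copy back to CPU, as the allreduce path does, and check again.
print(s.indices().cpu().is_contiguous(), s.values().cpu().is_contiguous())
```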
@pytorchbot retest
Force-pushed from 04960df to d1eeacc
Force-pushed from d1eeacc to 612c2c5
The "ensure no tabs" linter failure is for unrelated changes on master (which should be fixed separately).
@pietern Yesterday the macOS build failed because it somehow could not recognize the "long" symbol. I changed it to int64_t and it now passes the tests. Will land it soon.
facebook-github-bot
left a comment
@zhaojuanmao is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Per pytorch#22226, the current sparse allreduce in ProcessGroupGloo pads the indices and values tensors to the maximum length across all processes and then performs a regular allgather (because they'll have equal size across processes). Instead, we can use allgatherv. This is mostly a win for memory usage if there is severe size imbalance between processes.
Closes pytorch#22226
Pull Request resolved: pytorch#23917
Test Plan:
buck run mode/dev-nosan caffe2/test:c10d -- test_c10d.ProcessGroupGlooTest.test_sparse_allreduce_basics
buck run mode/dev-nosan caffe2/test:c10d -- test_c10d.ProcessGroupGlooTest.test_sparse_allreduce_basics_cuda
buck run mode/dev-nosan caffe2/test:c10d -- test_c10d.ProcessGroupGlooTest.test_sparse_allreduce_checks
Differential Revision: D16664985
Pulled By: zhaojuanmao
fbshipit-source-id: a9a139da2b64617d2bb7f0b12f272e920140e5d1
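As a rough illustration of the memory argument in the summary, here is a single-process Python sketch (not the actual C++ ProcessGroupGloo implementation) comparing the pad-to-max allgather layout with an allgatherv-style layout driven by per-rank element counts:

```python
import torch

# Simulated per-rank values tensors with a severe size imbalance.
per_rank = [torch.randn(n) for n in (3, 1000, 5)]

# allgather-style: every rank pads its data to the largest length before gathering.
max_len = max(t.numel() for t in per_rank)
padded = [torch.zeros(max_len) for _ in per_rank]
for dst, src in zip(padded, per_rank):
    dst[: src.numel()].copy_(src)
allgather_elems = sum(t.numel() for t in padded)   # 3000 elements gathered

# allgatherv-style: exchange counts first, then gather exactly-sized chunks.
counts = [t.numel() for t in per_rank]
output = torch.empty(sum(counts))
offset = 0
for t, n in zip(per_rank, counts):
    output[offset : offset + n].copy_(t)
    offset += n
allgatherv_elems = output.numel()                  # 1008 elements gathered

print(allgather_elems, allgatherv_elems)
```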
Force-pushed from 612c2c5 to e7800ca
@zhaojuanmao merged this pull request in ed09704.
#23917 switched to using allgatherv instead of allgather for gloo sparse all-reduce. This PR removes a comment saying to use allgatherv if available, since that has already been done.
Pull Request resolved: #87018
Approved by: https://github.com/H-Huang