use allgatherv for sparse all reduce #23917

Closed
zhaojuanmao wants to merge 1 commit into pytorch:master from zhaojuanmao:export-D16664985

Conversation

@zhaojuanmao
Contributor

Summary:
Per #22226: the current sparse allreduce in ProcessGroupGloo pads the indices and values tensors to the maximum length across all processes and then performs a regular allgather (because the padded tensors then have equal size across processes). Instead, we can use allgatherv. This is mostly a win for memory usage when there is severe size imbalance between processes.
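
For illustration, here is a rough, self-contained sketch in plain PyTorch (hypothetical data, not the actual Gloo C++ path) of the padded allgather versus an allgatherv-style exchange:

```python
import torch

# Two hypothetical ranks contribute different numbers of non-zeros (nnz)
# to the same 1-D sparse tensor of size 8.
per_rank_indices = [torch.tensor([[0, 3]]), torch.tensor([[1, 2, 5, 7]])]
per_rank_values = [torch.tensor([1.0, 2.0]), torch.tensor([3.0, 4.0, 5.0, 6.0])]

# Padded allgather (previous behavior): every rank pads its values to max_nnz
# so all chunks have equal size; the padding is pure memory overhead.
max_nnz = max(v.numel() for v in per_rank_values)
padded = [torch.cat([v, v.new_zeros(max_nnz - v.numel())]) for v in per_rank_values]
print("padded chunk sizes:", [p.numel() for p in padded])

# allgatherv-style exchange (this PR): each rank sends exactly its own nnz
# elements and receives differently sized chunks, so no padding is needed.
gathered_indices = list(per_rank_indices)
gathered_values = list(per_rank_values)

result = torch.sparse_coo_tensor(
    torch.cat(gathered_indices, dim=1), torch.cat(gathered_values), (8,)
).coalesce()
print(result)
```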

close #22226

Differential Revision: D16664985

@pytorchbot pytorchbot added the oncall: distributed label (Add this issue/PR to distributed oncall triage queue) on Aug 7, 2019

@pietern pietern left a comment

Nice work! Mostly style/naming nits.

Also, I think that setAllGatherVOutput can be named just setOutput since the counts setter will be applicable to e.g. gatherv, scatterv as well.

Comment thread torch/lib/c10d/ProcessGroupGloo.cpp Outdated

Path style.

Comment thread torch/lib/c10d/ProcessGroupGloo.cpp Outdated

Is the cast needed?

Comment thread torch/lib/c10d/ProcessGroupGloo.cpp Outdated

Whitespace -- you can run clang-format to fix up style before committing.

Comment thread torch/lib/c10d/ProcessGroupGloo.cpp Outdated

This can be set as const outside the loop (and even at the beginning of the function).

Comment thread torch/lib/c10d/ProcessGroupGloo.cpp Outdated

This is not the number of dimensions but the number of elements in the dense dimensions, so wouldn't denseNumel (or similar) be more descriptive here?
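
For context, a small example of the quantity in question for a hybrid sparse tensor (denseNumel below is only the suggested name, not an existing API):

```python
import torch

# A hybrid sparse tensor: 1 sparse dimension, and each stored value is a
# 3x4 dense slice.
i = torch.tensor([[0, 2]])                  # nnz = 2
v = torch.randn(2, 3, 4)                    # one 3x4 dense slice per non-zero
t = torch.sparse_coo_tensor(i, v, (5, 3, 4))

dense_numel = v[0].numel()                  # elements per value slice: 3 * 4 = 12
print(t.sparse_dim(), t.dense_dim(), dense_numel)  # -> 1 2 12
```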

@pietern
Contributor

pietern commented Aug 8, 2019

The tests show something is wrong in the implementation.

@pietern
Contributor

pietern commented Aug 28, 2019

@zhaojuanmao It needs a rebase now because there seems to be a conflict with master.

Do you have time to continue this?

@zhaojuanmao
Contributor Author

The CUDA test failed and needs some time to debug; I can possibly work on it next week.

@zhaojuanmao zhaojuanmao force-pushed the export-D16664985 branch 2 times, most recently from 7806acd to c345402 on September 10, 2019 at 22:23
@zhaojuanmao zhaojuanmao requested a review from pietern on September 16, 2019 at 16:29

@pietern pietern left a comment

Can you clarify the need for the contiguous() call?

Don't worry about the const-ness in the loop body if this doesn't require further changing.

Comment thread torch/lib/c10d/ProcessGroupGloo.cpp Outdated

The inputs are always coalesced before running the algorithm, and I was thinking that implies they'll be contiguous.

Is this not the case?
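
A minimal way to check this locally (CPU-only sketch; the failing case discussed below involves tensors copied from CUDA):

```python
import torch

i = torch.tensor([[3, 0, 3], [1, 2, 1]])
v = torch.tensor([1.0, 2.0, 3.0])
t = torch.sparse_coo_tensor(i, v, (4, 4)).coalesce()

# If coalesce() always repacks indices/values, both of these should print True.
print(t.indices().is_contiguous(), t.values().is_contiguous())
```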

Comment thread torch/lib/c10d/ProcessGroupGloo.cpp Outdated

These can both be const.

Comment thread torch/lib/c10d/ProcessGroupGloo.cpp Outdated

These can both be const.

Comment thread torch/lib/c10d/ProcessGroupGloo.cpp Outdated

Same as above -- I don't think this is needed.

@zhaojuanmao
Contributor Author

@pietern Some tensors copied from CUDA are not contiguous; this is what I printed out locally while debugging. If I directly print out the values using the tensor data pointer, some of them are garbage values.

@pietern
Contributor

pietern commented Sep 16, 2019

This was against tensors when running tests? Do you have a repro for this?

I'm surprised this was not an issue before, because the max_nnz padding may have had the same issue and caused garbage crashes as well.

@zhaojuanmao
Contributor Author

This is the same for the existing code: I printed out locally that some tensors copied from CUDA are not contiguous. But the existing code calls copy_ into contiguous buffer tensors before calling allgather, and that copy_ is similar to calling contiguous().
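
A sketch of the two equivalent fixes being compared here (made-up shapes, not the real ProcessGroupGloo buffers):

```python
import torch

# Reading a tensor through its raw data pointer is only meaningful when the
# tensor is contiguous; a transposed view is a simple non-contiguous example.
src = torch.arange(12).reshape(3, 4).t()
assert not src.is_contiguous()

# Existing code path: copy_ into a freshly allocated contiguous buffer.
buf = torch.empty(src.shape, dtype=src.dtype)   # contiguous by construction
buf.copy_(src)

# Equivalent fix: ask for a contiguous copy directly.
cont = src.contiguous()

assert buf.is_contiguous() and cont.is_contiguous()
assert torch.equal(buf, cont)
```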

@zhaojuanmao
Contributor Author

Yes, it is for the tensors in the unit tests. I printed the values and the is_contiguous() flag; for the tensors that show garbage values via the data pointer, is_contiguous() is false.

@pietern
Contributor

pietern commented Sep 16, 2019

Ah yes, of course!

Can you create an issue for this? We call coalesce, which gives the underlying code the opportunity to coalesce as well. I don't think there is a good reason for producing non-contiguous indices/values if you're creating them anyway, so perhaps this is a legit bug.

Let's merge this now and then update if this is indeed a bug and there is a resolution.

@zhaojuanmao
Contributor Author

I asked whether it is expected to have a non-contiguous indices tensor after calling coalesce() on a sparse tensor here: https://discuss.pytorch.org/t/indices-tensor-is-not-contiguous-after-calling-sparse-tensor-coalesce/56198. Based on the code at https://our.internmc.facebook.com/intern/diffusion/FBS/browse/master/fbcode/caffe2/aten/src/ATen/native/sparse/SparseTensor.cpp?lines=350, it looks like the indices tensor should be contiguous.

If the answer is that this is not expected, I will file a bug issue.

@zhaojuanmao
Contributor Author

@pytorchbot retest

@pietern
Contributor

pietern commented Sep 18, 2019

The "ensure no tabs" linter is for unrelated changes on master (which should be fixed separately).

@zhaojuanmao
Contributor Author

@pietern Yesterday the macOS build failed because it somehow could not recognize "long" symbols. I changed it to int64_t, and now the tests pass.

Will land it soon.

@facebook-github-bot facebook-github-bot left a comment

@zhaojuanmao is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Summary:
per pytorch#22226, The current sparse allreduce in ProcessGroupGloo pads the indices and values tensors to the maximum length across all processes and then performs a regular allgather (because they'll have equal size across processes). Instead, we can use allgatherv. This is mostly a win for memory usage if there is severe size imbalance between processes.

close pytorch#22226
Pull Request resolved: pytorch#23917

Test Plan:
buck run mode/dev-nosan caffe2/test:c10d -- test_c10d.ProcessGroupGlooTest.test_sparse_allreduce_basics

buck run mode/dev-nosan caffe2/test:c10d -- test_c10d.ProcessGroupGlooTest.test_sparse_allreduce_basics_cuda

buck run mode/dev-nosan caffe2/test:c10d -- test_c10d.ProcessGroupGlooTest.test_sparse_allreduce_checks

Differential Revision: D16664985

Pulled By: zhaojuanmao

fbshipit-source-id: a9a139da2b64617d2bb7f0b12f272e920140e5d1
@facebook-github-bot
Contributor

@zhaojuanmao merged this pull request in ed09704.

@zhaojuanmao zhaojuanmao deleted the export-D16664985 branch September 16, 2020 23:47
pytorchmergebot pushed a commit that referenced this pull request Oct 17, 2022
#23917 switched to using allgatherv instead of allgather for gloo sparse all-reduce. This PR removes a comment saying to use allgatherv if available since that has already been done.
Pull Request resolved: #87018
Approved by: https://github.com/H-Huang
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
pytorch#23917 switched to using allgatherv instead of allgather for gloo sparse all-reduce. This PR removes a comment saying to use allgatherv if available since that has already been done.
Pull Request resolved: pytorch#87018
Approved by: https://github.com/H-Huang