[c10d] Working async version of AllGather, test fix and compiler warnings, and CI#10932
Closed
teng-li wants to merge 2 commits intopytorch:masterfrom
Closed
[c10d] Working async version of AllGather, test fix and compiler warnings, and CI#10932teng-li wants to merge 2 commits intopytorch:masterfrom
teng-li wants to merge 2 commits intopytorch:masterfrom
Conversation
109e585 to
a96580a
Compare
08a738f to
7a43626
Compare
Contributor
|
Looking at the code, was it the test case not allocating enough memory for the output tensors? |
pietern
approved these changes
Aug 28, 2018
Contributor
Author
|
@pietern The biggest issue was that each big tensor (flattened) should be on a different GPU, that was causing the error. |
Contributor
facebook-github-bot
left a comment
There was a problem hiding this comment.
teng-li has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Contributor
facebook-github-bot
left a comment
There was a problem hiding this comment.
teng-li has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
PenghuiCheng
pushed a commit
to PenghuiCheng/pytorch
that referenced
this pull request
Sep 11, 2018
…nd CI (pytorch#10932) Summary: The previous NCCL all gather doesn't work as expected. This is a fully working async version. Tested on both C++ and Python Frontend. Multi-node: ``` tengli@learnfair042:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ TMPFILE="/private/home/tengli/temp/tengli-test" RANK=0 WORLD_SIZE=2 ./ProcessGroupNCCLTest Multi-node world size: 2 rank: 0 Allreduce test successful Broadcast test successful Reduce test successful Allgather test successful tengli@learnfair117:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ TMPFILE="/private/home/tengli/temp/tengli-test" RANK=1 WORLD_SIZE=2 ./ProcessGroupNCCLTest Multi-node world size: 2 rank: 1 Allreduce test successful Broadcast test successful Reduce test successful Allgather test successful ``` CI test: ``` test_set_get (__main__.FileStoreTest) ... ok test_set_get (__main__.PrefixFileStoreTest) ... ok test_set_get (__main__.PrefixTCPStoreTest) ... ok test_allreduce_ops (__main__.ProcessGroupGlooTest) ... ok test_broadcast_ops (__main__.ProcessGroupGlooTest) ... ok test_allgather_ops (__main__.ProcessGroupNCCLTest) ... ok test_allreduce_ops (__main__.ProcessGroupNCCLTest) ... ok test_broadcast_ops (__main__.ProcessGroupNCCLTest) ... ok test_reduce_ops (__main__.ProcessGroupNCCLTest) ... ok test_common_errors (__main__.RendezvousFileTest) ... ok test_nominal (__main__.RendezvousFileTest) ... ok test_common_errors (__main__.RendezvousTCPTest) ... ok test_nominal (__main__.RendezvousTCPTest) ... ok test_unknown_handler (__main__.RendezvousTest) ... ok test_set_get (__main__.TCPStoreTest) ... ok ``` Pull Request resolved: pytorch#10932 Differential Revision: D9542067 Pulled By: teng-li fbshipit-source-id: 25513eddcc3119fd736875d69dfb631b10f4ac86
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The previous NCCL all gather doesn't work as expected. This is a fully working async version. Tested on both C++ and Python Frontend.
Multi-node:
CI test: