[c10d] PT1 Distributed Release MileStone No.1 - Completed Distributed Package and CI tests#10871
teng-li wants to merge 4 commits into pytorch:master
pietern
left a comment
LGTM, aside from a minor snafu. There is also the illegal memory access error in the NCCL tests.
torch/lib/c10d/ProcessGroupMPI.cpp
Assuming this will turn green when #10932 is merged?
@pietern One more thing to do is to fix the c10d DDP test; hopefully it will be green after that.
@pytorchbot retest this please
facebook-github-bot
left a comment
teng-li has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…e and CI tests (pytorch#10871)

Summary:
The PR includes:
(1) torch.distributed.c10d, which now includes the complete backward compatible frontend API for `torch.distributed`
(2) `env://` init method functionality
(3) Minor change to `test_distributed.py`, which is now a test for `torch.distributed.c10d`.
(4) The old `test_distributed.py` is now moved to `test_distributed_thd`
(5) Miscellaneous bug fixes.
(6) DDP CPU test is removed since c10d doesn't have this support yet, but this is a very easy test after moving DDP CPU's dependency to torch.distributed.c10d.
(7) CI config to test MPI, NCCL, and Gloo backend of c10d

**Now all the distributed test including c10d DDP can pass with the c10d frontend API**

TODO: (in a separate PR)
MPI subgroup support, once this is added, CI group test will be enabled.

Pull Request resolved: pytorch#10871
Differential Revision: D9554514
Pulled By: teng-li
fbshipit-source-id: fb686ad42258526c8b4372148e82969fac4f42dd
The PR includes:
(1) `torch.distributed.c10d`, which now includes the complete backward-compatible frontend API for `torch.distributed`.
(2) `env://` init method functionality.
(3) A minor change to `test_distributed.py`, which is now a test for `torch.distributed.c10d`.
(4) The old `test_distributed.py` is now moved to `test_distributed_thd`.
(5) Miscellaneous bug fixes.
(6) The DDP CPU test is removed since c10d doesn't have this support yet; it will be very easy to restore once DDP CPU's dependency is moved to `torch.distributed.c10d`.
(7) CI config to test the MPI, NCCL, and Gloo backends of c10d.

**Now all the distributed tests, including c10d DDP, pass with the c10d frontend API.**

TODO (in a separate PR):
MPI subgroup support. Once this is added, the CI group test will be enabled.
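
For readers unfamiliar with the `env://` init method in item (2), here is a minimal sketch of what such an initialization typically looks like. It is an illustration only, not code from this PR. It assumes the environment variable names documented for present-day `torch.distributed` (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`) and uses the present-day `torch.distributed.init_process_group` entry point, whereas this PR exposed the API under `torch.distributed.c10d`. The helper names `read_env_rendezvous` and `init_from_env` are hypothetical.

```python
import os

# Assumption: these are the variables an env:// rendezvous reads,
# as documented for present-day torch.distributed.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_env_rendezvous(environ=os.environ):
    """Collect and validate the settings an env:// style init would read.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    missing = [name for name in REQUIRED_VARS if name not in environ]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return {
        "master_addr": environ["MASTER_ADDR"],
        "master_port": int(environ["MASTER_PORT"]),
        "world_size": int(environ["WORLD_SIZE"]),
        "rank": int(environ["RANK"]),
    }


def init_from_env(backend="gloo"):
    """Initialize the default process group from environment variables.

    Assumption: uses the present-day torch.distributed entry point; at the
    time of this PR the frontend lived in torch.distributed.c10d.
    """
    import torch.distributed as dist  # imported lazily so the helper above works without torch

    read_env_rendezvous()  # fail early with a readable message
    dist.init_process_group(backend=backend, init_method="env://")
    return dist.get_rank(), dist.get_world_size()
```

Each process would be launched with its own `RANK` and a shared `MASTER_ADDR`, `MASTER_PORT`, and `WORLD_SIZE`, then call `init_from_env()` once before any collective operation.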