Skip to content

[c10d] PT1 Distributed Release MileStone No.1 - Completed Distributed Package and CI tests#10871

Closed
teng-li wants to merge 4 commits intopytorch:masterfrom
teng-li:c10d_release_frontend
Closed

[c10d] PT1 Distributed Release MileStone No.1 - Completed Distributed Package and CI tests#10871
teng-li wants to merge 4 commits intopytorch:masterfrom
teng-li:c10d_release_frontend

Conversation

@teng-li
Copy link
Contributor

@teng-li teng-li commented Aug 24, 2018

The PR includes:
(1) torch.distributed.c10d, which now includes the complete backward compatible frontend API for torch.distributed
(2) env:// init method functionality
(3) Minor change to test_distributed.py, which is now a test for torch.distributed.c10d.
(4) The old test_distributed.py' is now moved to test_distributed_thd`
(5) Miscellaneous bug fixes.
(6) DDP CPU test is removed since c10d doesn't have this support yet, but this is a very easy test after moving DDP CPU's dependency to torch.distributed.c10d.
(7) CI config to test MPI, NCCL, and Gloo backend of c10d

Now all the distributed test including c10d DDP can pass with the c10d frontend API

TODO: (in a separate PR)
MPI subgroup support, once this is added, CI group test will be enabled.

@teng-li teng-li added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Aug 24, 2018
@teng-li teng-li requested review from apaszke and pietern August 24, 2018 23:31
@teng-li teng-li force-pushed the c10d_release_frontend branch from 72eed12 to b6123c1 Compare August 24, 2018 23:49
@teng-li teng-li changed the title [c10d] Distributed Release MileStone - Completed Distributed Frontend and CI tests [c10d] Distributed Release MileStone No.1 - Completed Distributed Frontend and CI tests Aug 24, 2018
@teng-li teng-li force-pushed the c10d_release_frontend branch 3 times, most recently from b2bd6eb to a51cdca Compare August 25, 2018 01:22
@teng-li teng-li changed the title [c10d] Distributed Release MileStone No.1 - Completed Distributed Frontend and CI tests [c10d] PT1 Distributed Release MileStone No.1 - Completed Distributed Package and CI tests Aug 25, 2018
@teng-li teng-li force-pushed the c10d_release_frontend branch 2 times, most recently from 2b60fbf to ed9a2e6 Compare August 25, 2018 08:08
Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, aside from minor snafu. There is also the illegal memory access error for the NCCL tests.

This comment was marked as off-topic.

This comment was marked as off-topic.

@teng-li teng-li force-pushed the c10d_release_frontend branch 4 times, most recently from 86e12c1 to 2080b18 Compare August 28, 2018 05:55
@pietern
Copy link
Contributor

pietern commented Aug 28, 2018

Assuming this will turn green when #10932 is merged?

@teng-li
Copy link
Contributor Author

teng-li commented Aug 28, 2018

@pietern One more thing to do, is to fix the c10d ddp test, and hopefully it will be green

@teng-li teng-li force-pushed the c10d_release_frontend branch from 2080b18 to 7334a63 Compare August 28, 2018 18:51
@teng-li teng-li force-pushed the c10d_release_frontend branch from 7334a63 to 101af10 Compare August 28, 2018 20:37
@teng-li
Copy link
Contributor Author

teng-li commented Aug 29, 2018

@pytorchbot retest this please

@teng-li teng-li force-pushed the c10d_release_frontend branch 2 times, most recently from 3f76d1f to 1e7fe5e Compare August 29, 2018 06:04
@teng-li teng-li force-pushed the c10d_release_frontend branch 6 times, most recently from d2a2877 to da05147 Compare August 29, 2018 08:13
@teng-li teng-li force-pushed the c10d_release_frontend branch from da05147 to a89aa1d Compare August 29, 2018 08:30
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

teng-li has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

teng-li has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
…e and CI tests (pytorch#10871)

Summary:
The PR includes:
(1) torch.distributed.c10d, which now includes the complete backward compatible frontend API for `torch.distributed`
(2) `env://` init method functionality
(3) Minor change to `test_distributed.py`, which is now a test for `torch.distributed.c10d`.
(4) The old `test_distributed.py' is now moved to `test_distributed_thd`
(5) Miscellaneous bug fixes.
(6) DDP CPU test is removed since c10d doesn't have this support yet, but this is a very easy test after moving DDP CPU's dependency to torch.distributed.c10d.
(7) CI config to test MPI, NCCL, and Gloo backend of c10d

**Now all the distributed test including c10d DDP can pass with the c10d frontend API**

TODO: (in a separate PR)
MPI subgroup support, once this is added, CI group test will be enabled.
Pull Request resolved: pytorch#10871

Differential Revision: D9554514

Pulled By: teng-li

fbshipit-source-id: fb686ad42258526c8b4372148e82969fac4f42dd
@ezyang ezyang added the merged label Jun 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants