[c10d] Added Reduce,AllGather,Gather,Scatter Ops for NCCL and MPI process groups by teng-li · Pull Request #10058 · pytorch/pytorch

teng-li · 2018-07-31T06:06:47Z

Added

- Reduce (both NCCL and MPI)
- AllGather (both NCCL and MPI)
- Gather (MPI)
- Scatter (MPI)

for c10d process groups. This basically finalizes all supported ops for C10d to match THD.

All ops are tested as well.

mpirun -np 8 ./ProcessGroupMPITest
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful

./ProcessGroupNCCLTest
Allreduce test successful
Broadcast test successful
Reduce test successful
Allgather test successful
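For readers less familiar with the four collectives named in the description, here is a minimal single-process sketch of what each one computes. It is an illustration only and does not use the c10d, MPI, or NCCL APIs: `Buffer`, `World`, `reduceSum`, `gather`, `allGather`, and `scatter` are hypothetical names, each rank is simulated as one entry of a vector, and sum is assumed as the reduction operation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins: one integer buffer per rank, all in one process.
using Buffer = std::vector<int>;
using World = std::vector<Buffer>;  // World[r] is the buffer owned by rank r

// Reduce (sum assumed): the root ends up with the element-wise sum of all
// ranks' buffers. The return value is what `root` would hold afterwards.
Buffer reduceSum(const World& inputs, std::size_t root) {
  assert(!inputs.empty() && root < inputs.size());
  Buffer result(inputs[0].size(), 0);
  for (const Buffer& in : inputs) {
    assert(in.size() == result.size());  // all ranks contribute equal sizes
    for (std::size_t i = 0; i < in.size(); ++i) {
      result[i] += in[i];
    }
  }
  return result;
}

// Gather: the root receives every rank's buffer, ordered by rank.
World gather(const World& inputs, std::size_t root) {
  assert(root < inputs.size());
  return inputs;
}

// AllGather: like Gather, but every rank receives the full ordered list.
// Element r of the result is what rank r would hold afterwards.
std::vector<World> allGather(const World& inputs) {
  return std::vector<World>(inputs.size(), inputs);
}

// Scatter: the root starts with one chunk per rank; rank r receives chunk r.
World scatter(const World& chunksAtRoot) {
  return chunksAtRoot;
}
```

In a real process group each rank would only see its own slice of these values, and the data would move over MPI or NCCL rather than being copied within one address space.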


torch/lib/c10d/CMakeLists.txt

 install(TARGETS c10d ARCHIVE DESTINATION lib)

-option(BUILD_EXAMPLES "Build examples" OFF)
+option(BUILD_EXAMPLES "Build examples" ON)


torch/lib/c10d/ProcessGroupMPI.cpp

+      throw std::runtime_error("Tensors are not equal in size or data type");
+    }
+    std::vector<at::Tensor> temp{tensors[i]};
+    checkSingleTensor(temp);


torch/lib/c10d/ProcessGroupMPI.cpp

+
+  std::function<void(std::unique_ptr<WorkEntry>&)> runFunc =
+      [opts, this](std::unique_ptr<WorkEntry>& entry) {
+        auto data = (*entry->src)[0];


torch/lib/c10d/ProcessGroupMPI.cpp

+  if (outputTensors.size() != 1) {
+    throw std::runtime_error(
+        "MPI process group only supports a single "
+        "tensor op");


torch/lib/c10d/ProcessGroupMPI.cpp

+    }
+  } else {
+    if (outputTensors.size() != 1) {
+      throw std::runtime_error("Gather: only single tensor op supported");


facebook-github-bot

teng-li has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


…ts (#10159)

Summary: Provided Python bindings for these four ops. Also provided an NCCL binding test. Based on #10058.

Please only review init.cpp and the test file.

Pull Request resolved: #10159
Reviewed By: yf225
Differential Revision: D9323192
Pulled By: teng-li
fbshipit-source-id: b03822009d3a785ec36fecce2fc3071d23f9994e

…oups (pytorch#10058)

Summary: Added

- Reduce (both NCCL and MPI)
- AllGather (both NCCL and MPI)
- Gather (MPI)
- Scatter (MPI)

for c10d process groups. This basically finalizes all supported ops for C10d to match THD.

All ops are tested as well.

```
mpirun -np 8 ./ProcessGroupMPITest
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
```

```
./ProcessGroupNCCLTest
Allreduce test successful
Broadcast test successful
Reduce test successful
Allgather test successful
```

Pull Request resolved: pytorch#10058
Reviewed By: yf225
Differential Revision: D9316312
Pulled By: teng-li
fbshipit-source-id: 6a6253268d34332327406b1f87335d1402f7133f


teng-li requested a review from apaszke July 31, 2018 06:06

teng-li requested a review from pietern as a code owner July 31, 2018 06:06

teng-li changed the title ~~[c10d] Added Reduce/Gather/Scatter/AllGather Ops for NCCL and MPI process groups~~ [c10d] Added Reduce,AllGather,Gather,Scatter Ops for NCCL and MPI process groups Jul 31, 2018

[c10d] Added Reduce,AllGather,Gather,Scatter Ops for NCCL and MPI process groups

33d9311

teng-li force-pushed the pg_nccl_ops branch from 49778a0 to 33d9311 Compare July 31, 2018 06:09

teng-li added the oncall: distributed label Jul 31, 2018

Minor fixes

f225f13

teng-li mentioned this pull request Aug 2, 2018

[c10d] Python binding for reduce,allgather,scatter,gather ops and python tests #10159

Closed

apaszke approved these changes Aug 7, 2018


Addressed comments

de2fd21

facebook-github-bot reviewed Aug 14, 2018


teng-li force-pushed the pg_nccl_ops branch from a6faacf to 937cade Compare August 14, 2018 18:08

Fixed lint warnings and test failures

5f795f4

teng-li force-pushed the pg_nccl_ops branch from 937cade to 5f795f4 Compare August 14, 2018 18:10


facebook-github-bot closed this in 3f3a30f Aug 14, 2018

ezyang added the merged label Jun 26, 2019
