
[c10d] Distributed Data Parallel CPU module for C10D #11168

Closed
teng-li wants to merge 1 commit into pytorch:master from teng-li:DDPCPU

Conversation

@teng-li
Contributor

@teng-li teng-li commented Sep 1, 2018

Distributed Data Parallel CPU module for c10d. This is basically the same code as the Distributed Data Parallel CPU module for THD, since c10d now has the same front-end interface as torch.distributed.

We will keep both in the first release and remove the THD one once c10d is stable enough.

Tests are fully covered, just as for THD.
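For context, a minimal usage sketch of a CPU DDP module of this kind follows. The class name DistributedDataParallelCPU and the gloo backend setup are assumptions based on the THD counterpart described above, not code taken from this PR's diff:

```python
# Hypothetical usage sketch; the exact import path of the c10d-based
# module at the time of this PR may differ from what is shown here.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallelCPU  # assumed name

# Initialize the process group with a CPU-capable backend (gloo).
dist.init_process_group(backend="gloo", init_method="env://")

model = nn.Linear(10, 10)
ddp_model = DistributedDataParallelCPU(model)  # allreduces grads across ranks

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
criterion = nn.MSELoss()

inputs, targets = torch.randn(4, 10), torch.randn(4, 10)
optimizer.zero_grad()
loss = criterion(ddp_model(inputs), targets)
loss.backward()   # gradient allreduce happens as part of backward
optimizer.step()
```

Because the front-end interface matches torch.distributed, the same training script should look identical whether the process group is backed by THD or by c10d.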

@teng-li teng-li requested a review from pietern September 1, 2018 02:23
@teng-li teng-li added the oncall: distributed label Sep 1, 2018
@teng-li
Contributor Author

teng-li commented Sep 5, 2018

@pytorchbot retest this please

@pietern
Contributor

pietern commented Sep 5, 2018

This doesn't overlap communication with autograd the way the CUDA version does.

Do you plan to add this later? Not blocking, of course.
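To illustrate what overlapping means here: the CUDA version launches allreduce operations as individual gradients become ready during backward, rather than running a single blocking allreduce after backward finishes. A rough sketch of the hook-based idea, under the assumption of an initialized process group; the helper names are made up, and the real CUDA implementation additionally buckets gradients:

```python
# Illustrative sketch of overlapping gradient allreduce with backward;
# not the code from this PR. Assumes dist.init_process_group() was called.
# Helper names (attach_overlap_hooks, finish_allreduce) are hypothetical.
import torch.distributed as dist

def attach_overlap_hooks(model):
    """Register per-parameter hooks that start an async allreduce as
    soon as each gradient is produced during backward."""
    handles = []  # (work, grad) pairs for in-flight allreduces

    def make_hook():
        def hook(grad):
            # Non-blocking allreduce: communication overlaps with the
            # remainder of the backward pass. (Simplified: no bucketing;
            # in-place update races are ignored for clarity.)
            work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
            handles.append((work, grad))
            return grad
        return hook

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(make_hook())
    return handles

def finish_allreduce(handles, world_size):
    """Wait for outstanding allreduces, then average the gradients."""
    for work, grad in handles:
        work.wait()
        grad.div_(world_size)
    handles.clear()
```

After loss.backward() returns, a call like finish_allreduce(handles, dist.get_world_size()) would drain any remaining work; the CUDA module manages this internally through its own autograd hooks.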

@teng-li
Contributor Author

teng-li commented Sep 5, 2018

@pietern This was the version written by the open-source community. Depending on the need, we can add that later.

@pietern
Contributor

pietern commented Sep 5, 2018

Sounds good.

Contributor

@facebook-github-bot facebook-github-bot left a comment


teng-li has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

petrex pushed a commit to petrex/pytorch that referenced this pull request Sep 6, 2018
* upstream/master: (26 commits)
  cudnn 7 upgrade with spatialBN fix (pytorch#11291)
  Ignore FuseGraph Call on Windows (pytorch#11015)
  defer resolution of mkl to a cmake wrapper library (pytorch#11298)
  Cleanup dependency of distributed flags (pytorch#11221)
  Move minimal wrapdim functionality to core, remove THTensor include i… (pytorch#11283)
  Change includes from ATen/Storage.h to ATen/core/Storage.h (pytorch#11217)
  Fix scalar tensor assert in fusion compiler (pytorch#10952)
  Add dead code elimination pass (pytorch#10101)
  Distributed Data Parallel CPU module for C10D (pytorch#11168)
  Back out "[pt1][tensor] Add strides to caffe2::Tensor"
  Fix conv gradient conversion (pytorch#11312)
  Bag of clang tidy fixes for torch/csrc/ and torch/csrc/autograd (pytorch#11050)
  Sparse tensor printing; add NotImplemented autograd fn (pytorch#10181)
  Add convertToCaffe2Proto to python API
  fix doc for functional.dropout* (pytorch#10417)
  typo fix Tranpose2D -> Transpose2D (pytorch#11281)
  Remove THFinalizer
  Forward declarations of needed curand functions (pytorch#10911)
  nomnigraph - simplify core graph API and test (pytorch#11256)
  Small fixes to cppdocs for sync script (pytorch#11300)
  ...
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
Summary:
Distributed Data Parallel CPU module for c10d. This is basically the same code as the Distributed Data Parallel CPU module for THD, since c10d now has the same front-end interface as torch.distributed.

We will keep both in the first release and remove the THD one once c10d is stable enough.

Tests are fully covered, just as for THD.
Pull Request resolved: pytorch#11168

Differential Revision: D9674963

Pulled By: teng-li

fbshipit-source-id: ecf52a7189374ca7930c2be305218167fdd822a7
@ezyang ezyang added the merged label Jun 26, 2019

Labels

oncall: distributed (Add this issue/PR to distributed oncall triage queue)

4 participants