Implement autograd functions for c10d communication operations #40762
emcastillo wants to merge 6 commits into pytorch:master
Conversation
💊 CI failures summary and remediations — as of commit 2c01091 (more details on the Dr. CI page).
I have stopped working on this for a few days due to some other tasks, will resume ASAP :)
Hey @emcastillo, do you plan to continue working on this, or is this task up for grabs? No rush, just want to quickly check with you about the plan.
Hi @mrshenli,
emcastillo force-pushed from 514e234 to cada738
emcastillo force-pushed from cada738 to 7e1a528
Hi @mrshenli, can you please take a look? Thanks!
emcastillo force-pushed from 7e1a528 to 0e15dff
I think I will support these calls in this PR and add the multi-GPU calls later. Thanks!
Codecov Report

@@            Coverage Diff             @@
##           master   #40762      +/-   ##
==========================================
- Coverage   80.69%   80.52%   -0.17%
==========================================
  Files        1905     1906       +1
  Lines      206789   206901     +112
==========================================
- Hits       166873   166613     -260
- Misses      39916    40288     +372
albanD left a comment
From the autograd point of view this looks good.
Some details about whether inputs are modified in place or just read, to make sure your function is not "simplified away", but nothing crucial.
A few side questions though:
- How is this going to be used?
- I guess this is going to come, but we definitely need tests for these. gradcheck and gradgradcheck should work fine here? (See the sketch after this list.)
- It would be nice if you could add type annotations for the user-facing API (it would also make the code easier to read, because I was expecting src to be a Tensor in these functions but it is not).
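For reference, a minimal sketch of the kind of test being asked for. It assumes the `torch.distributed.nn.all_gather` entry point this PR adds and a single-process gloo group where the collective degenerates to an identity; the environment variables and the stacking step are illustrative choices, not part of the PR.

```python
import os
import torch
import torch.distributed as dist
import torch.distributed.nn  # exposes the differentiable collectives
from torch.autograd import gradcheck, gradgradcheck

# Single-process gloo group: enough to exercise forward, backward, and
# double-backward without launching multiple workers.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

x = torch.randn(4, dtype=torch.double, requires_grad=True)

def fn(t):
    # all_gather returns one tensor per rank; stack them so gradcheck
    # differentiates a single Tensor output.
    return torch.stack(list(torch.distributed.nn.all_gather(t)))

assert gradcheck(fn, (x,))
assert gradgradcheck(fn, (x,))  # double-differentiability
dist.destroy_process_group()
```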
@albanD thanks for the awesome review! I am looking into all the comments and will update it soon :)
Sorry for dropping the ball on this, will review today.
emcastillo force-pushed from 7d5167c to 24f644a
I think the failures are unrelated now.
mrshenli left a comment
LGTM! Thanks for adding this!!!
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
albanD left a comment
Do I read correctly that none of the Functions ever change their inputs in place, right?
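To make the question concrete, here is a hypothetical reduced version of the pattern at issue: forward only reads its input and writes into freshly allocated outputs, so `ctx.mark_dirty()` is never needed. `_AllGatherLike` is illustrative, not the PR's actual class.

```python
import torch

class _AllGatherLike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        # A real implementation would launch the c10d collective here;
        # the clones stand in for freshly allocated per-rank buffers,
        # so `inp` is only read, never modified in place.
        return inp.clone(), inp.clone()

    @staticmethod
    def backward(ctx, *grad_outputs):
        # Every rank's output received a copy of `inp`, so the incoming
        # gradients are summed into the single gradient for `inp`.
        return torch.stack(grad_outputs).sum(dim=0)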
nit: Is this import actually necessary?
Do you actually need this import, since you access them through torch below?
torch.distributed.nn can't be accessed if it's not directly imported :(
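A small sketch of the behavior being described; this is standard Python submodule semantics rather than anything specific to this PR:

```python
import torch

# Without an explicit submodule import, attribute access fails:
#   torch.distributed.nn.all_gather
#   AttributeError: module 'torch.distributed' has no attribute 'nn'

import torch.distributed.nn  # runs torch/distributed/nn/__init__.py

# With a process group initialized, the call now resolves:
#   torch.distributed.nn.all_gather(tensor)
```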
Is that expected? Or something we should fix?
I have no idea and left it as it was originally devised. I don't mind fixing it in this PR if you think it's OK, or we can open a new PR to discuss this.
Thanks! I will address all the comments during the weekend.
I do agree this is beyond the scope of this PR. Just wondering if it was an oversight or a design choice :D @mrshenli?
Are there any plans to consolidate this with the APIs in distributed_c10d.py?
emcastillo force-pushed from 24f644a to 22b78e4
This one should be ready for landing @mrshenli @albanD
LGTM, thanks for the update!
@mrshenli you already have a diff for this, so I'll let you do the land.
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
Thank you everyone!
…96) We provide convenience methods for some of the communication collectives; one main reason is that they work in both distributed and non-distributed code. However, in the non-distributed path we were still doing some work, such as moving data from CPU to GPU or vice versa. We can avoid that work entirely: reduce and gather ops should be no-ops if we only have one process. Another simplification is to remove the custom autograd function for gather. PyTorch added support for gradients in all_gather some time ago (pytorch/pytorch#40762), so we can just use the normal all_gather functions.
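A hedged sketch of the single-process fast path that commit message describes; `all_gather_if_needed` is a hypothetical wrapper, not an API from this PR:

```python
import torch
import torch.distributed as dist

def all_gather_if_needed(tensor):
    """Hypothetical convenience wrapper: with one process, gathering is
    the identity, so skip the collective (and any CPU<->GPU staging)."""
    if not dist.is_initialized() or dist.get_world_size() == 1:
        return [tensor]  # no-op path: gather over a single rank
    import torch.distributed.nn
    return list(torch.distributed.nn.all_gather(tensor))
```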
…ch#40762) Summary: Closes pytorch#40702, fixes pytorch#40690. Currently WIP, but I would appreciate some feedback. The functions should be double-differentiable. Contrary to https://github.com/pytorch/pytorch/blob/716b2a6d69546db2aa3e91cfd88e92350cf0bf46/torch/nn/parallel/_functions.py, this PR generates a list of tensors instead of aggregating the received data into a single tensor. Is this behavior correct? Thanks! Pull Request resolved: pytorch#40762 Reviewed By: glaringlee Differential Revision: D24758889 Pulled By: mrshenli fbshipit-source-id: 79285fb4b791cae3d248f34e2aadb11c9ab10cce
Closes #40702, fixes #40690.
Currently WIP, but I would appreciate some feedback. The functions should be double-differentiable.
Contrary to https://github.com/pytorch/pytorch/blob/b35cdc5200af963e410c0a25400fd07f30b89bca/torch/nn/parallel/_functions.py, this PR generates a list of tensors instead of aggregating the received data into a single tensor. Is this behavior correct?
Thanks!
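To illustrate the list-of-tensors design question, assuming an initialized process group as in the gradcheck sketch above:

```python
import torch
import torch.distributed.nn

x = torch.randn(4, requires_grad=True)
outs = torch.distributed.nn.all_gather(x)  # world_size tensors, one per rank
merged = torch.cat(list(outs), dim=0)      # callers can still aggregate
```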