Implement autograd functions for c10d communication operations #40762
emcastillo wants to merge 6 commits into pytorch:master
Conversation
💊 CI failures summary and remediations — as of commit 2c01091 (more details on the Dr. CI page).
I have stopped working on this for a few days due to some other tasks, will resume ASAP :)
Hey @emcastillo, do you plan to continue working on this, or is this task up for grabs? No rush, just want to quickly check with you about the plan.
Hi @mrshenli,
emcastillo force-pushed from 514e234 to cada738
emcastillo force-pushed from cada738 to 7e1a528
Hi @mrshenli, can you please take a look? Thanks!
emcastillo force-pushed from 7e1a528 to 0e15dff
I think I will support these calls in this PR and add the multi-GPU calls later. Thanks!
Codecov Report

@@            Coverage Diff             @@
##           master   #40762      +/-   ##
==========================================
- Coverage   80.69%   80.52%   -0.17%
==========================================
  Files        1905     1906       +1
  Lines      206789   206901     +112
==========================================
- Hits       166873   166613     -260
- Misses      39916    40288     +372
albanD left a comment
From the autograd point of view this looks good.
Some details about whether inputs are modified in place or just read, to make sure your function is not "simplified away", but nothing crucial.
A few side questions though:
- How is this going to be used?
- I guess this is going to come, but we definitely need tests for these. gradcheck and gradgradcheck should work fine here? (See the sketch after this list.)
- It would be nice if you could add type annotations for the user-facing API (it would also make the code easier to read, because I was expecting src to be a Tensor in these functions but it is not).
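For reference, a minimal sketch of the kind of test being asked for. It assumes the `torch.distributed.nn.all_gather` entry point this PR adds and a single-process gloo group where the collective degenerates to an identity; the environment variables and the stacking step are illustrative choices, not part of the PR.

```python
import os
import torch
import torch.distributed as dist
import torch.distributed.nn  # exposes the differentiable collectives
from torch.autograd import gradcheck, gradgradcheck

# Single-process gloo group: enough to exercise forward, backward, and
# double-backward without launching multiple workers.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

x = torch.randn(4, dtype=torch.double, requires_grad=True)

def fn(t):
    # all_gather returns one tensor per rank; stack them so gradcheck
    # differentiates a single Tensor output.
    return torch.stack(list(torch.distributed.nn.all_gather(t)))

assert gradcheck(fn, (x,))
assert gradgradcheck(fn, (x,))  # double-differentiability
dist.destroy_process_group()
```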
@albanD thanks for the awesome review! I am looking into all the comments and will update it soon :)
Sorry for dropping the ball on this, will review today.
emcastillo force-pushed from 7d5167c to 24f644a
I think the failures are unrelated now.
mrshenli left a comment
LGTM! Thanks for adding this!!!
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
albanD left a comment
Do I read correctly that none of the Functions ever change their inputs in place, right?
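To make the question concrete, here is a hypothetical reduced version of the pattern at issue: forward only reads its input and writes into freshly allocated outputs, so `ctx.mark_dirty()` is never needed. `_AllGatherLike` is illustrative, not the PR's actual class.

```python
import torch

class _AllGatherLike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        # A real implementation would launch the c10d collective here;
        # the clones stand in for freshly allocated per-rank buffers,
        # so `inp` is only read, never modified in place.
        return inp.clone(), inp.clone()

    @staticmethod
    def backward(ctx, *grad_outputs):
        # Every rank's output received a copy of `inp`, so the incoming
        # gradients are summed into the single gradient for `inp`.
        return torch.stack(grad_outputs).sum(dim=0)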
nit: Is this import actually necessary?
Do you actually need this import, since you access them through torch below?
torch.distributed.nn can't be accessed if it's not directly imported :(
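A small sketch of the behavior being described; this is standard Python submodule semantics rather than anything specific to this PR:

```python
import torch

# Without an explicit submodule import, attribute access fails:
#   torch.distributed.nn.all_gather
#   AttributeError: module 'torch.distributed' has no attribute 'nn'

import torch.distributed.nn  # runs torch/distributed/nn/__init__.py

# With a process group initialized, the call now resolves:
#   torch.distributed.nn.all_gather(tensor)
```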
Is that expected? Or something we should fix?
I have no idea and left it as it was originally devised. I don't mind fixing it in this PR if you think it's OK, or we can open a new PR to discuss this.
Thanks! I will address all the comments during the weekend.
I do agree this is beyond the scope of this PR. Just wondering if it was an oversight or a design choice :D @mrshenli?
Are there any plans to consolidate this with the APIs in distributed_c10d.py?
emcastillo force-pushed from 24f644a to 22b78e4
This one should be ready for landing @mrshenli @albanD
LGTM, thanks for the update!
@mrshenli you already have a diff for this, so I'll let you do the land.
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
Thank you everyone!
…96) We provide convenience methods for some of the communication collectives; one main reason is that they work in both distributed and non-distributed code. However, in the non-distributed path we were still doing some work, such as moving data from CPU to GPU or vice versa. We can avoid that work entirely: reduce and gather ops should be no-ops if we only have one process. Another simplification is to remove the custom autograd function for gather. PyTorch added support for gradients in all_gather some time ago (pytorch/pytorch#40762), so we can just use the normal all_gather functions.
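A hedged sketch of the single-process fast path that commit message describes; `all_gather_if_needed` is a hypothetical wrapper, not an API from this PR:

```python
import torch
import torch.distributed as dist

def all_gather_if_needed(tensor):
    """Hypothetical convenience wrapper: with one process, gathering is
    the identity, so skip the collective (and any CPU<->GPU staging)."""
    if not dist.is_initialized() or dist.get_world_size() == 1:
        return [tensor]  # no-op path: gather over a single rank
    import torch.distributed.nn
    return list(torch.distributed.nn.all_gather(tensor))
```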
…ch#40762) Summary: Closes pytorch#40702, fixes pytorch#40690. Currently WIP, but I would appreciate some feedback. The functions should be double-differentiable. Contrary to https://github.com/pytorch/pytorch/blob/716b2a6d69546db2aa3e91cfd88e92350cf0bf46/torch/nn/parallel/_functions.py, this PR generates a list of tensors instead of aggregating the received data into a single tensor. Is this behavior correct? Thanks! Pull Request resolved: pytorch#40762 Reviewed By: glaringlee Differential Revision: D24758889 Pulled By: mrshenli fbshipit-source-id: 79285fb4b791cae3d248f34e2aadb11c9ab10cce
Closes #40702, fixes #40690.
Currently WIP, but I would appreciate some feedback. The functions should be double-differentiable.
Contrary to https://github.com/pytorch/pytorch/blob/b35cdc5200af963e410c0a25400fd07f30b89bca/torch/nn/parallel/_functions.py, this PR generates a list of tensors instead of aggregating the received data into a single tensor. Is this behavior correct?
Thanks!
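To illustrate the list-of-tensors design question, assuming an initialized process group as in the gradcheck sketch above:

```python
import torch
import torch.distributed.nn

x = torch.randn(4, requires_grad=True)
outs = torch.distributed.nn.all_gather(x)  # world_size tensors, one per rank
merged = torch.cat(list(outs), dim=0)      # callers can still aggregate
```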