Normalize gradients before reduction in DistributedDataParallelC10d #11109
myleott wants to merge 1 commit into pytorch:master from myleott:export-D9594708
Conversation
apaszke
left a comment
Normalization should happen on the coalesced buffers instead of individual parameters
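A rough sketch of the distinction the review is pointing at (illustrative only, not this PR's diff; the flattening helper is from torch._utils): scaling each parameter's gradient separately launches one kernel per tensor, whereas scaling the coalesced buffer does the normalization in a single operation before the reduction.

```python
from torch._utils import _flatten_dense_tensors

def normalize_per_parameter(grads, world_size):
    # one division kernel per gradient tensor
    for g in grads:
        g.div_(world_size)

def normalize_coalesced(grads, world_size):
    # flatten into one buffer and scale it once, as the review suggests
    flat = _flatten_dense_tensors(grads)
    flat.div_(world_size)
    return flat
```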
facebook-github-bot
left a comment
myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
It would be super dope if you added a test for this, so that we don't regress on this in the future.
facebook-github-bot
left a comment
myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot retest this please
@myleott agreeing with the comment above, it's super risky to make any DDP change right before our release
I added the test yesterday :) But also this is a pretty trivial change, and without it FP16 distributed training is much, much worse, so I definitely think we should get it in before the release.
apaszke
left a comment
This might be important for stability and has a test now, so I'd vote to merge it before the release.
facebook-github-bot
left a comment
myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot
left a comment
myleott is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Normalize gradients before reduction in DistributedDataParallelC10d (pytorch#11109)

Summary: Normalizing by the world size before the reduction is less likely to cause overflow in FP16 training.

Pull Request resolved: pytorch#11109
Differential Revision: D9594708
Pulled By: myleott
fbshipit-source-id: 93ab53cb782ee1cbe1264e529b333490a0940338
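For context, a minimal sketch of the idea behind the change (assuming an initialized torch.distributed process group; this is not the DistributedDataParallelC10d implementation itself, and the function name is hypothetical): dividing by the world size before the all-reduce keeps each rank's contribution small, so the FP16 sum produced by the reduction is already the mean and is far less likely to overflow than summing first and dividing afterwards.

```python
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def allreduce_mean_grads(grads, world_size):
    # Coalesce the gradients into a single flat buffer.
    flat = _flatten_dense_tensors(grads)
    # Scale *before* the reduction; summing the scaled values yields the mean.
    flat.div_(world_size)
    dist.all_reduce(flat)
    # Copy the synchronized values back into the original gradient tensors.
    for g, synced in zip(grads, _unflatten_dense_tensors(flat, grads)):
        g.copy_(synced)
```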