Normalize gradients before reduction in DistributedDataParallelC10d #11109
myleott wants to merge 1 commit into pytorch:master from myleott:export-D9594708
Conversation
apaszke
left a comment
Normalization should happen on the coalesced buffers instead of individual parameters
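A rough sketch of the distinction the review is pointing at (illustrative only, not this PR's diff; the flattening helper is from torch._utils): scaling each parameter's gradient separately launches one kernel per tensor, whereas scaling the coalesced buffer does the normalization in a single operation before the reduction.

```python
from torch._utils import _flatten_dense_tensors

def normalize_per_parameter(grads, world_size):
    # one division kernel per gradient tensor
    for g in grads:
        g.div_(world_size)

def normalize_coalesced(grads, world_size):
    # flatten into one buffer and scale it once, as the review suggests
    flat = _flatten_dense_tensors(grads)
    flat.div_(world_size)
    return flat
```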
facebook-github-bot
left a comment
myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
It would be super dope if you added a test for this, so that we don't regress on this in the future.
facebook-github-bot
left a comment
myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot retest this please
@myleott agreeing with the comment above, it's super risky to make any DDP change right before our release
I added the test yesterday :) But also this is a pretty trivial change, and without it FP16 distributed training is much, much worse, so I definitely think we should get it in before the release.
apaszke
left a comment
This might be important for stability and has a test now, so I'd vote to merge it before the release.
facebook-github-bot
left a comment
myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot
left a comment
myleott is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Normalize gradients before reduction in DistributedDataParallelC10d (pytorch#11109)

Summary: Normalizing by the world size before the reduction is less likely to cause overflow in FP16 training.

Pull Request resolved: pytorch#11109
Differential Revision: D9594708
Pulled By: myleott
fbshipit-source-id: 93ab53cb782ee1cbe1264e529b333490a0940338
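For context, a minimal sketch of the idea behind the change (assuming an initialized torch.distributed process group; this is not the DistributedDataParallelC10d implementation itself, and the function name is hypothetical): dividing by the world size before the all-reduce keeps each rank's contribution small, so the FP16 sum produced by the reduction is already the mean and is far less likely to overflow than summing first and dividing afterwards.

```python
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def allreduce_mean_grads(grads, world_size):
    # Coalesce the gradients into a single flat buffer.
    flat = _flatten_dense_tensors(grads)
    # Scale *before* the reduction; summing the scaled values yields the mean.
    flat.div_(world_size)
    dist.all_reduce(flat)
    # Copy the synchronized values back into the original gradient tensors.
    for g, synced in zip(grads, _unflatten_dense_tensors(flat, grads)):
        g.copy_(synced)
```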