Conversation
Previous PR, which added support for variable batch sizes across processes, doesn't account for mean_dy/mean_dy_xmu on the backward path. This produces a wrong dgrad in those cases.
I have repro code and have verified the fix on my local machine.
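The gradient bug can be illustrated with a small, self-contained sketch (pure Python with hypothetical values; not the actual CUDA kernel or the repro code from this PR). With unequal per-process batch sizes, the global mean of dy must be weighted by each process's element count; taking an unweighted average of the per-process means gives a different, incorrect result:

```python
# Hypothetical illustration: aggregating mean_dy across processes when
# batch sizes differ. Each "process" contributes (sum_of_dy, count).

def global_mean(per_process_sums, per_process_counts):
    # Correct: divide the all-reduced sum of dy by the total count.
    return sum(per_process_sums) / sum(per_process_counts)

def naive_mean(per_process_sums, per_process_counts):
    # Incorrect when counts differ: unweighted average of local means.
    local_means = [s / c for s, c in zip(per_process_sums, per_process_counts)]
    return sum(local_means) / len(local_means)

# Two "processes" with batch sizes 2 and 4 (made-up dy sums).
sums, counts = [2.0, 8.0], [2, 4]
print(global_mean(sums, counts))  # 10/6 ≈ 1.6667
print(naive_mean(sums, counts))   # (1.0 + 2.0)/2 = 1.5 — diverges
```

With equal counts the two formulas coincide, which is why the bug only surfaces when batch sizes vary across processes.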
💊 Build failures summary (as of commit 8a0c630): the XLA job pytorch_xla_linux_xenial_py3_6_clang7_test is failing.
cc'ing @ptrblck
Test added, should be good for review now. @ngimel |
ngimel
left a comment
Looks good, I have minor comments.
facebook-github-bot
left a comment
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Previous PR pytorch#22248, which added support for variable batch sizes across processes, doesn't account for mean_dy/mean_dy_xmu on the backward path, which produces a wrong dgrad. Pull Request resolved: pytorch#36382 Differential Revision: D20984446 Pulled By: ngimel fbshipit-source-id: 80066eee83760b275d61e2cdd4e86facca5577fd
Summary: Pull Request resolved: #66 The sync_bn batch norm has been using an inefficient implementation, NaiveSyncBatchNorm. This is because in pytorch<=1.5, nn.SyncBatchNorm had an incorrect gradient when the batch size differed across workers, so NaiveSyncBatchNorm was implemented instead. This issue has been fixed in pytorch/pytorch#36382, so we change the implementation back to the faster nn.SyncBatchNorm. Since nn.SyncBatchNorm only has a GPU implementation, the original NaiveSyncBatchNorm is still used for the unit tests that run with a single process on CPU. Reviewed By: wat3rBro Differential Revision: D35300913 fbshipit-source-id: ec8107b78bd2a4be5e9452e6aeac6b89f0f04930
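For reference, the core idea behind a "naive" sync batch norm can be sketched in a single process (pure Python with made-up worker batches; the real implementation all-reduces tensors across GPUs rather than summing Python lists): each worker contributes sum(x) and sum(x²), and the reduced totals yield the global mean and variance used to normalize.

```python
# Hypothetical single-process simulation of the NaiveSyncBatchNorm idea.
# A plain sum over workers stands in for the distributed all-reduce.

def sync_bn_stats(worker_batches):
    total = sum(len(b) for b in worker_batches)                 # all-reduced count
    s = sum(sum(b) for b in worker_batches)                     # all-reduced sum(x)
    sq = sum(sum(v * v for v in b) for b in worker_batches)     # all-reduced sum(x^2)
    mean = s / total
    var = sq / total - mean * mean                              # E[x^2] - E[x]^2
    return mean, var

# Two "workers" with unequal batch sizes 2 and 4.
mean, var = sync_bn_stats([[1.0, 3.0], [2.0, 4.0, 6.0, 8.0]])
print(mean, var)
```

Because the statistics are computed from count-weighted global sums, this formulation stays correct with unequal per-worker batch sizes; the trade-off is extra communication and no use of the fused cuDNN/CUDA batch-norm kernels, which is why nn.SyncBatchNorm is faster.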