Fix SyncBatchNorm running var update issue #22248
unlimblue wants to merge 16 commits into pytorch:master from unlimblue:master
Conversation
@pytorchbot retest this please
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@mrshenli could you retest pytorch_xla_linux_trusty_py3_6_gcc5_4_build again?
mrshenli left a comment
Thanks for contributing! Can you add a test in test/test_distributed.py to cover different input sizes?
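For reference, a minimal sketch of what such a test might look like, assuming the usual test/test_distributed.py multiprocess harness (process group already initialized, one CUDA device per rank); skip decorators and DDP wiring are elided, and `momentum=1.0` is used so the running stats equal the current batch stats:

```python
# Sketch only: assumes the standard test/test_distributed.py harness.
import torch
import torch.distributed as dist

def _test_sync_batchnorm_different_input_sizes(self):
    rank = dist.get_rank()
    # momentum=1.0 makes running stats equal the current batch stats,
    # which keeps the reference computation simple
    bn = torch.nn.SyncBatchNorm(4, momentum=1.0).cuda(rank)
    # each rank gets a different batch size: 2, 4, 6, ...
    inp = torch.randn(2 * (rank + 1), 4, 8, 8, device=rank)
    bn(inp)
    # reference moments, reduced over all ranks
    n = torch.tensor([inp.numel() / 4.0], device=rank)
    s = inp.sum(dim=(0, 2, 3))
    sq = (inp * inp).sum(dim=(0, 2, 3))
    for t in (n, s, sq):
        dist.all_reduce(t)
    mean = s / n
    unbiased_var = (sq / n - mean * mean) * (n / (n - 1))
    self.assertTrue(torch.allclose(bn.running_mean, mean, atol=1e-5))
    self.assertTrue(torch.allclose(bn.running_var, unbiased_var, atol=1e-5))
```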
```diff
  CUDA: batch_norm_elemt_cuda

- func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, int count) -> (Tensor, Tensor)
+ func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor)
```
Will this cause any backward compatibility issue? It's true that this method is only used in torch/nn/modules/_functions.py in PyTorch, but we do expose it as an API. I am a little worried it is used in other projects. @gchanan is it OK to break BC here? Or should we keep both APIs, where a single count arg is automatically expanded to a list?
Copy that. Is it possible to change const Tensor& counts to const ArrayRef<index_t> counts? I don't know the best way to convert an ArrayRef<index_t> to a PyTorch Tensor in C++.
Thanks, @hux999 is trying to fix the Tensor counts issue.
> I don't know the best way to convert an ArrayRef<index_t> to a PyTorch Tensor in C++.

Can you try using the from_blob API (learned from @yf225)? See the forum post.
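A minimal sketch of that approach, assuming the counts are an IntArrayRef backed by int64_t data (the helper name is illustrative):

```cpp
#include <torch/torch.h>

// Illustrative helper: wrap an IntArrayRef's buffer in a Tensor.
// from_blob does not take ownership of the memory, so clone() before
// the source buffer can go out of scope.
at::Tensor counts_to_tensor(at::IntArrayRef counts) {
  return torch::from_blob(
             const_cast<int64_t*>(counts.data()),
             {static_cast<int64_t>(counts.size())},
             torch::kLong)
      .clone();
}
```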
Use IntArrayRef instead of Tensor for the type of counts
@pytorchbot retest this please
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…uple of ints, but found element of type list at pos 1
We have found the reason and are trying to fix it.
@mrshenli is there some CI cache issue?

```
Jul 01 06:22:58 ======================================================================
Jul 01 06:22:58 ERROR: test_clip_grad_norm (__main__.TestNN)
Jul 01 06:22:58 ----------------------------------------------------------------------
Jul 01 06:22:58 Traceback (most recent call last):
Jul 01 06:22:58   File "test_nn.py", line 1829, in test_clip_grad_norm
Jul 01 06:22:58     norm = clip_grad_norm_(l.parameters(), max_norm, norm_type=norm_type)
Jul 01 06:22:58   File "/opt/conda/lib/python3.6/site-packages/torch/nn/utils/clip_grad.py", line 36, in clip_grad_norm_
Jul 01 06:22:58     clip_coef = torch.tensor(max_norm, device=device) / (total_norm + 1e-6)
Jul 01 06:22:58 UnboundLocalError: local variable 'device' referenced before assignment
Jul 01 06:22:58
Jul 01 06:22:58 ----------------------------------------------------------------------
```
@unlimblue that's not your fault. Other PRs (e.g., #22392) also hit this error.
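For context on that traceback: `device` in clip_grad_norm_ is only assigned inside the loop over parameters, so a path that never enters the loop (e.g., an empty parameter list) leaves it unbound by the time the clip_coef line runs. A hedged sketch of a guarded version, not the actual fix that landed:

```python
import torch

def clip_grad_norm_sketch(parameters, max_norm, norm_type=2.0):
    # materialize and filter, so an empty list is detectable up front
    grads = [p.grad for p in parameters if p.grad is not None]
    if len(grads) == 0:
        return torch.tensor(0.0)  # guard: nothing to clip, device never needed
    device = grads[0].device
    total_norm = torch.norm(
        torch.stack([torch.norm(g.detach(), norm_type).to(device) for g in grads]),
        norm_type)
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for g in grads:
            g.detach().mul_(clip_coef.to(g.device))
    return total_norm
```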
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
```diff
  CUDA: batch_norm_elemt_cuda

- func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, int count) -> (Tensor, Tensor)
+ func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, int[] counts) -> (Tensor, Tensor)
```
I believe this is still BC-breaking. As mentioned above, can you try keeping both the original one and the new one?
@mrshenli you should probably also deprecate the old one so we can get rid of it in the future.
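One way to keep backward compatibility along those lines, sketched below: retain the scalar-count entry point as a thin, deprecated shim that expands the single count and forwards to the counts-based overload. The `_with_counts` name is illustrative, not the final API:

```cpp
#include <ATen/ATen.h>

// Hypothetical new overload taking per-process counts (declared here
// for self-containment; the real name/signature may differ).
std::tuple<at::Tensor, at::Tensor> batch_norm_gather_stats_with_counts(
    const at::Tensor&, const at::Tensor&, const at::Tensor&,
    const at::Tensor&, const at::Tensor&, double, double, const at::Tensor&);

// Hypothetical BC shim: old callers pass one int64_t count, meaning
// every process contributed the same number of elements. Expand it to
// a per-process counts tensor and forward to the new overload.
std::tuple<at::Tensor, at::Tensor> batch_norm_gather_stats_compat(
    const at::Tensor& self, const at::Tensor& mean, const at::Tensor& invstd,
    const at::Tensor& running_mean, const at::Tensor& running_var,
    double momentum, double eps, int64_t count) {
  // one entry per gathered per-process mean, i.e. world size
  auto counts = at::full({mean.size(0)}, count,
                         self.options().dtype(at::kLong));
  return batch_norm_gather_stats_with_counts(
      self, mean, invstd, running_mean, running_var, momentum, eps, counts);
}
```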
There is an issue tracking the test error: #22052. Ignore it.
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot rebase this please
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
mrshenli left a comment
Thanks for contributing!
Summary:

## Fix pytorch/pytorch#22192

+ change signature: `func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor)`
+ change cuda & cuda header: the trailing `int64_t count` parameter becomes `const Tensor& counts`

```cuda
std::tuple<Tensor, Tensor> batch_norm_gather_stats_cuda(
    const Tensor& self, const Tensor& mean, const Tensor& invstd,
    const Tensor& running_mean, const Tensor& running_var,
    double momentum, double epsilon, const Tensor& counts)
```

+ change python interface

```python
class SyncBatchNorm(Function):
    def forward(self, input, weight, bias, running_mean, running_var,
                eps, momentum, process_group, world_size):
        ...
```

Pull Request resolved: pytorch/pytorch#22248
Differential Revision: D16002146
Pulled By: mrshenli
fbshipit-source-id: 9007e83928267b89df4d3847aabfbdb63e456956
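To illustrate the python interface change, a loose sketch of how the Function's forward might gather per-process counts alongside mean/invstd. Names follow torch/nn/modules/_functions.py loosely, and the gather call matches this PR's Tensor-counts signature; the exact details are assumptions:

```python
import torch
import torch.distributed as dist

# Loose sketch of the forward path: each process contributes its own
# element count, so ranks with different batch sizes are weighted
# correctly when the global statistics are gathered.
def sync_bn_forward_sketch(input, running_mean, running_var,
                           eps, momentum, process_group, world_size):
    # local count: N * H * W elements per channel
    count = torch.full((1,), input.numel() // input.size(1),
                       dtype=torch.long, device=input.device)
    mean, invstd = torch.batch_norm_stats(input, eps)

    count_all = [torch.empty_like(count) for _ in range(world_size)]
    mean_all = [torch.empty_like(mean) for _ in range(world_size)]
    invstd_all = [torch.empty_like(invstd) for _ in range(world_size)]
    dist.all_gather(count_all, count, group=process_group)
    dist.all_gather(mean_all, mean, group=process_group)
    dist.all_gather(invstd_all, invstd, group=process_group)

    # global stats weighted by per-process counts (the new counts argument)
    mean, invstd = torch.batch_norm_gather_stats(
        input, torch.stack(mean_all), torch.stack(invstd_all),
        running_mean, running_var, momentum, eps, torch.cat(count_all))
    return mean, invstd
```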
Summary: Previous PR #22248, which added support for variadic batch sizes across processes, doesn't account for the mean_dy/mean_dy_xmu on the backward path, which produces wrong dgrad.

Pull Request resolved: #36382
Differential Revision: D20984446
Pulled By: ngimel
fbshipit-source-id: 80066eee83760b275d61e2cdd4e86facca5577fd
Fix #22192
func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor)
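For completeness, a plain-Python reference sketch of how per-process counts enter the reduction and the running_var update that issue #22192 is about. This mirrors what the CUDA kernel plausibly computes; the details are an assumption, not the kernel itself:

```python
import torch

def gather_stats_reference(mean_all, invstd_all, counts, eps, momentum,
                           running_mean=None, running_var=None):
    # mean_all, invstd_all: (world_size, C); counts: (world_size,)
    counts = counts.to(mean_all.dtype)
    total = counts.sum()
    mean = (mean_all * counts[:, None]).sum(0) / total
    # recover per-process biased variance from invstd = 1/sqrt(var + eps)
    var = 1.0 / invstd_all.pow(2) - eps
    # sum of squared deviations from the global mean:
    #   sum_i n_i * (var_i + mean_i^2) - n_total * mean^2
    var_sum = ((var + mean_all.pow(2)) * counts[:, None]).sum(0) \
        - total * mean.pow(2)
    invstd = torch.rsqrt(var_sum / total + eps)
    if running_mean is not None:
        running_mean.mul_(1 - momentum).add_(momentum * mean)
    if running_var is not None:
        # the crux of #22192: the unbiased update must divide by the
        # total element count across all processes, minus one
        running_var.mul_(1 - momentum).add_(momentum * var_sum / (total - 1))
    return mean, invstd
```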