Fix SyncBatchNorm for empty inputs #74944
mrshenli wants to merge 5 commits into gh/mrshenli/341/base from
Conversation
TODO: 1. avoid copying count_all to CPU if possible 2. it no longer crashes, but the output is NaN. Next step: move the fix into the CUDA kernel of `batch_norm_gather_stats_with_counts` accordingly. [ghstack-poisoned]
💊 CI failures summary, as of commit d5f20a8 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
datumbox left a comment:
Thanks for the change @mrshenli.
Overall the approach looks good to me; I've added minor comments for nits. I'm currently testing this patch on a cluster with real data, and it seems the problem is resolved. If something breaks, I'll let you know.
```python
    combined = torch.cat([mean, invstd, count], dim=0)
else:
    # for empty input, directly set all stats to 0
    combined = torch.zeros(
```
Wouldn't something like `torch.zeros(dtype=input.dtype, device=input.device).expand(2 * num_channels + 1)` also work and reduce the wasted bandwidth?
I'm not sure how RPC handles non-contiguous Tensors.
`torch.zeros(dtype=input.dtype, device=input.device).expand(2 * num_channels + 1)`
Curious, what bandwidth does the above code save? And why is RPC relevant here?
This `combined` Tensor is shared with all other nodes during the all_reduce below, right?
While the Tensor in the code today has 2 * num_channels + 1 elements that need to go over the wire, the expanded version has only 1 element. So if it is sent over the wire efficiently, you save a lot of bandwidth.
Oh I see. I'm not sure this is going to work, though. Collectives use ProcessGroup and call NCCL APIs under the hood. IIRC, NCCL expects contiguous tensors and directly reads numel() elements from the memory pointer. Let me double-check that.
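To illustrate the contiguity concern, here is a minimal sketch (not the PR's code): expanding a single zero produces a stride-0 view, which is exactly the kind of non-contiguous tensor NCCL-backed collectives reject.

```python
import torch

num_channels = 3

# Expanding a single zero to 2 * num_channels + 1 elements creates a view
# backed by one storage element: its stride is 0 and it is not contiguous.
combined = torch.zeros(1).expand(2 * num_channels + 1)

print(combined.stride())          # (0,)
print(combined.is_contiguous())   # False

# Materializing the view restores contiguity, but also allocates the full
# buffer again, which cancels the hoped-for saving before the data hits
# the wire.
materialized = combined.contiguous()
print(materialized.is_contiguous())  # True
```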
```
File "/raid/shenli/pytorch/torch/distributed/distributed_c10d.py", line 2130, in _all_gather_base
    work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: Tensors must be contiguous
Exception raised from check_gpu_single_tensor at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1227 (most recent call first)
```
Hit the above error, caused by the check in pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp, lines 1226 to 1228 (at 835cc66).
Ok then.
As a side note, I think you should look into that, as it is potentially a major bandwidth gain (and if I understand correctly, bandwidth is an expensive commodity).
```python
num_channels = saved_input.shape[1]
if self.needs_input_grad[0]:
    # launch all_reduce to unblock other peer processes
    combined = torch.zeros(
```
Same question as above about using an expanded Tensor to reduce bandwidth use.
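For context, the backward-pass idea behind the snippet above can be sketched like this (illustrative only; a plain sum stands in for the distributed `all_reduce`, and the names are not the exact PR code):

```python
import torch

num_channels = 3

# A rank whose input was empty still contributes an all-zero buffer to the
# all_reduce so its peers are not blocked; zeros leave the summed gradient
# contributions from the other ranks unchanged.
local_contribution = torch.zeros(2 * num_channels + 1)

# Simulate the all_reduce sum across two ranks within one process:
peer_contribution = torch.arange(2 * num_channels + 1, dtype=torch.float32)
reduced = local_contribution + peer_contribution

print(torch.equal(reduced, peer_contribution))  # True
```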
fixes #36530

Prior to this commit, SyncBatchNorm crashes with the following error message.

```
File "..../torch/nn/modules/_functions.py", line 17, in forward
    mean, invstd = torch.batch_norm_stats(input, eps)
RuntimeError: cannot reshape tensor of 0 elements into shape [0, 3, -1] because the unspecified dimension size -1 can be any value and is ambiguous
```

This PR adds a dedicated branch to handle empty inputs. When a process receives empty inputs, it sets its local `mean`, `invstd`, and `count` to zero and participates in the `all_gather` collective communications in the forward pass. `mean` and `invstd` entries with zero count are then filtered out before computing the global mean and invstd. In the backward pass, it likewise participates in the `all_reduce` communication with zero tensors to unblock its peers.

Differential Revision: [D35273409](https://our.internmc.facebook.com/intern/diff/D35273409)

[ghstack-poisoned]
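The forward-pass branch described above can be sketched roughly as follows. This is a simplified, CPU-friendly stand-in: plain reductions replace `torch.batch_norm_stats` (which is CUDA-only), the collective is omitted, and the names are illustrative rather than the exact PR code.

```python
import torch

def local_stats(input, eps=1e-5):
    # Compute per-channel mean/invstd plus a sample count, packed into one
    # tensor ready for all_gather; input has shape (N, C, ...).
    num_channels = input.shape[1]
    if input.numel() > 0:
        flat = input.transpose(0, 1).reshape(num_channels, -1)
        var, mean = torch.var_mean(flat, dim=1, unbiased=False)
        invstd = torch.rsqrt(var + eps)
        count = torch.full((1,), flat.shape[1], dtype=input.dtype)
    else:
        # empty input: set all local stats to zero; peers filter this rank
        # out after the all_gather by checking count == 0
        mean = torch.zeros(num_channels, dtype=input.dtype)
        invstd = torch.zeros(num_channels, dtype=input.dtype)
        count = torch.zeros(1, dtype=input.dtype)
    return torch.cat([mean, invstd, count], dim=0)

# An empty batch no longer crashes: it yields 2 * C + 1 zeros that are safe
# to all_gather and are ignored when computing the global statistics.
empty_combined = local_stats(torch.empty(0, 3, 4))
print(empty_combined)  # tensor of 7 zeros
```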
```python
# input does not require grad
x.requires_grad = False
self._test_not_nan(model, x)
```
I think I agree. It's not going to be the same gradient because the minibatch statistics will be different in the two cases.
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
datumbox left a comment:
LGTM from my side. My tests on real data show that the issue is fixed.
Summary: Pull Request resolved: #74944

fixes #36530

Prior to this commit, SyncBatchNorm crashes with the following error message.

```
File "..../torch/nn/modules/_functions.py", line 17, in forward
    mean, invstd = torch.batch_norm_stats(input, eps)
RuntimeError: cannot reshape tensor of 0 elements into shape [0, 3, -1] because the unspecified dimension size -1 can be any value and is ambiguous
```

This PR adds a dedicated branch to handle empty inputs. When a process receives empty inputs, it sets its local `mean`, `invstd`, and `count` to zero and participates in the `all_gather` collective communications in the forward pass. `mean` and `invstd` entries with zero count are then filtered out before computing the global mean and invstd. In the backward pass, it likewise participates in the `all_reduce` communication with zero tensors to unblock its peers.

Differential Revision: D35273409

Test Plan: Imported from OSS
Reviewed By: datumbox
Pulled By: mrshenli
fbshipit-source-id: 1cee51eea866773c329b3fbf5da2be8a5fee6f0f
Hey @mrshenli.
Stack from ghstack:
fixes #36530
Prior to this commit, SyncBatchNorm crashes with the following
error message.
This PR adds a dedicated branch to handle empty inputs. When a process receives empty inputs, it will set its local `mean`, `invstd`, and `count` to zero, and participate in the `all_gather` collective communications in the forward pass. Then `mean` and `invstd` with zero count will be filtered out before computing the global mean and invstd. In the backward pass, it also participates in the `all_reduce` communication with zero tensors to unblock its peers.

Differential Revision: D35273409