speed up SyncBatchNorm by batching distributed communication#38246

Closed
vkuzo wants to merge 2 commits into gh/vkuzo/62/base from gh/vkuzo/62/head

Conversation

@vkuzo
Contributor

@vkuzo vkuzo commented May 11, 2020

Stack from ghstack:

Summary:

Speeds up SyncBatchNorm by batching the distributed communication.
Initial benchmarks show a ~15% speed improvement on MobileNetV2 and
EfficientNetB3 on a single machine with 8 GPUs. The improvement over
the baseline grows as the number of GPUs increases.
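The idea, sketched below (an illustrative approximation, not the exact diff; the helper name `pack_and_gather_stats` is made up for this example): pack the per-GPU mean, invstd, and element count into a single tensor so that one `all_gather` replaces several smaller collectives per SyncBatchNorm layer.

```python
import torch
import torch.distributed as dist

def pack_and_gather_stats(mean, invstd, count, world_size):
    # mean, invstd: [C]; count: [1]. Pack everything into one [2C + 1] tensor
    # so only one collective is launched instead of one per statistic.
    combined = torch.cat([mean, invstd, count], dim=0)
    combined_list = [torch.empty_like(combined) for _ in range(world_size)]
    dist.all_gather(combined_list, combined)        # single batched collective
    combined = torch.stack(combined_list, dim=0)    # [world_size, 2C + 1]
    mean_all, invstd_all, count_all = torch.split(
        combined, [mean.numel(), invstd.numel(), count.numel()], dim=1
    )
    return mean_all, invstd_all, count_all
```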

Test Plan:

```
python test/run_test.py -v -i distributed/test_distributed
# there were some test failures, but they were also present in master
```

Verified that intermediate values in the forward and backward passes are equivalent before and after this change (checked with `torch.allclose`).

benchmark runner:
https://gist.github.com/vkuzo/7b1ce1b1b051ee6d46877d0f18ab9b1f

results (1 forward pass + 1 backward pass, 1 machine, 8x Tesla-P100, batch_size=20 per node):
```
model           gpus  before_ms after_ms  speedup
efficientnet-b3 2     660       654       0.00909
efficientnet-b3 4     777       710       0.08623
efficientnet-b3 8     988       838       0.15182
mobilenet-v2    2     267       266       0.00375
mobilenet-v2    4     328       289       0.1189
mobilenet-v2    8     453       373       0.1766
```
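For reference, the speedup column appears to be the fractional reduction in wall-clock time, i.e. `(before_ms - after_ms) / before_ms`:

```python
def speedup(before_ms, after_ms):
    # e.g. efficientnet-b3 on 8 GPUs: (988 - 838) / 988 ≈ 0.15182
    return (before_ms - after_ms) / before_ms

print(speedup(988, 838))  # ~0.152
```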

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D21505905

@vkuzo vkuzo requested a review from apaszke as a code owner May 11, 2020 17:25
vkuzo added a commit that referenced this pull request May 11, 2020
Summary:

Speeds up SyncBatchNorm by batching the distributed communication.
Initial benchmarks show a ~15% improvement on MobileNetV2 and
EfficientNetB3 on a single machine with 8 gpus. Improvement
vs baseline increases as # of gpus increases.

Test Plan:

benchmark runner:
https://gist.github.com/vkuzo/7b1ce1b1b051ee6d46877d0f18ab9b1f

results (1 machine, 8x Tesla-P100):
```
model           gpus  before_ms after_ms  speedup
efficientnet-b3 2     660       654       0.00909
efficientnet-b3 4     777       710       0.08623
efficientnet-b3 8     988       838       0.15182
mobilenet-v2    2     267       266       0.00375
mobilenet-v2    4     328       289       0.1189
mobilenet-v2    8     453       373       0.1766
```

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 5298c5a
Pull Request resolved: #38246
@dr-ci

dr-ci Bot commented May 11, 2020

💊 CI failures summary and remediations

As of commit 35aade5 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_backward_compatibility_check_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

May 11 20:10:37 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.
May 11 20:10:37 processing existing schema:  aten::std_mean(Tensor self, bool unbiased=True) -> (Tensor, Tensor) 
May 11 20:10:37 processing existing schema:  aten::std_mean.dim(Tensor self, int[1] dim, bool unbiased=True, bool keepdim=False) -> (Tensor, Tensor) 
May 11 20:10:37 processing existing schema:  aten::std_mean.names_dim(Tensor self, str[1] dim, bool unbiased=True, bool keepdim=False) -> (Tensor, Tensor) 
May 11 20:10:37 processing existing schema:  aten::t(Tensor(a) self) -> (Tensor(a)) 
May 11 20:10:37 processing existing schema:  aten::t_(Tensor(a!) self) -> (Tensor(a!)) 
May 11 20:10:37 processing existing schema:  aten::tan_(Tensor(a!) self) -> (Tensor(a!)) 
May 11 20:10:37 processing existing schema:  aten::tanh_(Tensor(a!) self) -> (Tensor(a!)) 
May 11 20:10:37 processing existing schema:  aten::transpose_(Tensor(a!) self, int dim0, int dim1) -> (Tensor(a!)) 
May 11 20:10:37 processing existing schema:  aten::_trilinear(Tensor i1, Tensor i2, Tensor i3, int[] expand1, int[] expand2, int[] expand3, int[] sumdim, int unroll_dim=1) -> (Tensor) 
May 11 20:10:37 processing existing schema:  aten::type_as(Tensor self, Tensor other) -> (Tensor) 
May 11 20:10:37 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.  
May 11 20:10:37  
May 11 20:10:37 Broken ops: [ 
May 11 20:10:37 	aten::unfold_backward(Tensor grad_in, int[] input_sizes, int dim, int size, int step) -> (Tensor) 
May 11 20:10:37 	quantized::linear_prepack_legacy(Tensor W, Tensor? B=None) -> (Tensor W_prepack) 
May 11 20:10:37 	quantized::linear_prepack(Tensor W, Tensor? B=None) -> (__torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack) 
May 11 20:10:37 	quantized::linear_prepack_fp16(Tensor W, Tensor? B=None) -> (__torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack) 
May 11 20:10:37 	quantized::linear_prepack_fp16_legacy(Tensor W, Tensor? B=None) -> (Tensor W_prepack) 
May 11 20:10:37 	_quantized::linear_prepack(Tensor W, Tensor? B=None) -> (__torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack) 
May 11 20:10:37 	_quantized::linear_dynamic(Tensor X, __torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack) -> (Tensor Y) 
May 11 20:10:37 	_quantized::linear_prepack_legacy(Tensor W, Tensor? B=None) -> (Tensor W_prepack) 


@vkuzo vkuzo requested review from mrshenli and ngimel May 11, 2020 18:46
Comment thread torch/nn/modules/_functions.py Outdated
vkuzo added a commit that referenced this pull request May 11, 2020
Summary:

Speeds up SyncBatchNorm by batching the distributed communication.
Initial benchmarks show a ~15% improvement on MobileNetV2 and
EfficientNetB3 on a single machine with 8 gpus. Improvement
vs baseline increases as # of gpus increases.

Test Plan:

benchmark runner:
https://gist.github.com/vkuzo/7b1ce1b1b051ee6d46877d0f18ab9b1f

results (1 machine, 8x Tesla-P100):
```
model           gpus  before_ms after_ms  speedup
efficientnet-b3 2     660       654       0.00909
efficientnet-b3 4     777       710       0.08623
efficientnet-b3 8     988       838       0.15182
mobilenet-v2    2     267       266       0.00375
mobilenet-v2    4     328       289       0.1189
mobilenet-v2    8     453       373       0.1766
```

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 519a488
Pull Request resolved: #38246
Comment thread torch/nn/modules/_functions.py
@facebook-github-bot
Contributor

This pull request has been merged in f64d24c.

```python
combined_list = [
    torch.empty_like(combined) for k in range(world_size)
]
# Use allgather instead of allreduce since I don't trust in-place operations ..
```
Contributor


@vkuzo Wondering if there is a better reason than this not to use allreduce :) We use in-place allreduce operations a lot, and in general it's much faster than allgather.

Collaborator


You'd need to do a weighted sum here, not just an all-reduce. Here the amount of data transferred is very small, so you'd be limited just by the latency of the operation, which should be approximately the same for allreduce and allgather.
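A sketch of the weighted combination being described (illustrative only; `combine_stats` is a made-up helper, not an API, and per-GPU variances are assumed to be biased/population variances): a plain all-reduce of per-GPU means is only correct when every GPU processes the same number of elements, so the statistics have to be combined with count weights.

```python
import torch

def combine_stats(means, variances, counts):
    # means, variances: [world_size, C]; counts: [world_size, 1]
    total = counts.sum()
    # Count-weighted global mean.
    global_mean = (counts * means).sum(dim=0) / total
    # Global (population) variance: within-GPU variance plus the
    # between-GPU spread of the per-GPU means.
    global_var = (counts * (variances + (means - global_mean) ** 2)).sum(dim=0) / total
    return global_mean, global_var
```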

Contributor Author


I'm definitely not an expert in this. After finding that this was a bottleneck, for this diff I took the approach of minimizing risk and copying the battle-tested piece of the corresponding detectron2 layer (including this comment :) ).

Do we have an estimate of the performance savings from using allreduce? We can update it if it's worth it.

Contributor


@ngimel Was looking through this code again and wasn't sure about the weighted all-reduce part. Don't we need to find the global mean and variance for SyncBN, and can't that be done using all-reduce?
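One way an all-reduce formulation could work (a sketch under the assumption that per-rank sums, rather than per-rank means, are reduced; `global_stats_allreduce` is a made-up name and this is not what the PR does): summing raw sums, sums of squares, and counts with a single in-place `all_reduce` stays correct even when ranks see different numbers of elements.

```python
import torch
import torch.distributed as dist

def global_stats_allreduce(x_local, eps=1e-5):
    # x_local: [N_local, C] activations on this rank; N_local may differ per rank.
    n = x_local.new_full((1,), float(x_local.shape[0]))
    # Pack per-rank sum, sum of squares, and count so one SUM all_reduce
    # produces globally correct statistics.
    packed = torch.cat([x_local.sum(dim=0), (x_local * x_local).sum(dim=0), n])
    dist.all_reduce(packed, op=dist.ReduceOp.SUM)
    c = x_local.shape[1]
    sum_x, sum_sq, count = packed[:c], packed[c:2 * c], packed[2 * c]
    mean = sum_x / count
    var = sum_sq / count - mean * mean  # less numerically stable than a two-pass variance
    invstd = torch.rsqrt(var + eps)
    return mean, invstd
```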

vkuzo added a commit that referenced this pull request Aug 30, 2020
Summary:

This is a redo of #38874, and
fixing my original bug from
#38246.

Test Plan: CI

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Aug 30, 2020
Summary:

This is a redo of #38874, and
fixing my original bug from
#38246.

Test Plan: CI

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: fb76ed7
Pull Request resolved: #43861
facebook-github-bot pushed a commit that referenced this pull request Sep 2, 2020
…43861)

Summary:
Pull Request resolved: #43861

This is a redo of #38874, and
fixing my original bug from
#38246.

Test Plan:
CI

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23418816

fbshipit-source-id: 2a3a3d67fc2d03bb0bf30a87cce4e805ac8839fb
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…#38246)

Summary:
Pull Request resolved: pytorch#38246

Speeds up SyncBatchNorm by batching the distributed communication.
Initial benchmarks show a ~15+% speed improvement on MobileNetV2 and
EfficientNetB3 on a single machine with 8 gpus. Improvement
vs baseline increases as # of gpus increases.

Test Plan:
verified that before+after intermediate values in fwd/bwd pass are equivalent (with `torch.allclose`)

benchmark runner:
https://gist.github.com/vkuzo/7b1ce1b1b051ee6d46877d0f18ab9b1f

results (1 forward pass + 1 backward pass, 1 machine, 8x Tesla-P100, batch_size=20 per node):
```
model           gpus  before_ms after_ms  speedup
efficientnet-b3 2     660       654       0.00909
efficientnet-b3 4     777       710       0.08623
efficientnet-b3 8     988       838       0.15182
mobilenet-v2    2     267       266       0.00375
mobilenet-v2    4     328       289       0.1189
mobilenet-v2    8     453       373       0.1766
```

Imported from OSS

Differential Revision: D21505905

fbshipit-source-id: 3e796343fce8329a2e17671d60ae66c0387924e7
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…ytorch#43861)

Summary:
Pull Request resolved: pytorch#43861

This is a redo of pytorch#38874, and
fixing my original bug from
pytorch#38246.

Test Plan:
CI

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23418816

fbshipit-source-id: 2a3a3d67fc2d03bb0bf30a87cce4e805ac8839fb