speed up SyncBatchNorm by batching distributed communication#38246
vkuzo wants to merge 2 commits into gh/vkuzo/62/base
Conversation
Summary: Speeds up SyncBatchNorm by batching the distributed communication. Initial benchmarks show a ~15% improvement on MobileNetV2 and EfficientNetB3 on a single machine with 8 GPUs. The improvement over baseline increases as the number of GPUs increases.

Test Plan:

benchmark runner: https://gist.github.com/vkuzo/7b1ce1b1b051ee6d46877d0f18ab9b1f

results (1 machine, 8x Tesla-P100):

```
model            gpus  before_ms  after_ms  speedup
efficientnet-b3  2     660        654       0.00909
efficientnet-b3  4     777        710       0.08623
efficientnet-b3  8     988        838       0.15182
mobilenet-v2     2     267        266       0.00375
mobilenet-v2     4     328        289       0.1189
mobilenet-v2     8     453        373       0.1766
```

Reviewers:
Subscribers:
Tasks:
Tags:

[ghstack-poisoned]
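The batching idea behind the speedup can be sketched as follows. This is a simplified single-process simulation (the real implementation uses `torch.distributed.all_gather` on CUDA tensors); `pack_stats` and `unpack_stats` are hypothetical helper names introduced here for illustration:

```python
# Simplified, single-process sketch of the batching idea (assumption: the
# real code issues torch.distributed.all_gather on CUDA tensors; pack_stats
# and unpack_stats are hypothetical helper names).

def pack_stats(mean, var, count):
    # Concatenate per-channel mean, per-channel var, and the element count
    # into one flat buffer so a single collective replaces several small ones.
    return list(mean) + list(var) + [float(count)]

def unpack_stats(buf, num_channels):
    mean = buf[:num_channels]
    var = buf[num_channels:2 * num_channels]
    count = buf[2 * num_channels]
    return mean, var, count

# Two simulated ranks, 2 channels each; this list stands in for all_gather.
gathered = [
    pack_stats([1.0, 2.0], [0.5, 0.5], 8),
    pack_stats([3.0, 4.0], [0.5, 0.5], 8),
]

means, counts = [], []
for buf in gathered:
    m, _, c = unpack_stats(buf, 2)
    means.append(m)
    counts.append(c)

# Count-weighted global mean per channel.
total = sum(counts)
global_mean = [
    sum(m[ch] * c for m, c in zip(means, counts)) / total
    for ch in range(2)
]
print(global_mean)  # [2.0, 3.0]
```

Packing the statistics into one buffer trades a little extra copying for fewer collective calls, which is what makes the latency savings grow with the number of GPUs.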
💊 CI failures summary and remediations

As of commit 35aade5 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:
Summary: Speeds up SyncBatchNorm by batching the distributed communication. Initial benchmarks show a ~15+% speed improvement on MobileNetV2 and EfficientNetB3 on a single machine with 8 GPUs. The improvement over baseline increases as the number of GPUs increases.

Test Plan:

```
python test/run_test.py -v -i distributed/test_distributed
# there were some test failures, but they were also present in master
```

Verified that before/after intermediate values in the forward and backward passes are equivalent (with `torch.allclose`).

benchmark runner: https://gist.github.com/vkuzo/7b1ce1b1b051ee6d46877d0f18ab9b1f

results (1 forward pass + 1 backward pass, 1 machine, 8x Tesla-P100, batch_size=20 per node):

```
model            gpus  before_ms  after_ms  speedup
efficientnet-b3  2     660        654       0.00909
efficientnet-b3  4     777        710       0.08623
efficientnet-b3  8     988        838       0.15182
mobilenet-v2     2     267        266       0.00375
mobilenet-v2     4     328        289       0.1189
mobilenet-v2     8     453        373       0.1766
```

Reviewers:
Subscribers:
Tasks:
Tags:

Differential Revision: [D21505905](https://our.internmc.facebook.com/intern/diff/D21505905)

[ghstack-poisoned]
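The `torch.allclose` check mentioned in the test plan compares tensors elementwise within a tolerance; a minimal pure-Python sketch of its documented criterion, `|a - b| <= atol + rtol * |b|` per element (the `allclose` helper here is just illustrative, not the PyTorch implementation):

```python
# Pure-Python sketch of the elementwise closeness criterion used by
# torch.allclose: |a - b| <= atol + rtol * |b| must hold for every element.

def allclose(a, b, rtol=1e-5, atol=1e-8):
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

print(allclose([1.0, 2.0], [1.0, 2.0 + 1e-9]))  # True
print(allclose([1.0], [2.0]))                   # False
```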
This pull request has been merged in f64d24c.
```
combined_list = [
    torch.empty_like(combined) for k in range(world_size)
]
# Use allgather instead of allreduce since I don't trust in-place operations ..
```
@vkuzo Wondering if there is a better reason than this not to use allreduce :) We use in-place allreduce operations a lot, and in general it's much faster than allgather.
You'd need to do a weighted sum here, not just an all-reduce. The amount of data transferred here is very small, so you'd be limited mostly by the latency of the operation, which should be approximately the same for allreduce and allgather.
I'm definitely not an expert in this - after finding that this was a bottleneck, for this diff I took the approach of minimizing risk and copying the battle-tested piece of the corresponding detectron2 layer (including this comment :) ).

Do we have an estimate of the performance savings from using allreduce? We can update it if it's worth it.
@ngimel I was looking through this code again and wasn't sure about the weighted all-reduce part. Don't we need to find the global mean and variance for SyncBN, and can't that be done using all-reduce?
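For reference, the count-weighted reduction the thread is discussing can be sketched as follows: per-rank (mean, var, count) triples combine into global statistics via the law of total variance, which is why a plain sum all-reduce of the means would be wrong whenever ranks hold different batch sizes. The `combine` helper below is a hypothetical name introduced for illustration:

```python
# Hypothetical sketch of combining per-rank batch-norm statistics for one
# channel. A plain all-reduce sum of means is only correct when every rank
# contributes the same element count; in general the reduction is weighted.

def combine(stats):
    # stats: list of (mean, var, count) tuples, one per rank.
    total = sum(c for _, _, c in stats)
    g_mean = sum(m * c for m, _, c in stats) / total
    # Law of total variance: weighted within-rank var plus between-rank spread.
    g_var = sum((v + (m - g_mean) ** 2) * c for m, v, c in stats) / total
    return g_mean, g_var, total

# Rank 0 saw [0, 2] (mean 1, var 1, n=2); rank 1 saw [4] (mean 4, var 0, n=1).
mean, var, count = combine([(1.0, 1.0, 2), (4.0, 0.0, 1)])
print(mean, count)  # matches the pooled data [0, 2, 4]: mean 2.0 over 3 samples
```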
…#38246)

Summary: Pull Request resolved: pytorch#38246

Speeds up SyncBatchNorm by batching the distributed communication. Initial benchmarks show a ~15+% speed improvement on MobileNetV2 and EfficientNetB3 on a single machine with 8 GPUs. The improvement over baseline increases as the number of GPUs increases.

Test Plan: verified that before/after intermediate values in the forward and backward passes are equivalent (with `torch.allclose`).

benchmark runner: https://gist.github.com/vkuzo/7b1ce1b1b051ee6d46877d0f18ab9b1f

results (1 forward pass + 1 backward pass, 1 machine, 8x Tesla-P100, batch_size=20 per node):

```
model            gpus  before_ms  after_ms  speedup
efficientnet-b3  2     660        654       0.00909
efficientnet-b3  4     777        710       0.08623
efficientnet-b3  8     988        838       0.15182
mobilenet-v2     2     267        266       0.00375
mobilenet-v2     4     328        289       0.1189
mobilenet-v2     8     453        373       0.1766
```

Imported from OSS

Differential Revision: D21505905

fbshipit-source-id: 3e796343fce8329a2e17671d60ae66c0387924e7
…ytorch#43861)

Summary: Pull Request resolved: pytorch#43861

This is a redo of pytorch#38874, fixing my original bug from pytorch#38246.

Test Plan: CI

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23418816

fbshipit-source-id: 2a3a3d67fc2d03bb0bf30a87cce4e805ac8839fb