speed up SyncBatchNorm by batching distributed communication#38246
vkuzo wants to merge 2 commits into gh/vkuzo/62/base
Conversation
Summary: Speeds up SyncBatchNorm by batching the distributed communication. Initial benchmarks show a ~15% improvement on MobileNetV2 and EfficientNetB3 on a single machine with 8 GPUs. The improvement over baseline increases as the number of GPUs increases.

Test Plan:

benchmark runner: https://gist.github.com/vkuzo/7b1ce1b1b051ee6d46877d0f18ab9b1f

results (1 machine, 8x Tesla-P100):

```
model            gpus  before_ms  after_ms  speedup
efficientnet-b3  2     660        654       0.00909
efficientnet-b3  4     777        710       0.08623
efficientnet-b3  8     988        838       0.15182
mobilenet-v2     2     267        266       0.00375
mobilenet-v2     4     328        289       0.1189
mobilenet-v2     8     453        373       0.1766
```

Reviewers:
Subscribers:
Tasks:
Tags:

[ghstack-poisoned]
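The batching idea behind the speedup can be sketched as follows. This is a simplified single-process simulation (the real implementation uses `torch.distributed.all_gather` on CUDA tensors); `pack_stats` and `unpack_stats` are hypothetical helper names introduced here for illustration:

```python
# Simplified, single-process sketch of the batching idea (assumption: the
# real code issues torch.distributed.all_gather on CUDA tensors; pack_stats
# and unpack_stats are hypothetical helper names).

def pack_stats(mean, var, count):
    # Concatenate per-channel mean, per-channel var, and the element count
    # into one flat buffer so a single collective replaces several small ones.
    return list(mean) + list(var) + [float(count)]

def unpack_stats(buf, num_channels):
    mean = buf[:num_channels]
    var = buf[num_channels:2 * num_channels]
    count = buf[2 * num_channels]
    return mean, var, count

# Two simulated ranks, 2 channels each; this list stands in for all_gather.
gathered = [
    pack_stats([1.0, 2.0], [0.5, 0.5], 8),
    pack_stats([3.0, 4.0], [0.5, 0.5], 8),
]

means, counts = [], []
for buf in gathered:
    m, _, c = unpack_stats(buf, 2)
    means.append(m)
    counts.append(c)

# Count-weighted global mean per channel.
total = sum(counts)
global_mean = [
    sum(m[ch] * c for m, c in zip(means, counts)) / total
    for ch in range(2)
]
print(global_mean)  # [2.0, 3.0]
```

Packing the statistics into one buffer trades a little extra copying for fewer collective calls, which is what makes the latency savings grow with the number of GPUs.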
💊 CI failures summary and remediations

As of commit 35aade5 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:
Summary: Speeds up SyncBatchNorm by batching the distributed communication. Initial benchmarks show a ~15+% speed improvement on MobileNetV2 and EfficientNetB3 on a single machine with 8 GPUs. The improvement over baseline increases as the number of GPUs increases.

Test Plan:

```
python test/run_test.py -v -i distributed/test_distributed
# there were some test failures, but they were also present in master
```

Verified that before/after intermediate values in the forward and backward passes are equivalent (with `torch.allclose`).

benchmark runner: https://gist.github.com/vkuzo/7b1ce1b1b051ee6d46877d0f18ab9b1f

results (1 forward pass + 1 backward pass, 1 machine, 8x Tesla-P100, batch_size=20 per node):

```
model            gpus  before_ms  after_ms  speedup
efficientnet-b3  2     660        654       0.00909
efficientnet-b3  4     777        710       0.08623
efficientnet-b3  8     988        838       0.15182
mobilenet-v2     2     267        266       0.00375
mobilenet-v2     4     328        289       0.1189
mobilenet-v2     8     453        373       0.1766
```

Reviewers:
Subscribers:
Tasks:
Tags:

Differential Revision: [D21505905](https://our.internmc.facebook.com/intern/diff/D21505905)

[ghstack-poisoned]
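The `torch.allclose` check mentioned in the test plan compares tensors elementwise within a tolerance; a minimal pure-Python sketch of its documented criterion, `|a - b| <= atol + rtol * |b|` per element (the `allclose` helper here is just illustrative, not the PyTorch implementation):

```python
# Pure-Python sketch of the elementwise closeness criterion used by
# torch.allclose: |a - b| <= atol + rtol * |b| must hold for every element.

def allclose(a, b, rtol=1e-5, atol=1e-8):
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

print(allclose([1.0, 2.0], [1.0, 2.0 + 1e-9]))  # True
print(allclose([1.0], [2.0]))                   # False
```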
This pull request has been merged in f64d24c.
```
combined_list = [
    torch.empty_like(combined) for k in range(world_size)
]
# Use allgather instead of allreduce since I don't trust in-place operations ..
```
@vkuzo Wondering if there is a better reason than this not to use allreduce :) We use in-place allreduce operations a lot, and in general it's much faster than allgather.
You'd need to do a weighted sum here, not just an all-reduce. The amount of data transferred here is very small, so you'd be limited mostly by the latency of the operation, which should be approximately the same for allreduce and allgather.
I'm definitely not an expert in this - after finding that this was a bottleneck, for this diff I took the approach of minimizing risk and copying the battle-tested piece of the corresponding detectron2 layer (including this comment :) ).

Do we have an estimate of the performance savings from using allreduce? We can update it if it's worth it.
@ngimel I was looking through this code again and wasn't sure about the weighted all-reduce part. Don't we need to find the global mean and variance for SyncBN, and can't that be done using all-reduce?
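For reference, the count-weighted reduction the thread is discussing can be sketched as follows: per-rank (mean, var, count) triples combine into global statistics via the law of total variance, which is why a plain sum all-reduce of the means would be wrong whenever ranks hold different batch sizes. The `combine` helper below is a hypothetical name introduced for illustration:

```python
# Hypothetical sketch of combining per-rank batch-norm statistics for one
# channel. A plain all-reduce sum of means is only correct when every rank
# contributes the same element count; in general the reduction is weighted.

def combine(stats):
    # stats: list of (mean, var, count) tuples, one per rank.
    total = sum(c for _, _, c in stats)
    g_mean = sum(m * c for m, _, c in stats) / total
    # Law of total variance: weighted within-rank var plus between-rank spread.
    g_var = sum((v + (m - g_mean) ** 2) * c for m, v, c in stats) / total
    return g_mean, g_var, total

# Rank 0 saw [0, 2] (mean 1, var 1, n=2); rank 1 saw [4] (mean 4, var 0, n=1).
mean, var, count = combine([(1.0, 1.0, 2), (4.0, 0.0, 1)])
print(mean, count)  # matches the pooled data [0, 2, 4]: mean 2.0 over 3 samples
```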
…#38246)

Summary: Pull Request resolved: pytorch#38246

Speeds up SyncBatchNorm by batching the distributed communication. Initial benchmarks show a ~15+% speed improvement on MobileNetV2 and EfficientNetB3 on a single machine with 8 GPUs. The improvement over baseline increases as the number of GPUs increases.

Test Plan: verified that before/after intermediate values in the forward and backward passes are equivalent (with `torch.allclose`).

benchmark runner: https://gist.github.com/vkuzo/7b1ce1b1b051ee6d46877d0f18ab9b1f

results (1 forward pass + 1 backward pass, 1 machine, 8x Tesla-P100, batch_size=20 per node):

```
model            gpus  before_ms  after_ms  speedup
efficientnet-b3  2     660        654       0.00909
efficientnet-b3  4     777        710       0.08623
efficientnet-b3  8     988        838       0.15182
mobilenet-v2     2     267        266       0.00375
mobilenet-v2     4     328        289       0.1189
mobilenet-v2     8     453        373       0.1766
```

Imported from OSS

Differential Revision: D21505905

fbshipit-source-id: 3e796343fce8329a2e17671d60ae66c0387924e7
…ytorch#43861)

Summary: Pull Request resolved: pytorch#43861

This is a redo of pytorch#38874, fixing my original bug from pytorch#38246.

Test Plan: CI

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23418816

fbshipit-source-id: 2a3a3d67fc2d03bb0bf30a87cce4e805ac8839fb