Speed up HistogramObserver by vectorizing critical path#41041
durumu wants to merge 6 commits into gh/durumu/9/base
Conversation
[ghstack-poisoned]
💊 CI failures summary and remediations: As of commit 11ce38b (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚 (This comment was automatically generated by Dr. CI and has been revised 17 times.)
Nice! In general this looks good. Can we just add to the test plan:
Differential Revision: [D22400755](https://our.internmc.facebook.com/intern/diff/D22400755) [ghstack-poisoned]
```python
norm = norm + _get_norm(delta_begin, delta_end, density, norm_type)
return norm
```

```python
src_bin = torch.arange(self.bins).numpy()
```
PyTorch doesn't have a NumPy dependency for its functionality (although we do for some tests), and we shouldn't use NumPy functionality in lieu of our own. Uses of NumPy should be restricted to testing and NumPy interop.
Thanks for the feedback -- I changed my code to get rid of the numpy dependency.
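For reference, a minimal sketch of what the torch-native change amounts to; `bins` here is a hypothetical stand-in for `self.bins` from the quoted snippet, and the default of 2048 bins is taken from the PR summary:

```python
import torch

bins = 2048  # HistogramObserver's default bin count; stands in for self.bins

# Before (flagged in review): hops out of torch into NumPy.
# src_bin = torch.arange(bins).numpy()

# After: stay on torch tensors so downstream indexing and arithmetic
# remain vectorized torch ops with no NumPy dependency.
src_bin = torch.arange(bins)
```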
```python
delta_end = src_bin_end - dst_bin_of_end_center
norm = norm + _get_norm(delta_begin, delta_end, density, norm_type)
return norm
```
This can be optimized further by the following approximation:

Quantization error ≈ (StepSize^2 / 12) * Q + sum(P[i] * (BinCenter[i] - next_start_bin)^2) + sum(P[i] * (BinCenter[i] - next_end_bin)^2)

Q = sum(hist[next_start_bin:next_end_bin])

where the first sum runs over the bins below next_start_bin and the second sum over the bins above next_end_bin. With this approximation, we only need to compute two indices: where next_start_bin and next_end_bin map to in terms of the original histogram indices.
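A restatement of the approximation above in LaTeX, with assumed symbol names (Δ for StepSize, c_i for BinCenter[i], s and e for the indices that next_start_bin and next_end_bin map to in the original histogram):

```latex
% Quantization-error approximation from the review comment above.
% \Delta = step size, c_i = center of bin i, P_i = hist[i],
% s = next_start_bin index, e = next_end_bin index,
% Q = \sum_{i=s}^{e-1} P_i  (mass inside the clipped range).
E \;\approx\; \frac{\Delta^2}{12}\, Q
  \;+\; \sum_{i < s} P_i \,\bigl(c_i - c_s\bigr)^2
  \;+\; \sum_{i > e} P_i \,\bigl(c_i - c_e\bigr)^2
```

The first term is the standard uniform-quantization error for the mass kept inside the range; the two sums charge the clipped bins the squared distance to the nearest range edge.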
Summary: 22x speedup over the code this replaces. Tested on ResNet18 on a devvm using CPU only, using default parameters for HistogramObserver (i.e. 2048 bins).

Pull Request resolved: pytorch#41041

Test Plan: To run the test against the reference (old) implementation, you can use `python test/test_quantization.py TestRecordHistogramObserver.test_histogram_observer_against_reference`. To run the benchmark, while in the folder `benchmarks/operator_benchmark`, you can use `python -m benchmark_all_quantized_test --operators HistogramObserverCalculateQparams`.

Benchmark results before speedup:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 185818.566

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 165325.916
```

Benchmark results after speedup:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 12242.241

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 12655.354
```

Reviewed By: raghuramank100

Differential Revision: D22400755

Pulled By: durumu

fbshipit-source-id: 639ac796a554710a33c8a930c1feae95a1148718
Roughly a 22x speedup over the code this replaces when tested on ResNet18 on a devvm using CPU only, using default parameters for HistogramObserver (i.e. 2048 bins). The script I ran to test this is here.
Roughly a 14x speedup when tested using the benchmark from #42138 (also CPU only).
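The flavor of the vectorization can be sketched as below. This is an illustrative toy, not the PR's actual code: the function names are hypothetical, and the integral of x²·density over [delta_begin, delta_end] follows the `_get_norm` L2 case quoted in the review snippets above.

```python
import torch

def norm_loop(density, delta_begin, delta_end):
    # Reference scalar path: one Python iteration per histogram bin,
    # accumulating density * (end^3 - begin^3) / 3 for each bin.
    total = 0.0
    for d, b, e in zip(density.tolist(), delta_begin.tolist(), delta_end.tolist()):
        total += d * (e ** 3 - b ** 3) / 3
    return total

def norm_vectorized(density, delta_begin, delta_end):
    # Same computation as tensor ops: the whole per-bin loop collapses
    # into a few fused passes over the tensors, which is the kind of
    # rewrite the PR applies to HistogramObserver's critical path.
    return torch.sum(density * (delta_end ** 3 - delta_begin ** 3) / 3).item()

bins = 2048  # HistogramObserver default
density = torch.rand(bins)
delta_begin = torch.rand(bins)
delta_end = delta_begin + torch.rand(bins)
```

The two functions agree up to float32 rounding; the speedup comes entirely from removing per-bin Python interpreter overhead, not from changing the math.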
Stack from ghstack:
Differential Revision: D22400755