Optimization of Backward Implementation for Learnable Fake Quantize Per Tensor Kernels (CPU) #42384

Closed

paulshaoyuqiao wants to merge 1 commit into pytorch:master
Conversation
Contributor

This pull request was exported from Phabricator. Differential Revision: D22875998
💊 CI failures summary and remediations

As of commit 47477ca: ci.pytorch.org: 1 failed. This comment was automatically generated by Dr. CI.
Contributor

This pull request was exported from Phabricator. Differential Revision: D22875998
Force-pushed from 4104714 to af87bbe
Contributor

This pull request was exported from Phabricator. Differential Revision: D22875998
Force-pushed from af87bbe to f9d720c

Force-pushed from f9d720c to 6d35294
Contributor

This pull request was exported from Phabricator. Differential Revision: D22875998
Optimization of Backward Implementation for Learnable Fake Quantize Per Tensor Kernels (CPU and GPU) (pytorch#42384)

Summary:
Pull Request resolved: pytorch#42384

In this diff, the original backward pass implementation is sped up by merging the three separate iterations that compute dX, dScale, and dZeroPoint into a single pass. A native loop operates directly at the byte level (addressed via `strides`). In the operator benchmark, for an input of shape `3x3x256x256`, we observed the following runtimes:

- original Python operator: 1021037 microseconds
- original learnable kernel: 407576 microseconds
- optimized learnable kernel: 102584 microseconds
- original non-backprop kernel: 139806 microseconds

**Speedup over the Python operator**: ~10x
**Speedup over the original learnable kernel**: ~4x
**Speedup over the non-backprop kernel**: ~1.4x

Test Plan:
To verify correctness of the new kernel, on a devvm, run `buck test //caffe2/test:quantization -- learnable_backward_per_tensor`.

To benchmark the operators, on a devvm:

1. Set the input size to `3x3x256x256` or another reasonable shape.
2. Run `buck test //caffe2/benchmarks/operator_benchmark/pt:quantization_test`.
3. The relevant outputs are as follows:

```
# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 1021036.957

# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 102583.693

# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 139806.086
```

Reviewed By: vkuzo

Differential Revision: D22875998

fbshipit-source-id: 96fc9511f935030756f2c11310ee3888abe89dca
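For context on what the merged loop computes, the sketch below is a minimal Python reference for the three gradients, assuming the standard straight-through / LSQ-style formulas used by the learnable fake-quantize operator (with `scale` and `zero_point` taken as plain floats and the `grad_factor` scaling omitted). It is illustrative only; the actual kernel walks raw bytes via `strides` in C++.

```python
import torch

def learnable_fake_quant_backward_reference(dY, X, scale, zero_point,
                                            quant_min, quant_max):
    """Reference gradients for learnable fake-quantize (per tensor).

    All three gradients are derived from the same intermediates in one
    elementwise pass, which is what makes a single merged loop possible.
    """
    inv_scale = 1.0 / scale
    # Quantized integer value before clamping.
    Xq = torch.round(zero_point + X * inv_scale)
    in_range = (Xq >= quant_min) & (Xq <= quant_max)

    # dX: straight-through estimator, zeroed outside the clamp range.
    dX = dY * in_range

    # Fake-quantized output, needed for the in-range dScale term.
    Xfq = (torch.clamp(Xq, quant_min, quant_max) - zero_point) * scale

    # dScale: boundary terms outside the range, rounding-error term inside.
    dScale_elem = torch.where(
        Xq < quant_min, (quant_min - zero_point) * dY,
        torch.where(Xq > quant_max, (quant_max - zero_point) * dY,
                    (Xfq - X) * inv_scale * dY))

    # dZeroPoint: -scale outside the clamp range, zero inside.
    dZeroPoint_elem = torch.where(in_range, torch.zeros_like(dY), -scale * dY)

    # scale and zero_point are scalars, so their gradients reduce over all elements.
    return dX, dScale_elem.sum(), dZeroPoint_elem.sum()
```

Because dX, dScale, and dZeroPoint all derive from the same intermediates (`Xq`, `Xfq`), a single fused iteration can emit all three per element, which is where the ~4x win over the original three-pass learnable kernel comes from.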
Force-pushed from 6d35294 to 47477ca
Contributor

This pull request was exported from Phabricator. Differential Revision: D22875998
Contributor

This pull request has been merged in 9152f2f.
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request on Apr 24, 2026
Optimization of Backward Implementation for Learnable Fake Quantize Per Tensor Kernels (CPU and GPU) (pytorch#42384)

The landed commit message matches the one above, with GPU benchmark output added:

```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cuda, op_type: py_module
Backward Execution Time (us) : 6548.350

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cuda, op_type: learnable_kernel
Backward Execution Time (us) : 1340.724

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cuda, op_type: original_kernel
Backward Execution Time (us) : 656.863
```

Reviewed By: vkuzo

Differential Revision: D22875998

fbshipit-source-id: cfcd62c327bb622270a783d2cbe97f00508c4a16
Summary:
In this diff, the original backward pass implementation is sped up by merging the three separate iterations that compute dX, dScale, and dZeroPoint into a single pass. A native loop operates directly at the byte level (addressed via `strides`). In the operator benchmark, for an input of shape `3x3x256x256`, we observed the following improvements in performance:

- Speedup over the Python operator: 995%
- Speedup over the original learnable kernel: 397%
- Speedup over the non-backprop kernel: 26.2%

Test Plan:
To verify correctness of the new kernel, on a devvm, run `buck test //caffe2/test:quantization -- learnable_backward_per_tensor`.

To benchmark the operators, on a devvm, run `buck test //caffe2/benchmarks/operator_benchmark/pt:quantization_test`.

Differential Revision: D22875998
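For readers without access to a devvm, the kernel can also be exercised directly. The sketch below assumes the private ATen entry point `torch._fake_quantize_learnable_per_tensor_affine` (an internal name whose signature may differ across PyTorch versions) and uses the benchmark's `3x3x256x256` input with a 4-bit quantization range:

```python
import torch

# Input matching the benchmark shape; scale and zero point are learnable.
X = torch.randn(3, 3, 256, 256, requires_grad=True)
scale = torch.tensor([0.1], requires_grad=True)
zero_point = torch.tensor([0.0], requires_grad=True)

# nbits4 in the benchmark corresponds to the quantization range [0, 15].
Y = torch._fake_quantize_learnable_per_tensor_affine(X, scale, zero_point, 0, 15)

# Backpropagating through the op runs the fused backward kernel,
# producing dX, dScale, and dZeroPoint in one pass.
Y.sum().backward()
print(X.grad.shape, scale.grad, zero_point.grad)
```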