
Optimization of Backward Implementation for Learnable Fake Quantize Per Tensor Kernels (CPU)#42384

Closed
paulshaoyuqiao wants to merge 1 commit into pytorch:master from paulshaoyuqiao:export-D22875998

Conversation

@paulshaoyuqiao

Summary:
In this diff, the original backward pass implementation is sped up by merging the three iterations that separately compute dX, dScale, and dZeroPoint into a single pass. The fused kernel uses a native loop that walks the tensors directly at the byte level (indexed by strides).
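
For context, the three per-element gradients the kernel now produces in one pass follow the standard straight-through-estimator formulation of learnable fake quantization. A minimal PyTorch sketch of that math, with illustrative names rather than the kernel's actual symbols:

```python
import torch

def fused_backward_sketch(dY, X, scale, zero_point, quant_min, quant_max):
    """Compute dX, dScale, and dZeroPoint in a single pass over X and dY,
    mirroring the fusion described above (illustrative, not the kernel code)."""
    Xq = torch.round(X / scale) + zero_point                      # pre-clamp quantized value
    in_range = (Xq >= quant_min) & (Xq <= quant_max)
    Xfq = (Xq.clamp(quant_min, quant_max) - zero_point) * scale   # fake-quantized X

    # dX: the straight-through estimator passes the gradient only where
    # no clamping occurred.
    dX = dY * in_range.to(dY.dtype)

    # dScale: (Xfq - X) / scale in range; (quant_min - zero_point) below
    # and (quant_max - zero_point) above the representable range.
    dScale = torch.where(
        in_range, (Xfq - X) / scale,
        torch.where(Xq < quant_min,
                    torch.full_like(X, quant_min - zero_point),
                    torch.full_like(X, quant_max - zero_point)))

    # dZeroPoint: zero in range, -scale where the value was clamped.
    dZeroPoint = torch.where(in_range,
                             torch.zeros_like(X),
                             torch.full_like(X, -scale))

    return dX, (dY * dScale).sum(), (dY * dZeroPoint).sum()
```

Because all three branches depend on the same per-element quantities (Xq, the range test, Xfq), fusing them means X and dY are read once instead of three times, which is where the bulk of the speedup comes from.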

In the operator benchmark, for an input of shape 3x3x256x256, we observed the following backward times:

  • original python operator: 1021037 microseconds
  • original learnable kernel: 407576 microseconds
  • optimized learnable kernel: 102584 microseconds
  • original non-backprop kernel: 139806 microseconds

Speedup over the python operator: ~10x (995%)
Speedup over the original learnable kernel: ~4x (397%)
Speedup over the non-backprop kernel: ~1.4x (a ~27% reduction in backward time)
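
The ratios follow directly from the measured times:

```python
# Speedups derived from the measured backward times above (microseconds).
times_us = {
    "py_module": 1021037,
    "learnable_kernel_old": 407576,
    "learnable_kernel_new": 102584,
    "non_backprop_kernel": 139806,
}
for name in ("py_module", "learnable_kernel_old", "non_backprop_kernel"):
    print(f"{name}: {times_us[name] / times_us['learnable_kernel_new']:.2f}x")
# py_module: 9.95x, learnable_kernel_old: 3.97x, non_backprop_kernel: 1.36x
```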

Test Plan:
To verify correctness of the new kernel, run the following on a devvm:

buck test //caffe2/test:quantization -- learnable_backward_per_tensor
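
Outside of buck, the same path can be smoke-tested directly. A minimal sketch, assuming `torch._fake_quantize_learnable_per_tensor_affine` is the Python binding these kernels back (the PR does not spell out the op name):

```python
import torch

# Hypothetical quick check: run the learnable fake-quantize backward pass
# and confirm that all three gradients are produced in one call.
X = torch.randn(3, 3, 256, 256, requires_grad=True)
scale = torch.tensor([0.1], requires_grad=True)
zero_point = torch.tensor([0.0], requires_grad=True)

Y = torch._fake_quantize_learnable_per_tensor_affine(X, scale, zero_point, 0, 255)
Y.sum().backward()

print(X.grad.shape, scale.grad, zero_point.grad)  # dX, dScale, dZeroPoint
```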

To benchmark the operators, run the following on a devvm:

buck test //caffe2/benchmarks/operator_benchmark/pt:quantization_test
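
For a rough reading outside the operator_benchmark harness, the backward pass can also be timed by hand. A sketch under the same assumptions as above (op name, and the 3x3x256x256 shape from the test plan):

```python
import time
import torch

X = torch.randn(3, 3, 256, 256, requires_grad=True)
scale = torch.tensor([0.1], requires_grad=True)
zero_point = torch.tensor([0.0], requires_grad=True)

def backward_once():
    Y = torch._fake_quantize_learnable_per_tensor_affine(X, scale, zero_point, 0, 255)
    Y.sum().backward()
    X.grad = scale.grad = zero_point.grad = None  # reset between iterations

backward_once()  # warm-up
start = time.perf_counter()
for _ in range(100):
    backward_once()
print(f"mean backward time: {(time.perf_counter() - start) / 100 * 1e6:.0f} us")
```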

Differential Revision: D22875998

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D22875998


dr-ci Bot commented Jul 31, 2020

💊 CI failures summary and remediations

As of commit 47477ca (more details on the Dr. CI page):


  • 1/1 failures possibly introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed




Optimization of Backward Implementation for Learnable Fake Quantize Per Tensor Kernels (CPU and GPU) (pytorch#42384)

Summary:
Pull Request resolved: pytorch#42384

In this diff, the original backward pass implementation is sped up by merging the three iterations that separately compute dX, dScale, and dZeroPoint into a single pass. The fused kernel uses a native loop that walks the tensors directly at the byte level (indexed by `strides`).

In the operator benchmark, for an input of shape `3x3x256x256`, we observed the following backward times:
- original python operator: 1021037 microseconds
- original learnable kernel: 407576 microseconds
- optimized learnable kernel: 102584 microseconds
- original non-backprop kernel: 139806 microseconds

**Speedup over python operator**: ~10x
**Speedup over original learnable kernel**: ~4x
**Speedup over non-backprop kernel**: ~1.4x

Test Plan:
To verify correctness of the new kernel, run the following on a devvm:

`buck test //caffe2/test:quantization -- learnable_backward_per_tensor`

To benchmark the operators, on a devvm:
1. Set the input shape to 3x3x256x256 or another reasonable size.
2. Run `buck test //caffe2/benchmarks/operator_benchmark/pt:quantization_test`
3. The relevant outputs are as follows:

```
# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 1021036.957

# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 102583.693

# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 139806.086
```

Reviewed By: vkuzo

Differential Revision: D22875998

fbshipit-source-id: 96fc9511f935030756f2c11310ee3888abe89dca

@facebook-github-bot

This pull request has been merged in 9152f2f.

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Optimization of Backward Implementation for Learnable Fake Quantize Per Tensor Kernels (CPU and GPU) (pytorch#42384)

Summary:
Pull Request resolved: pytorch#42384

In this diff, the original backward pass implementation is sped up by merging the three iterations that separately compute dX, dScale, and dZeroPoint into a single pass. The fused kernel uses a native loop that walks the tensors directly at the byte level (indexed by `strides`).

In the operator benchmark, for an input of shape `3x3x256x256`, we observed the following backward times:
- original python operator: 1021037 microseconds
- original learnable kernel: 407576 microseconds
- optimized learnable kernel: 102584 microseconds
- original non-backprop kernel: 139806 microseconds

**Speedup over python operator**: ~10x
**Speedup over original learnable kernel**: ~4x
**Speedup over non-backprop kernel**: ~1.4x

Test Plan:
To verify correctness of the new kernel, run the following on a devvm:

`buck test //caffe2/test:quantization -- learnable_backward_per_tensor`

To benchmark the operators, on a devvm:
1. Set the input shape to 3x3x256x256 or another reasonable size.
2. Run `buck test //caffe2/benchmarks/operator_benchmark/pt:quantization_test`
3. The relevant outputs are as follows:

(CPU)
```
# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 1021036.957

# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 102583.693

# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 139806.086
```

(GPU)
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cuda, op_type: py_module
Backward Execution Time (us) : 6548.350

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cuda, op_type: learnable_kernel
Backward Execution Time (us) : 1340.724

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cuda, op_type: original_kernel
Backward Execution Time (us) : 656.863
```

Reviewed By: vkuzo

Differential Revision: D22875998

fbshipit-source-id: cfcd62c327bb622270a783d2cbe97f00508c4a16