
[Reopen][Inductor][CPU] Fuse SmoothQuant int8 linear pattern #142036

Closed
Xia-Weiwen wants to merge 7 commits into gh/Xia-Weiwen/23/base from gh/Xia-Weiwen/23/head

Conversation

@Xia-Weiwen (Collaborator) commented Dec 4, 2024:

Stack from ghstack (oldest at bottom):

Reopen of #139595

About the PR
In torchao's SmoothQuant implementation, quantized linear is computed as `_int_mm(a, b)` followed by `mul(b_scale)` and `mul(a_scale)` (plus an optional `add` for bias), with `reshape` and `convert_element_type` ops in between.
This PR adds a pass that fuses the corresponding patterns (a sketch of the unfused computation follows the list):

  • (no bias) reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape
  • (with bias) pattern_no_bias -> add -> reshape -> reshape
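
For reference, here is a minimal sketch of the unfused computation the pass looks for. Names and shapes are illustrative assumptions, not torchao's exact code:

```python
import torch

def smooth_quant_linear_unfused(a_int8, w_int8, a_scale, w_scale, bias=None):
    # a_int8: [batch, seq, k] int8 activation; w_int8: [k, n] int8 weight
    # a_scale: [batch * seq, 1] per-token scale; w_scale: [n] per-channel scale
    a_2d = a_int8.reshape(-1, a_int8.shape[-1])             # reshape
    c = torch._int_mm(a_2d, w_int8)                         # int8 x int8 -> int32 GEMM
    c = c.to(torch.float32)                                 # convert_element_type
    c = c * a_scale.expand(c.shape)                         # expand -> mul (activation scale)
    c = c * w_scale                                         # mul (weight scale)
    if bias is not None:
        c = c + bias                                        # optional add for bias
    return c.reshape(*a_int8.shape[:-1], w_int8.shape[-1])  # reshape back
```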

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`; the latter is evaluated and constant-folded during Inductor's freezing pass, so the final graph contains only `onednn.qlinear_pointwise` ops with packed weight constants.

Note that `onednn.qlinear_pointwise` supports only a scalar activation scale, which is a limitation of the oneDNN library. When the activation scale is per-token, we therefore pass an activation scale of 1 and no bias to the fused op, and apply the scales and add the bias after `onednn.qlinear_pointwise`.
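
A hedged sketch of that fallback, where `qlinear_pointwise` is a stand-in callable for `torch.ops.onednn.qlinear_pointwise` (the real op takes more arguments than shown here):

```python
def qlinear_with_per_token_scale(a_int8, packed_w, a_scale, w_scale, bias, qlinear_pointwise):
    # oneDNN needs a scalar activation scale, so pass 1.0 and omit the bias...
    y = qlinear_pointwise(a_int8, 1.0, packed_w, w_scale, None)
    # ...then apply the per-token activation scale and the bias afterwards.
    y = y * a_scale
    if bias is not None:
        y = y + bias
    return y
```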

Validation results
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:

  • Model: EleutherAI/gpt-j-6b
  • Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
  • Using Intel OMP and Tcmalloc
  • Running the SmoothQuant example script from torchao: `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

Test plan

python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

Differential Revision: D66796966

@pytorch-bot bot commented Dec 4, 2024:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142036

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit f3d6c5a with merge base b576a8c:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the ciflow/inductor, module: cpu, module: inductor, and release notes: quantization labels Dec 4, 2024
Xia-Weiwen added a commit that referenced this pull request Dec 4, 2024
ghstack-source-id: 02eb032
Pull Request resolved: #142036
@Xia-Weiwen changed the title from "[Inductor][CPU] Fuse SmoothQuant int8 linear pattern" to "[Reopen][Inductor][CPU] Fuse SmoothQuant int8 linear pattern" Dec 4, 2024
@Xia-Weiwen added the intel and ciflow/trunk labels and removed the open source label Dec 4, 2024
@Xia-Weiwen (Collaborator, Author):

Hi @jerryzh168 CI shows green. Would you like to import it? Thanks.

@jerryzh168 (Contributor):

@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jerryzh168 (Contributor):

@pytorchbot merge

@pytorchmergebot (Collaborator):

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/re-exporting the PR!

Details for Dev Infra team: raised by workflow job

[ghstack-poisoned]
@Xia-Weiwen (Collaborator, Author):

> Merge failed
>
> Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/re-exporting the PR!
>
> Details for Dev Infra team

Hi @jerryzh168 It failed to merge via pytorchmergebot. Could you please import it again? Thanks.

@Xia-Weiwen (Collaborator, Author):

Hi @jerryzh168 The failure looks unrelated. Could you please take a look and see if it can be imported again? Thanks.

@Xia-Weiwen (Collaborator, Author):

Hi @jerryzh168 There is a dependent PR targeting the 2.6 branch cut. Could you please check if this PR can be imported and merged? Thanks.

@jerryzh168 (Contributor):

@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@Xia-Weiwen (Collaborator, Author):

> Hi @jerryzh168 There is a dependent PR targeting the 2.6 branch cut. Could you please check if this PR can be imported and merged? Thanks.

Hi @jerryzh168 We would like this in 2.6. Could you please merge this PR once internal checks show green? Thanks.

@jerryzh168 (Contributor):

Yeah, I'm trying to land this.

Xia-Weiwen added a commit that referenced this pull request Dec 10, 2024
ghstack-source-id: e5964c8
Pull Request resolved: #142036
Xia-Weiwen added a commit that referenced this pull request Dec 11, 2024
ghstack-source-id: 7b2a046
Pull Request resolved: #142036
@jerryzh168 (Contributor):

@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@Xia-Weiwen (Collaborator, Author):

@jerryzh168 Thanks a lot

pytorchmergebot pushed a commit that referenced this pull request Dec 13, 2024 (#142110):

### Summary

Extends #142036 with Inductor pattern matching covering the torchao API `int8_dynamic_activation_int8_weight` in the following scenario (inference-only, freezing enabled):

- Per-token, symmetrically quantized int8 activation.
- Per-channel, symmetrically quantized int8 weights, quantized statically (so the weight scales are constant; since freezing makes the weights constant anyway, the scales would have been constant even under dynamic quantization).

The pattern that's matched is `torch._int_mm` -> convert to FP32/BF16 -> [optional expand for activation scale] -> `mul` -> `mul`.

We don't check whether the activation is dynamically quantized or whether the weights are statically quantized, though, since the replacement has no side effects even if they are not.

In practice, it also matches the SmoothQuant int8 quantized linear pattern when its output is not reshaped (i.e., when the activation is 2D).
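
An illustrative sketch of the matched computation, under assumed shapes (not torchao's exact code):

```python
import torch

def da8w8_linear_unfused(a_int8, w_int8, a_scale, w_scale, out_dtype=torch.bfloat16):
    # a_int8: [m, k] per-token, symmetrically quantized int8 activation
    # w_int8: [k, n] per-channel, symmetrically quantized int8 weight
    y = torch._int_mm(a_int8, w_int8).to(out_dtype)  # int32 GEMM, then dtype conversion
    y = y * a_scale.expand(y.shape)                  # optional expand + mul (per-token scale)
    y = y * w_scale                                  # mul (per-channel weight scale)
    return y                                         # no trailing reshape: activation is 2D
```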

### More details

oneDNN int8 matmul supports applying a per-channel weight scale but not a vector activation scale; the latter could be applied as a post-op, but that is currently unsupported in ATen. Bias addition (which could be supported with an add post-op) is also left unfused.

The fusion pattern used in this PR is `torch._int_mm` -> convert to FP32/BF16 -> `mul`, which is replaced by the oneDNN qlinear op.

The speedup over eager mode comes from two sources:
1. fusion of the int8 x int8 -> int32 GEMM, the conversion to FP32/BF16, and the application of the weight scale (in the BF16 case, many intermediate conversions are also avoided);
2. the weight is pre-packed and cached by Inductor, so a reorder is avoided at run time.

In the future, the whole pattern (including application of the activation scale, which would be a mul post-op) plus the bias could be fused, once the corresponding support is enabled in ATen.

### Verification

Added a UT in this PR:
```
python test/inductor/test_mkldnn_pattern_matcher.py -v -k test_da8w8_sym_act_sym_wgt_with_int_mm
```

#### Corresponding torchao UTs

1. int8 SmoothQuant legacy API - `TORCHINDUCTOR_FREEZING=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" python test/integration/test_integration.py -v -k test_non_dynamically_quantizable_linear`.
The difference from #139595 is that there are no reshapes of the linear output in this pattern.

2. int8 da8w8 - symmetrically quantized activation (dynamic) & statically quantized weights - `TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" TORCHINDUCTOR_FREEZING=1 python test/integration/test_integration.py -v -k test_int8_dynamic_quant_subclass_api_0_cpu`

Pull Request resolved: #142110
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #142036
@Xia-Weiwen Xia-Weiwen deleted the gh/Xia-Weiwen/23/head branch December 14, 2024 12:51
Esquains pushed a commit to Esquains/study1 that referenced this pull request Dec 15, 2024

Labels

ciflow/inductor, ciflow/trunk, intel, Merged, module: cpu, module: inductor, open source, release notes: quantization

Projects

None yet

7 participants