[Reopen][Inductor][CPU] Fuse SmoothQuant int8 linear pattern #142036
Xia-Weiwen wants to merge 7 commits into gh/Xia-Weiwen/23/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142036
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit f3d6c5a with merge base b576a8c.
BROKEN TRUNK - The following job failed but was also present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @jerryzh168 CI shows green. Would you like to import it? Thanks.
@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
Merge failed. Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/re-exporting the PR! Details for Dev Infra team: Raised by workflow job
Hi @jerryzh168 pytorchmergebot failed to merge it. Could you please import it again? Thanks.
Hi @jerryzh168 The failure looks unrelated. Could you please take a look and see if it can be imported again? Thanks.
Hi @jerryzh168 There is a dependent PR targeting the 2.6 branch cut. Could you please check if this PR can be imported and merged? Thanks.
@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi @jerryzh168 We would like this in 2.6. Could you please merge this PR once internal checks show green? Thanks.
yeah I'm trying to land this
@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@jerryzh168 Thanks a lot
Follow-up PR #142110:

### Summary
Extends #142036 so that the Inductor pattern matcher also covers the torchao API `int8_dynamic_activation_int8_weight` in the following scenario (inference-only, freezing enabled):
- Activation is int8 quantized symmetrically, per token.
- Weights are int8 quantized symmetrically, per channel, and statically (so the scales are constants; they would have been constant even with dynamic quantization, since the weights themselves are constant). The weights are also constant because freezing is enabled.

The pattern that is matched is `torch._int_mm` -> convert to FP32/BF16 -> [optional expand for activation scale] -> `mul` -> `mul`. We don't check whether the activation is dynamically quantized or whether the weights are statically quantized, since the replacement has no side effects even if that is not the case. In practice, it also matches the SmoothQuant int8 quantized linear pattern if its output is not reshaped (i.e., if the activation is 2D). A hedged usage sketch of how this pattern typically arises is shown after this message.

### More details
oneDNN int8 matmul supports applying a per-channel weight scale but not a vector activation scale, which could be applied as a post-op but is currently unsupported in ATen. Bias addition (which could be supported with an add post-op) is also left unfused. The fusion pattern used in this PR is therefore `torch._int_mm` -> convert to FP32/BF16 -> `mul`, which is replaced by the oneDNN qlinear op. The speedup over eager mode comes from two sources:
1. Fusion of the int8 x int8 -> int32 GEMM, the conversion to FP32/BF16, and the application of the weight scale (in the BF16 case, many intermediate conversions are also avoided).
2. The weight is pre-packed and cached by Inductor, so a reorder is avoided at run time.

In the future, the whole pattern (including the application of the activation scale, which would be a mul post-op) plus the bias could be fused, if corresponding support is enabled in ATen.

### Verification
Added UT in this PR:
```
python test/inductor/test_mkldnn_pattern_matcher.py -v -k test_da8w8_sym_act_sym_wgt_with_int_mm
```
#### Corresponding torchao UTs
1. int8 SmoothQuant legacy API: `TORCHINDUCTOR_FREEZING=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" python test/integration/test_integration.py -v -k test_non_dynamically_quantizable_linear`. The difference from #139595 is that there are no reshapes of the linear output in this pattern.
2. int8 da8w8 (dynamically, symmetrically quantized activation and statically quantized weights): `TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" TORCHINDUCTOR_FREEZING=1 python test/integration/test_integration.py -v -k test_int8_dynamic_quant_subclass_api_0_cpu`

Pull Request resolved: #142110
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #142036
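As context for the summary above, here is a hedged usage sketch of how the matched pattern typically arises and gets compiled. The model, shapes, and input below are illustrative placeholders, not taken from the PR; the sketch assumes torchao's `quantize_` / `int8_dynamic_activation_int8_weight` API and requires `TORCHINDUCTOR_FREEZING=1` (as in the commands above) so Inductor can constant-fold and pre-pack the weights.

```python
import torch
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

# Hypothetical model and input, for illustration only.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024, bias=False)).eval()
example_input = torch.randn(4, 1024)

# Dynamic per-token int8 activations + static per-channel int8 weights.
quantize_(model, int8_dynamic_activation_int8_weight())

# Compile for inference; run with TORCHINDUCTOR_FREEZING=1 so the int8 weights are
# frozen and pre-packed, allowing the qlinear fusion pass to kick in.
with torch.no_grad():
    compiled = torch.compile(model)
    out = compiled(example_input)
```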
ghstack-source-id: 426074b
Pull Request resolved: pytorch/pytorch#142036
Stack from ghstack (oldest at bottom):
Reopen of #139595
About the PR
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ an optional `add` for bias), with `reshape` and `convert_dtype` in between. This PR adds a pass to fuse the corresponding patterns:
- without bias: `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- with bias: `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only, with packed weight constants.
Note that `onednn.qlinear_pointwise` only supports a scalar activation scale, which is a limitation of the oneDNN library, so in that case we set the activation scale to 1 and the bias to none, and apply the scales and add the bias after `onednn.qlinear_pointwise`. A minimal sketch of the unfused pattern is given right after this description.
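As referenced above, here is a minimal eager-mode sketch of the unfused SmoothQuant int8 linear pattern that the pass matches. Tensor names, shapes, and the output dtype are illustrative assumptions, not taken from the PR; how the scales and bias are handled around the fused `onednn.qlinear_pointwise` op is described above.

```python
import torch

def smoothquant_int8_linear_ref(a_int8, a_scale, w_int8, w_scale, bias=None,
                                out_dtype=torch.bfloat16):
    """Unfused reference: reshape -> _int_mm -> convert -> mul -> mul -> reshape (+ add)."""
    # a_int8:  (..., k) int8 activation, symmetrically quantized per token
    # a_scale: (..., 1) per-token activation scale
    # w_int8:  (k, n)   int8 weight, symmetrically quantized per output channel
    # w_scale: (n,)     per-channel weight scale
    a_2d = a_int8.reshape(-1, a_int8.shape[-1])            # reshape to 2D for _int_mm
    acc = torch._int_mm(a_2d, w_int8)                      # int8 x int8 -> int32 GEMM
    y = acc.to(out_dtype)                                  # convert_element_type
    y = y * w_scale                                        # mul: weight scale
    y = y * a_scale.reshape(-1, 1)                         # (expand ->) mul: activation scale
    y = y.reshape(*a_int8.shape[:-1], w_int8.shape[-1])    # reshape back to original leading dims
    if bias is not None:
        y = y + bias                                       # optional bias add
    return y
```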
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
`TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

Test plan
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov
Differential Revision: D66796966