[Inductor][CPU] Add torchao da8w8 pattern with sym quantized act & wgt #142015

sanchitintel wants to merge 4 commits into gh/sanchitintel/3/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142015
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 unrelated failure.) As of commit 7dc7fa7 with merge base 7dfb439. UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
# In practice, though, they may also match the smooth-quant pattern when a 2D input shape is used.
# Since add is not currently being used as a oneDNN post-op, but is unfused, we don't need these patterns with bias.
# Ideally, we should add mul + add post-op support in the ATen int8 oneDNN linear op.
pattern1_with_no_outer_or_act_reshape = get_pattern_no_bias(
I have a question regarding the bias. In @Xia-Weiwen's base PR, it handles 2 cases:
- Case 1: when the activation is per-tensor quantized, the bias can be one of the inputs to qlinear.
- Case 2: when the activation is per-channel quantized, the bias can't be fused and will exist as an epilogue.
But here in this PR, we only register the pattern without bias, which may cause a difference for case 1. May I know the reason for this difference?
> But here in this PR, we only register the pattern without bias, which may cause a difference for case 1. May I know the reason for this difference?
The torchao int8_dynamic_activation_int8_weight API supports per-token quantization of the activation, not a scalar activation scale.
Please refer to https://github.com/pytorch/ao/blob/1a0dbf1c41ad1c6f28d6501e1134b30ea2f2590d/torchao/quantization/quant_api.py#L741-L746
Basically, in the case of int8_dynamic_activation_int8_weight, the activation scale is a vector.
@leslie-fang-intel, if the case of smooth-quant with a 2D activation & a scalar activation scale is to be supported, then bias would also have to be in the pattern. Please let me know if support for it also needs to be added.
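To illustrate why the activation scale is a vector here, below is a minimal sketch of symmetric per-token dynamic quantization. The function name and shapes are illustrative, not torchao code: the point is that each token (row) gets its own scale, so the scale tensor has shape (M, 1) rather than being a scalar.

```python
import torch

def per_token_sym_quant(x: torch.Tensor):
    # Symmetric per-token quantization: one scale per row of the activation,
    # so the resulting scale is a vector of shape (M, 1), not a scalar.
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    x_int8 = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return x_int8, scale

x = torch.randn(4, 8)
x_int8, act_scale = per_token_sym_quant(x)
assert act_scale.shape == (4, 1)  # vector of per-token scales
```

Because the scale differs per row, it cannot be folded into a scalar output scale of the int8 matmul, which is why this pattern differs from the per-tensor case.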
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA: 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
The base ghstack PR changed, so a new PR needs to be created.
Opened #142110 with the new base PR.
Summary
Extends #139595 with an Inductor pattern match covering the torchao API int8_dynamic_activation_int8_weight in the following scenario (inference-only, freezing enabled).

The pattern that's matched is torch._int_mm -> convert to FP32/BF16 -> [optional expand for activation scale] -> mul -> mul. We don't check whether the activation is dynamically quantized or whether the weights are statically quantized, though (the implementation has no side effects even if that weren't true).

In practice, it also matches the smooth-quant int8 quantized linear pattern if its output is not reshaped (i.e., if the activation is 2D).
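The matched computation can be sketched as follows. Shapes and tensor names are illustrative, and torch._int_mm is emulated here with a plain int32 matmul so the snippet runs on any build; the real pattern uses torch._int_mm itself.

```python
import torch

# Illustrative shapes for a per-token/per-channel da8w8 linear.
M, K, N = 4, 8, 16
x_int8 = torch.randint(-128, 128, (M, K), dtype=torch.int8)
w_int8 = torch.randint(-128, 128, (K, N), dtype=torch.int8)
act_scale = torch.rand(M, 1)  # per-token activation scale (vector)
w_scale = torch.rand(N)       # per-channel weight scale

c = x_int8.to(torch.int32) @ w_int8.to(torch.int32)  # stands in for torch._int_mm
y = c.to(torch.float32)                              # convert to FP32/BF16
y = y * act_scale.expand(M, N)                       # optional expand + first mul
y = y * w_scale                                      # second mul
```

This is exactly the op sequence the pattern matcher looks for after freezing: an int8 matmul, a dtype conversion, and two multiplications applying the activation and weight scales.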
More details
oneDNN int8 matmul supports applying a per-channel weight scale, but not a vector activation scale. The latter could be applied as a post-op, but that is currently unsupported in ATen. Bias addition (which could be supported with an add post-op) is also unfused.
The fusion pattern used in this PR is torch._int_mm -> convert to FP32/BF16 -> mul, which is replaced by the oneDNN qlinear op. The speedup over eager mode is due to 2 reasons -

In the future, the whole pattern (including application of the activation scale, which would be a mul post-op) plus bias could be fused if the corresponding support were enabled in ATen.
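The resulting split between fused and unfused work can be sketched as follows. Here qlinear_sketch is a hypothetical stand-in for the oneDNN qlinear op (not the real ATen kernel): it fuses the int8 matmul with the per-channel weight scale, while the activation-scale mul and the bias add remain separate ops in the graph today.

```python
import torch

def qlinear_sketch(x_int8, w_int8, w_scale):
    # Hypothetical stand-in for the oneDNN qlinear op that replaces the
    # matched pattern: int8 matmul with the per-channel weight scale fused in.
    acc = x_int8.to(torch.int32) @ w_int8.to(torch.int32)
    return acc.to(torch.float32) * w_scale

M, K, N = 2, 4, 3
x_int8 = torch.randint(-128, 128, (M, K), dtype=torch.int8)
w_int8 = torch.randint(-128, 128, (K, N), dtype=torch.int8)
w_scale = torch.rand(N)
act_scale = torch.rand(M, 1)
bias = torch.rand(N)

y = qlinear_sketch(x_int8, w_int8, w_scale)  # fused part (replaces the pattern)
y = y * act_scale  # would become a mul post-op if ATen supported it
y = y + bias       # today: an unfused epilogue add
```

If ATen gained mul + add post-op support for the int8 oneDNN linear op, the last two lines could fold into the fused call as well.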
Verification
Added a UT in this PR.
Corresponding torchao UTs:
int8 Smoothquant legacy API -

TORCHINDUCTOR_FREEZING=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" python test/integration/test_integration.py -v -k test_non_dynamically_quantizable_linear

The difference from [Inductor][CPU] Fuse SmoothQuant int8 linear pattern #139595 is that there are no reshapes of the linear output in this pattern.
int8 da8w8 - symmetrically quantized activation (dynamic) & statically quantized weights -

TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" TORCHINDUCTOR_FREEZING=1 python test/integration/test_integration.py -v -k test_int8_dynamic_quant_subclass_api_0_cpu

Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov