[Inductor][CPU] Fuse SmoothQuant int8 linear pattern #139595

Xia-Weiwen wants to merge 22 commits into gh/Xia-Weiwen/18/base from
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139595
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (2 unrelated failures) As of commit 6a9a6f9 with merge base 7dfb439.
FLAKY - The following job failed but was likely due to flakiness present on trunk.
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
A bit related (without the mm): it would be useful to add a core frontend/UX function for such scaled, saturated float<->int casts. A frontend function helps promote safe, correct, and fast/optimizable idioms, and could also minimize the number of manual fusable patterns with mms that are needed.
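For illustration, such a scaled, saturated float-to-int cast idiom could be sketched as follows (a minimal sketch; `saturated_int8_cast` is a hypothetical helper name, not an existing PyTorch API):

```python
import torch

def saturated_int8_cast(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: divide by scale, round, saturate to the int8
    # value range [-128, 127], then cast. This is the idiom the comment
    # above suggests exposing as a single frontend function.
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
```

Having this as one call (rather than four chained ops) would also give pattern matchers a single node to recognize instead of a multi-op idiom.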
Hi. It looks like you are looking for a frontend API for
or
(with bias) pattern_no_bias -> add -> reshape -> reshape
"""
pattern_no_bias = CallFunction(
wondering if #139102 will be able to help simplify the pattern as well
Thanks for the info. Can InvokeQuant represent only the pattern of quantization, or any pattern?
it's mostly targeting dequant I think, see: #139102 (comment)
I see. Do you suggest waiting for that PR to land first and then using InvokeQuant in this PR?
no that's fine, I think we can revisit later
Thanks. It's added. Please take a look.
TorchAO UT. Thanks!
Hi @jerryzh168. I have updated this PR according to the latest comments, including skipping the UT in fbcode. If everything looks good to you, could you please import it again? Thanks.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
@Xia-Weiwen I can't import the PR; even after the rebase, it is saying:
@jerryzh168 Thanks for the info. I had actually rebased this PR before I added this comment: #139595 (comment). Let me try again.
Hi @jerryzh168. I have rebased manually. Please try again. Thanks.
@Xia-Weiwen Thanks. I still can't do it; maybe you can create a new PR and I can try again.
Hi @jerryzh168. I have created a new PR: #142036. Please have a try, thanks.
Stack from ghstack (oldest at bottom):
About the PR
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ an optional `add` for bias), with `reshape` and `convert_dtype` in between. This PR adds a pass to fuse the corresponding patterns:

- (without bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`
The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`; the latter is evaluated and frozen during the freezing process of Inductor. The final graph contains only `onednn.qlinear_pointwise` with packed weight constants.

Note that `onednn.qlinear_pointwise` does not support per-channel quantization of activation, which is a limitation of the oneDNN library, so in that case we set the activation scale to 1 and the bias to none, and apply the scales and add the bias after `onednn.qlinear_pointwise`.

Validation results
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method: `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

Test plan
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov
Differential Revision: D65702807