
[Inductor][CPU] Fuse SmoothQuant int8 linear pattern#139595

Closed
Xia-Weiwen wants to merge 22 commits into gh/Xia-Weiwen/18/base from gh/Xia-Weiwen/18/head

Conversation

@Xia-Weiwen
Collaborator

@Xia-Weiwen Xia-Weiwen commented Nov 4, 2024

Stack from ghstack (oldest at bottom):

About the PR
In the Torchao implementation of SmoothQuant, the quantized linear layer is computed as `_int_mm(a, b)` followed by `mul(b_scale)`, `mul(a_scale)`, and an optional `add` for bias, with `reshape` and `convert_element_type` ops in between.
This PR adds a pass to fuse the corresponding patterns:

  • (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
  • (with bias) `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`; the latter is evaluated and constant-folded during Inductor's freezing pass. The final graph contains only `onednn.qlinear_pointwise` ops with packed weight constants.

Note that `onednn.qlinear_pointwise` does not support per-channel quantization of the activation (a limitation of the oneDNN library). In that case, we set the activation scale to 1 and the bias to none inside the fused op, and apply the scales and the bias addition after `onednn.qlinear_pointwise`.
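The per-channel workaround described above can be sketched as follows (a hypothetical helper; `qlinear_fused` stands in for the `onednn.qlinear_pointwise` call, and the keyword names are illustrative assumptions, not the real op signature):

```python
import torch

def qlinear_with_per_channel_act_scale(a_int8, packed_w, a_scale, b_scale,
                                       bias, qlinear_fused):
    # The fused oneDNN qlinear accepts only a per-tensor activation scale,
    # so pass 1.0 and drop the bias from the fused op...
    out = qlinear_fused(a_int8, packed_w, act_scale=1.0,
                        weight_scale=b_scale, bias=None)
    # ...then apply the per-channel/per-token activation scales and the
    # bias as separate elementwise ops after the fused linear.
    out = out * a_scale.reshape(-1, 1)
    if bias is not None:
        out = out + bias
    return out
```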

Validation results
Accuracy/perplexity is unchanged with or without this fusion pass.
Latency improves by more than 10% with the fusion pass.
Test method:

  • Model: EleutherAI/gpt-j-6b
  • Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
  • Using Intel OMP and TCMalloc
  • Running the SmoothQuant example script from Torchao: `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

Test plan

`python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm`

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

Differential Revision: D65702807

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Nov 4, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139595

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 6a9a6f9 with merge base 7dfb439:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor module: cpu CPU specific problem (e.g., perf, algorithm) module: inductor release notes: quantization release notes category labels Nov 4, 2024
Xia-Weiwen added a commit that referenced this pull request Nov 4, 2024
@Xia-Weiwen Xia-Weiwen marked this pull request as draft November 4, 2024 01:39
@Xia-Weiwen Xia-Weiwen added the intel This tag is for PR from Intel label Nov 4, 2024
[ghstack-poisoned]
@Xia-Weiwen Xia-Weiwen changed the title from "[Inductor][CPU] Fuse SmoothQuant pattern aronud _int_mm" to "[Inductor][CPU] Fuse SmoothQuant int8 linear pattern" Nov 4, 2024
[ghstack-poisoned]
[ghstack-poisoned]
Xia-Weiwen added a commit that referenced this pull request Nov 4, 2024
ghstack-source-id: dcd71ec
Pull Request resolved: #139595
@vadimkantorov
Contributor

vadimkantorov commented Nov 4, 2024

A bit related (without the mm): adding a core frontend/UX function for such scaled, saturated float<->int casts. A frontend function would help promote safe, correct, and optimizable idioms, and might also minimize the number of fusable mm patterns that need manual matching:

@Xia-Weiwen
Collaborator Author

A bit related (without the mm): adding a core frontend/UX function for such scaled, saturated float<->int casts. A frontend function would help promote safe, correct, and optimizable idioms, and might also minimize the number of fusable mm patterns that need manual matching:

Hi. It looks like you are asking for a frontend API for mul - add - clamp - convert? However, this PR is not intended for end users, and it is specific to the SmoothQuant graph pattern from Torchao.
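For context, one possible shape of such a scaled, saturating float->int8 cast helper is sketched below (the name and signature are hypothetical; this is not an existing PyTorch frontend API):

```python
import torch

def saturating_cast_to_int8(x: torch.Tensor, scale: float) -> torch.Tensor:
    # scale -> round -> clamp to the int8 range -> convert dtype
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
```

A dedicated op like this would let a compiler match a single call instead of a chain of mul/round/clamp/convert nodes.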

or
(with bias) pattern_no_bias -> add -> reshape -> reshape
"""
pattern_no_bias = CallFunction(
Contributor

wondering if #139102 will be able to help simplify the pattern as well

Collaborator Author

Thanks for the info. Can InvokeQuant represent the pattern of quantization or any pattern?

Contributor

it's mostly targeting dequant I think, see: #139102 (comment)

Collaborator Author

I see. Do you suggest waiting for that PR landing first then using InvokeQuant in this PR?

Contributor

no that's fine, I think we can revisit later

Collaborator Author

Got it. Thanks.

Collaborator

@leslie-fang-intel leslie-fang-intel left a comment


Overall LGTM

[ghstack-poisoned]
Xia-Weiwen added a commit that referenced this pull request Nov 5, 2024
ghstack-source-id: 6ec3a0d
Pull Request resolved: #139595
Xia-Weiwen added a commit that referenced this pull request Nov 28, 2024
ghstack-source-id: f463110
Pull Request resolved: #139595
@Xia-Weiwen
Collaborator Author

please skip and add a comment for this

Thanks. It's added. Please take a look.

@sanchitintel
Collaborator

TorchAO UT `test/integration/test_integration.py -v -k SmoothquantIntegrationTest.test_non_dynamically_quantizable_linear` has a slightly different pattern. I opened #141851 for it, using ghstack to include this PR's changes.

Thanks!

[ghstack-poisoned]
[ghstack-poisoned]
@Xia-Weiwen
Collaborator Author

Hi @jerryzh168 I have updated this PR according to latest comments, including skipping the UT in fbcode. If everything looks good to you, could you please import it again? Thanks.

@jerryzh168
Contributor

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Successfully rebased gh/Xia-Weiwen/18/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/139595)

@jerryzh168
Contributor

jerryzh168 commented Dec 4, 2024

@Xia-Weiwen I can't import the PR, even after rebase. It is saying: `error: patch failed: fbcode/caffe2/test/inductor/test_mkldnn_pattern_matcher.py:174 error: fbcode/caffe2/test/inductor/test_mkldnn_pattern_matcher.py: patch does not apply`. Maybe you can manually rebase on main and I can try again.

@Xia-Weiwen
Collaborator Author

@jerryzh168 Thanks for the info. Actually, I had rebased this PR before adding this comment: #139595 (comment). Let me try again.

[ghstack-poisoned]
@Xia-Weiwen
Collaborator Author

Hi @jerryzh168 I have rebased manually. Please try again. Thanks.

@jerryzh168
Contributor

@Xia-Weiwen thanks, still can't do it, maybe you can create a new PR and I can try again

@Xia-Weiwen
Collaborator Author

@Xia-Weiwen thanks, still can't do it, maybe you can create a new PR and I can try again

Hi @jerryzh168 I have created a new PR: #142036 Please have a try, thanks.


Labels

ci-no-td Do not run TD on this PR ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request intel This tag is for PR from Intel Merged module: cpu CPU specific problem (e.g., perf, algorithm) module: inductor open source release notes: quantization release notes category Reverted


10 participants