
[Inductor][CPU] Fuse SmoothQuant int8 linear pattern#139595

Closed
Xia-Weiwen wants to merge 22 commits into gh/Xia-Weiwen/18/base from gh/Xia-Weiwen/18/head

Conversation

@Xia-Weiwen
Collaborator

@Xia-Weiwen Xia-Weiwen commented Nov 4, 2024

Stack from ghstack (oldest at bottom):

About the PR
In the Torchao implementation of SmoothQuant, the quantized linear layer is computed as `_int_mm(a, b)` followed by `mul(b_scale)`, `mul(a_scale)`, and an optional `add` for bias, with `reshape` and `convert_element_type` ops in between.
This PR adds a pass to fuse the corresponding patterns:

  • (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
  • (with bias) `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`; the latter is evaluated and constant-folded during Inductor's freezing pass. The final graph contains only `onednn.qlinear_pointwise` ops with packed weight constants.

Note that `onednn.qlinear_pointwise` does not support per-channel quantization of the activation (a limitation of the oneDNN library). In that case, we set the activation scale to 1 and the bias to none inside the fused op, and apply the scales and the bias addition after `onednn.qlinear_pointwise`.
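The per-channel workaround described above can be sketched as follows (a hypothetical helper; `qlinear_fused` stands in for the `onednn.qlinear_pointwise` call, and the keyword names are illustrative assumptions, not the real op signature):

```python
import torch

def qlinear_with_per_channel_act_scale(a_int8, packed_w, a_scale, b_scale,
                                       bias, qlinear_fused):
    # The fused oneDNN qlinear accepts only a per-tensor activation scale,
    # so pass 1.0 and drop the bias from the fused op...
    out = qlinear_fused(a_int8, packed_w, act_scale=1.0,
                        weight_scale=b_scale, bias=None)
    # ...then apply the per-channel/per-token activation scales and the
    # bias as separate elementwise ops after the fused linear.
    out = out * a_scale.reshape(-1, 1)
    if bias is not None:
        out = out + bias
    return out
```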

Validation results
Accuracy/perplexity is unchanged with or without this fusion pass.
Latency improves by more than 10% with the fusion pass.
Test method:

  • Model: EleutherAI/gpt-j-6b
  • Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
  • Using Intel OMP and TCMalloc
  • Running the SmoothQuant example script from Torchao: `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

Test plan

`python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm`

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

Differential Revision: D65702807

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Nov 4, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139595

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 6a9a6f9 with merge base 7dfb439:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor module: cpu CPU specific problem (e.g., perf, algorithm) module: inductor release notes: quantization release notes category labels Nov 4, 2024
Xia-Weiwen added a commit that referenced this pull request Nov 4, 2024
@Xia-Weiwen Xia-Weiwen marked this pull request as draft November 4, 2024 01:39
@Xia-Weiwen Xia-Weiwen added the intel This tag is for PR from Intel label Nov 4, 2024
[ghstack-poisoned]
@Xia-Weiwen Xia-Weiwen changed the title from "[Inductor][CPU] Fuse SmoothQuant pattern aronud _int_mm" to "[Inductor][CPU] Fuse SmoothQuant int8 linear pattern" Nov 4, 2024
[ghstack-poisoned]
[ghstack-poisoned]
Xia-Weiwen added a commit that referenced this pull request Nov 4, 2024
ghstack-source-id: dcd71ec
Pull Request resolved: #139595
@vadimkantorov
Contributor

vadimkantorov commented Nov 4, 2024

A bit related (without the mm): adding a core frontend/UX function for such scaled, saturated float<->int casts. A frontend function would help promote safe, correct, and optimizable idioms, and might also minimize the number of fusable mm patterns that need manual matching:

@Xia-Weiwen
Collaborator Author

A bit related (without the mm): adding a core frontend/UX function for such scaled, saturated float<->int casts. A frontend function would help promote safe, correct, and optimizable idioms, and might also minimize the number of fusable mm patterns that need manual matching:

Hi. It looks like you are asking for a frontend API for mul - add - clamp - convert? However, this PR is not intended for end users, and it is specific to the SmoothQuant graph pattern from Torchao.
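For context, one possible shape of such a scaled, saturating float->int8 cast helper is sketched below (the name and signature are hypothetical; this is not an existing PyTorch frontend API):

```python
import torch

def saturating_cast_to_int8(x: torch.Tensor, scale: float) -> torch.Tensor:
    # scale -> round -> clamp to the int8 range -> convert dtype
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
```

A dedicated op like this would let a compiler match a single call instead of a chain of mul/round/clamp/convert nodes.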

or
(with bias) pattern_no_bias -> add -> reshape -> reshape
"""
pattern_no_bias = CallFunction(
Contributor

wondering if #139102 will be able to help simplify the pattern as well

Collaborator Author

Thanks for the info. Can InvokeQuant represent the pattern of quantization or any pattern?

Contributor

it's mostly targeting dequant I think, see: #139102 (comment)

Collaborator Author

I see. Do you suggest waiting for that PR landing first then using InvokeQuant in this PR?

Contributor

no that's fine, I think we can revisit later

Collaborator Author

Got it. Thanks.

Collaborator

@leslie-fang-intel leslie-fang-intel left a comment


Overall LGTM

[ghstack-poisoned]
Xia-Weiwen added a commit that referenced this pull request Nov 5, 2024
ghstack-source-id: 6ec3a0d
Pull Request resolved: #139595
Xia-Weiwen added a commit that referenced this pull request Nov 28, 2024
ghstack-source-id: f463110
Pull Request resolved: #139595
@Xia-Weiwen
Collaborator Author

please skip and add a comment for this

Thanks. It's added. Please take a look.

@sanchitintel
Collaborator

TorchAO UT `test/integration/test_integration.py -v -k SmoothquantIntegrationTest.test_non_dynamically_quantizable_linear` has a slightly different pattern. I opened #141851 for it, using ghstack to include this PR's changes.

Thanks!

[ghstack-poisoned]
[ghstack-poisoned]
@Xia-Weiwen
Collaborator Author

Hi @jerryzh168 I have updated this PR according to latest comments, including skipping the UT in fbcode. If everything looks good to you, could you please import it again? Thanks.

@jerryzh168
Contributor

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Successfully rebased gh/Xia-Weiwen/18/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/139595)

@jerryzh168
Contributor

jerryzh168 commented Dec 4, 2024

@Xia-Weiwen I can't import the PR, even after rebase. It is saying: `error: patch failed: fbcode/caffe2/test/inductor/test_mkldnn_pattern_matcher.py:174 error: fbcode/caffe2/test/inductor/test_mkldnn_pattern_matcher.py: patch does not apply`. Maybe you can manually rebase on main and I can try again.

@Xia-Weiwen
Collaborator Author

@jerryzh168 Thanks for the info. Actually, I had rebased this PR before adding this comment: #139595 (comment). Let me try again.

[ghstack-poisoned]
@Xia-Weiwen
Collaborator Author

Hi @jerryzh168 I have rebased manually. Please try again. Thanks.

@jerryzh168
Contributor

@Xia-Weiwen thanks, still can't do it, maybe you can create a new PR and I can try again

@Xia-Weiwen
Collaborator Author

@Xia-Weiwen thanks, still can't do it, maybe you can create a new PR and I can try again

Hi @jerryzh168 I have created a new PR: #142036 Please have a try, thanks.


Labels

ci-no-td Do not run TD on this PR ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request intel This tag is for PR from Intel Merged module: cpu CPU specific problem (e.g., perf, algorithm) module: inductor open source release notes: quantization release notes category Reverted


10 participants