[Inductor][CPU] Add torchao da8w8 pattern with sym quantized act & wgt#142015

Closed
sanchitintel wants to merge 4 commits into gh/sanchitintel/3/base from gh/sanchitintel/3/head

Conversation

@sanchitintel
Collaborator

@sanchitintel sanchitintel commented Dec 3, 2024

Summary

Extends #139595 by adding an Inductor pattern-matching pattern for the torchao API int8_dynamic_activation_int8_weight in the following scenario (inference-only, freezing enabled):

  • Activation is int8 quantized symmetrically, per token.
  • Weights are per-channel int8 quantized symmetrically and statically, so their scales are constant (although with freezing enabled, the weights are constant anyway, so the scales would have been constant even with dynamic quantization).
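
The quantization scheme above can be sketched as follows (a minimal illustration with hypothetical helper names, not torchao's actual implementation):

```python
import torch

def quantize_per_token_sym(x: torch.Tensor):
    # Symmetric per-token quantization: one scale per row (token),
    # derived from the row's absolute maximum.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def quantize_per_channel_sym(w: torch.Tensor):
    # Symmetric per-channel quantization: one scale per output channel.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

x = torch.randn(4, 16)   # activation: 4 tokens, 16 features
w = torch.randn(8, 16)   # weight: 8 output channels, 16 input features
xq, xs = quantize_per_token_sym(x)
wq, ws = quantize_per_channel_sym(w)
```

With freezing enabled, the quantized weight and its scales become compile-time constants, which is what lets Inductor pre-pack and cache the weight.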

The pattern that's matched is torch._int_mm -> convert to FP32/BF16 -> [optional expand for activation scale] -> mul -> mul.

We don't check whether the activation is dynamically quantized or whether the weights are statically quantized, though, since the implementation has no side-effects even if that isn't the case.

In practice, it also matches the SmoothQuant int8 quantized linear pattern if its output is not reshaped (i.e., when the activation is 2D).

More details

oneDNN int8 matmul supports applying a per-channel weight scale, but not a vector activation scale. The latter could be applied as a post-op, but that is currently unsupported in ATen. Bias addition (which could be supported with an add post-op) is also unfused.

The fusion pattern used in this PR is torch._int_mm -> convert to FP32/BF16 -> mul, which is replaced by the oneDNN qlinear op.

The speedup over eager mode comes from two sources:

  1. fusion of the int8 x int8 -> int32 GEMM, the conversion to FP32/BF16, and the application of the weight scale (in the BF16 case, many intermediate conversions are also avoided);
  2. the weight is pre-packed & cached by Inductor, so a reorder is avoided at run-time.

In the future, though, the whole pattern (including the application of the activation scale, which would be a mul post-op) plus bias could be fused, once the corresponding support is enabled in ATen.

Verification

Added UT in this PR

python test/inductor/test_mkldnn_pattern_matcher.py -v -k test_da8w8_sym_act_sym_wgt_with_int_mm

Corresponding torchao UTs

  1. int8 Smoothquant legacy API - TORCHINDUCTOR_FREEZING=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" python test/integration/test_integration.py -v -k test_non_dynamically_quantizable_linear.
    The difference from [Inductor][CPU] Fuse SmoothQuant int8 linear pattern #139595 is that there are no reshapes of the linear output in this pattern.

  2. int8 da8w8 - symmetrically quantized activation (dynamically) & statically quantized weights - TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" TORCHINDUCTOR_FREEZING=1 python test/integration/test_integration.py -v -k test_int8_dynamic_quant_subclass_api_0_cpu

Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Dec 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142015

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 7dc7fa7 with merge base 7dfb439:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@sanchitintel sanchitintel changed the title [Inductor] Add torchao da8w8 pattern with sym quantized act & wgt [Inductor][CPU] Add torchao da8w8 pattern with sym quantized act & wgt Dec 4, 2024
sanchitintel added a commit that referenced this pull request Dec 4, 2024
@sanchitintel sanchitintel requested a review from jgong5 December 4, 2024 02:45
# In practice, though, they may also match the smooth-quant pattern when the input shape is 2D.
# Since add is not currently being used as a oneDNN post-op, but is unfused, we don't need these patterns with bias.
# Ideally, we should add mul + add post-op support in ATen int8 oneDNN linear op.
pattern1_with_no_outer_or_act_reshape = get_pattern_no_bias(

I have a question regarding the bias. In @Xia-Weiwen's base PR, it handles 2 cases:

  • Case 1: when the activation is per-tensor quantized, the bias can be one of the inputs to qlinear.
  • Case 2: when the activation is per-channel quantized, the bias can't be fused and will exist as an epilogue.

But here in this PR, we only register the pattern without bias, which may cause a difference for case 1. May I know the reason for this difference?

Collaborator Author

@sanchitintel sanchitintel Dec 4, 2024


But here in this PR, we only register the pattern without bias, which may cause a difference for case 1. May I know the reason for this difference?

The torchao int8_dynamic_activation_int8_weight API supports per-token quantization of the activation, not a scalar activation scale.

Please refer to https://github.com/pytorch/ao/blob/1a0dbf1c41ad1c6f28d6501e1134b30ea2f2590d/torchao/quantization/quant_api.py#L741-L746


Basically, in case of int8_dynamic_activation_int8_weight, the activation scale is a vector.

@leslie-fang-intel, if the SmoothQuant case with a 2D activation & a scalar activation scale is to be supported, then bias would also have to be in the pattern. Please let me know if support for it also needs to be added.

sanchitintel added a commit that referenced this pull request Dec 4, 2024

@leslie-fang-intel leslie-fang-intel left a comment


LGTM

@sanchitintel
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 4, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@sanchitintel
Collaborator Author

The base ghstack PR changed, so a new PR needs to be created.

@sanchitintel
Collaborator Author

Opened #142110 with the new base PR.

@github-actions github-actions bot deleted the gh/sanchitintel/3/head branch January 5, 2025 02:10
