[WOQ] Add XPU kernel for _weight_int8pack_mm#160938
xiaowangintel wants to merge 2 commits into pytorch:main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160938
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 unrelated failures) As of commit bcdb3c0 with merge base 90f50f7.
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Attention! native_functions.yaml was changed. If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs: one that adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info. Caused by:
Attention! One of PyTorch's C-stable API files was changed. You MUST NOT change existing function declarations in it, as this header defines a stable C ABI. If you need to change the signature of a function, introduce a new v2 version of the function and modify code generation to target the new version. Caused by:
@liangan1 @ZhiweiYan-96 @guangyey @EikanWang Please help review this PR.
TORCH_CHECK(
    A.dtype() == kBFloat16 || A.dtype() == kHalf || A.dtype() == kFloat,
    __func__,
    " : expect A to be either 32-bit or 16-bit float tensor.");
TORCH_CHECK(A.dim() == 2, __func__, " : expect A to be 2D tensor.");
TORCH_CHECK(
    A.stride(1) == 1,
    __func__,
    " : A must be contiguous on the last dimension.");
TORCH_CHECK(B.dtype() == kChar, __func__, " : expect B to be int8 tensor.");
TORCH_CHECK(B.is_contiguous(), __func__, " : expect B to be contiguous.");
TORCH_CHECK(B.size(1) == K, __func__, " : expect B.size(1) == ", K);
__func__ is already included by TORCH_CHECK, so it could be removed here.
A.contiguous(),
1.0,
0,
B.contiguous(),
Suggested change:
- B.contiguous(),
+ B
// --- Launch kernel ---
Tensor bias = at::Tensor();
Tensor mat2_zero_points = at::Tensor();
Tensor non_const_scales = scales;
Since there are no further operations on non_const_scales, why not use scales directly?
quantized_matmul receives the weight scales as a non-const lvalue reference. However, scales is a const Tensor&, which causes C++ compilation errors.
Generally LGTM.
" : expect scales to be 1d tensor with size ",
N);

auto C = at::empty({M, N}, A.options());
Is it possible to invoke native::empty directly?
@jerryzh168 can you help to review this PR?
@xiaowangintel Please rebase the code and fix the CI issue.
Force-pushed from 9df7c03 to 74737aa.
Force-pushed from 74737aa to faca71e.
@drisspg can you help to review this PR?
jerryzh168
left a comment
Can't provide meaningful reviews as I'm not familiar with the hardware details, but I can stamp. Also, should this op live in torchao in the end?
Thanks Jerry. Yes, this op will be used to speed up WOQ-INT8 in torchao.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary: This PR implements an XPU kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that was previously only supported on CPU and CUDA. Motivation: Same as pytorch#159325. Pull Request resolved: pytorch#160938 Approved by: https://github.com/EikanWang, https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/jerryzh168
Summary: Supports the woq_int8 Inductor pattern on Intel GPU. When using torch.compile, woq_int8 is lowered to _weight_int8pack_mm instead of falling back to mul().sum(). The Intel GPU backend of _weight_int8pack_mm was added in #160938. Pull Request resolved: #163615 Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/desertfire, https://github.com/jansel
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168