[Intel GPU] int4 WOQ gemm XPU Support by ZhiweiYan-96 · Pull Request #137566 · pytorch/pytorch

ZhiweiYan-96 · 2024-10-09T07:24:35Z

Stack from ghstack (oldest at bottom):

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @gujinghui @EikanWang @fengyuan14 @guangyey

[ghstack-poisoned]

pytorch-bot · 2024-10-09T07:24:39Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137566

❌ 4 New Failures

As of commit 8b14d42 with merge base d7f3cd0:

NEW FAILURES - The following jobs have failed:

xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 1, 4, linux.idc.xpu) (gh)
inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_pow_by_natural_log2_dynamic_shapes_dynamic_shapes_xpu
xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 2, 4, linux.idc.xpu) (gh)
export/test_cpp_serdes.py::CppSerdesTestExport::test_device_to_gpu_cpp_serdes
xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 3, 4, linux.idc.xpu) (gh)
export/test_retraceability.py::RetraceExportTestExport::test_device_to_gpu_retraceability_strict
xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 4, 4, linux.idc.xpu) (gh)
inductor/test_torchinductor.py::GPUTests::test_pow_by_natural_log2_dynamic_shapes_xpu

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 403047a Pull Request resolved: #137566

ZhiweiYan-96 · 2024-10-09T07:27:48Z

@zhuyuhua-v Could you please review the PR?

EikanWang · 2024-10-10T01:27:13Z

+      "oneDNN input matrixes must have the same ranks");
+  TORCH_CHECK(result.defined(), "oneDNN matmul result should be defined");
+
+  at::Device curDevice = at::Device(at::kXPU, at::xpu::current_device());


Please unify the code style. curDevice -> cur_device.

Thanks for your suggestion; the naming has been changed.

EikanWang · 2024-10-10T01:31:30Z

+    mb = dst.size(0);
+    TORCH_CHECK(
+        mb == m1.size(0) && mb == m2.size(0),
+        "batch size mismatch, dst mb: ",


Is mb a common term? Can users fully understand the exact meaning of mb?

Thanks for your suggestion. mb means minibatch here, but I reviewed the code and removed mb, since int4_gemm does not need to handle batches currently.

EikanWang · 2024-10-10T01:35:38Z

+  scale_usr_md = dnnl::memory::desc(scale_dims, scale_user_dt, scale_strides);
+  zp_usr_md = dnnl::memory::desc(zp_usr_dims, zp_user_dt, zp_usr_strides);
+  dst_usr_md = dnnl::memory::desc(dst_dims, dst_usr_dt, dst_strides);
+  // STEP4: create dnnl::memory


Where are STEP 2 and STEP 3?

I have removed this kind of comment and added new comments in the code.

EikanWang · 2024-10-10T01:46:28Z

+  args.insert({DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_WEIGHTS, zp_usr_m});
+
+  sycl::event matmul_event = dnnl::sycl_interop::execute(matmul_p, stream, args, deps);
+  if (!dst.is_same(result))


When is dst not the same as result?

These issues stem from the WOQ matmul being ported from the regular matmul. int4_gemm currently has no need to handle the case where dst is not the same as result, so I have removed the code.

EikanWang · 2024-10-10T01:47:09Z

+  sycl::event matmul_event = dnnl::sycl_interop::execute(matmul_p, stream, args, deps);
+  if (!dst.is_same(result))
+    result.copy_(dst);
+  result = resize_as_onednn_mat1(mat1_, result);


When is resize_as_onednn_mat1 required?

EikanWang · 2024-10-10T01:56:39Z

+  args.insert({DNNL_ARG_SCRATCHPAD, scratchpad_memory});
+
+  if (attr.with_binary())
+    attr.construct_post_binary(matmul_pd, args);


attr constructs the post binary. However, dnnl::post_ops po = attr.extract_post_ops(dst); has extracted the post ops and pattr.set_post_ops(po); has assigned the post op to matmul primitive attribute. Is it a valid behavior?

int4 has no post-ops currently, so I have removed the code. Thanks.

EikanWang · 2024-10-10T01:58:07Z

+      dnnl::memory::data_type::s8);
+  // Set fpmath mode with `apply_to_int=true` to apply fpmath mode behavior to
+  // integral primitives (in this example, matmul).
+  pattr.set_fpmath_mode(dnnl::fpmath_mode::f16, true);


OneDNN supports both f16 and bf16. Why do we need to constrain the dtype?

We now have a control statement that determines which dtype is used for fpmath_mode, thanks. However, bf16 has a runtime issue in the current oneDNN version; the bf16 dtype is valid in newer versions of oneDNN.

ZhiweiYan-96 · 2024-10-15T06:47:16Z

@liangan1 Could you please review the PR?

liangan1 · 2024-10-15T23:14:00Z

+  TORCH_CHECK(
+      dims == mat1.dim() && dims == mat2.dim(),
+      "oneDNN input matrixes must have the same ranks");
+  TORCH_CHECK(result.defined(), "oneDNN matmul result should be defined");


Since you have flattened mat1 and mat2 to dims=2 and the result is also a 2-dimensional empty tensor, when will dim=3 occur or the result be undefined? Can you show an example?

Some of the logic was outdated, and we have now removed such odd code from the gemm integration.

liangan1 · 2024-10-15T23:21:06Z

+    Attr attr,
+    const c10::optional<Tensor>& g_idx,
+    const std::vector<sycl::event>& deps,
+    Tensor b_raw = at::Tensor()) {


Change to bias_raw?

Bias is not present in the weight_int4pack_mm API, and I have removed the bias-related code in the newest commit. Thanks for your suggestion.

liangan1 · 2024-10-15T23:31:37Z

+              (b.size(0) == 1 && b.size(1) == 1),
+          "matmul supports [m, n] or [1, n] or [m, 1] or [1, 1] when bias dim is 2 ...");
+      if (b.size(0) == 1 && b.size(1) == 1)
+        b = b.expand({1, n}).contiguous();


In the other cases (e.g., b.dim() = 1/3/0), you always expand b to the same number of dims as m1. Does it work when m1.dim() == 3 while b.dim() == 2? According to the oneDNN doc: "all tensors (including bias, if it exists) must have the same number of dimensions."

I have removed this code since it is bias-related. Thanks for the reminder.

liangan1 · 2024-10-15T23:40:11Z

+  auto m2_usr_dt = get_onednn_dtype(m2);
+  auto scale_user_dt = get_onednn_dtype(scale_); // half <==> fp16
+  //   auto zp_user_dt = dnnl::memory::data_type::s4; // int32, representing 8xint4
+  auto zp_user_dt = get_onednn_dtype(zp_);


Suggest changing xxx_user_xxx to xxx_usr_xxx to unify the style. Since oneDNN supports different data types, suggest changing the comment to "e.g., half <==> f16".

liangan1 · 2024-10-16T00:31:42Z

+  return output.view_symint(sizes);
+}
+
+sycl::event woq_matmul_int4(


Suggest adding a more detailed function description here, e.g., the supported activation data types, the data layout of both inputs, etc.

More detailed description is added in older commits.

liangan1 · 2024-10-16T02:26:28Z

+
+  m2_usr_dims = {compressed_k, n};
+  scale_dims = {num_groups, n};
+  zp_dims = {1};


The dims of zp_dims are not aligned with the original zp inputs. With this limitation, only symmetric or per-tensor quantization is supported. Please add a comment about this oneDNN limitation.

oneDNN provides a way to support asymmetric quantization, so we can handle asymmetric scenarios. I'm currently testing it, and if it works, I will modify the code here to support both symmetric and asymmetric logic.
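
To make the limitation discussed above concrete, here is a minimal sketch in plain Python (no oneDNN or PyTorch involved; `dequantize` and its parameter names are hypothetical and not part of this PR). It contrasts a single shared zero point, as in the `zp_dims = {1}` case, with a `[num_groups, n]` zero-point table in which each group and output column carries its own offset.

```python
from typing import List, Union


def dequantize(
    q: List[List[int]],                        # quantized weight, shape [k, n], values in [0, 15]
    scales: List[List[float]],                 # shape [k // group_size, n]
    zero_points: Union[int, List[List[int]]],  # one scalar, or shape [k // group_size, n]
    group_size: int,
) -> List[List[float]]:
    """Group-wise dequantization: w = (q - zp) * scale."""
    k, n = len(q), len(q[0])
    out = [[0.0] * n for _ in range(k)]
    for row in range(k):
        g = row // group_size  # quantization group this row of K belongs to
        for col in range(n):
            if isinstance(zero_points, int):
                zp = zero_points          # per-tensor: one offset for everything
            else:
                zp = zero_points[g][col]  # per-group: offset varies with (g, col)
            out[row][col] = (q[row][col] - zp) * scales[g][col]
    return out


q = [[0, 15], [8, 8], [15, 0], [8, 8]]   # k = 4, n = 2
scales = [[0.5, 1.0], [2.0, 0.25]]       # two groups of size 2

shared = dequantize(q, scales, 8, group_size=2)
per_group = dequantize(q, scales, [[0, 8], [15, 0]], group_size=2)
```

With the shared zero point, every group is forced to be centered on the same integer; the per-group table removes that constraint, which is what asymmetric group-wise quantization needs.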

airMeng

There should be a prepack process since OneDNN doesn't support the most popular layout

[ghstack-poisoned]

ghstack-source-id: 451a44c Pull Request resolved: #137566

[ghstack-poisoned]

ghstack-source-id: ab34a0e Pull Request resolved: #137566

liangan1 · 2024-11-05T05:00:27Z

There should be a prepack process since OneDNN doesn't support the most popular layout

https://github.com/intel/torch-xpu-ops/pull/1035/files This PR is used to do int4 weight prepack.

[ghstack-poisoned]

ghstack-source-id: 9863ceb Pull Request resolved: #137566

[ghstack-poisoned]

ghstack-source-id: 6919d03 Pull Request resolved: #137566

liangan1 · 2024-11-28T08:02:06Z

+                b, n_bit=4, q_group_size=q_group
+            )
+            # b_int4pack [n, k//8]
+            b_int4pack = torch._convert_weight_to_int4pack(


This should be b_int4pack [k//8, n]

Thanks for the reminder; I have updated the description here.
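
The `[k//8, n]` shape follows from packing eight 4-bit values into one 32-bit word along the K dimension. Below is a minimal pure-Python sketch, assuming little-nibble-first ordering. The thread does not state the bit order the XPU kernel actually uses, and `pack_int4_along_k` / `unpack_int4_along_k` are hypothetical helpers, not PyTorch APIs.

```python
from typing import List


def pack_int4_along_k(w: List[List[int]], pack: int = 8) -> List[List[int]]:
    """Pack a [k, n] matrix of 4-bit values into [k // pack, n] 32-bit words.

    Assumption: row r lands in nibble (r % pack) of its word, lowest nibble first.
    Words are kept as non-negative Python ints (the unsigned view of an int32).
    """
    k, n = len(w), len(w[0])
    if k % pack != 0:
        raise ValueError("k must be a multiple of the pack factor")
    packed = [[0] * n for _ in range(k // pack)]
    for row in range(k):
        for col in range(n):
            v = w[row][col]
            if not 0 <= v <= 15:
                raise ValueError("values must fit in 4 bits")
            packed[row // pack][col] |= v << (4 * (row % pack))
    return packed


def unpack_int4_along_k(packed: List[List[int]], pack: int = 8) -> List[List[int]]:
    """Inverse of pack_int4_along_k: recover the [k, n] matrix of 4-bit values."""
    n = len(packed[0])
    out = []
    for word_row in packed:
        for i in range(pack):
            out.append([(word_row[col] >> (4 * i)) & 0xF for col in range(n)])
    return out


k, n = 16, 3
w = [[(r + c) % 16 for c in range(n)] for r in range(k)]
b_int4pack = pack_int4_along_k(w)
```

Here a `[16, 3]` weight becomes a `[2, 3]` packed matrix, matching the `[k//8, n]` description.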

liangan1 · 2024-11-29T02:00:32Z

+  sizes[sizes.size() - 1] = n;
+  return output.view_symint(sizes);
+}
+


Should remove this?

Yes, this code has been removed in the newest commit. Thanks for the reminder.

liangan1 · 2024-11-29T02:02:26Z

+  Tensor m1 = is_onednn_matmul_strides(mat1_) ? mat1_ : mat1_.contiguous();
+  //m2_ may be a 4 dims fake tensor in torchAO with shape {N / 8, K / (16 * innerKTiles), 32, innerKTiles / 2}
+  //Tensor m2 = mat2_.flatten(0, -2); //ToDo: change to the fke shape: mat2_.flatten(0, -2); // N1
+  Tensor m2 = is_onednn_matmul_strides(mat2_) ? mat2_ : mat2_.contiguous();


Remove these comments.

liangan1 · 2024-11-29T02:08:55Z

-  auto expected_m1_md = matmul_pd.src_desc();
-  auto expected_m2_md = matmul_pd.weights_desc();
-  auto expected_dst_md = matmul_pd.dst_desc();
-


Need to remove this part.

liangan1 · 2024-11-29T02:10:46Z

+    zeros = min_int - min_val.div(scales).round()
+    zeros = torch.clamp(zeros, min_int, max_int)
+    zeros = zeros.to(torch.int8)
    assert torch.isnan(zeros).sum() == 0


This is also used in tinygemm, should not change this one.

Thanks for pointing this out; I have moved the code to xpu/test_gemm.py.
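
For readers following the integer zero-point computation in the diff above, here is a minimal pure-Python sketch for a single quantization group. The `zero` line mirrors the `min_int - round(min_val / scale)` expression and the clamp from the diff; the scale formula and the helper names `group_qparams` and `quantize` are assumptions for illustration, since the thread does not show how `scales` is computed.

```python
from typing import List, Tuple


def group_qparams(values: List[float], n_bit: int = 4) -> Tuple[float, int]:
    """Compute (scale, integer zero point) for one quantization group."""
    min_int, max_int = 0, 2**n_bit - 1
    min_val, max_val = min(values), max(values)
    # Assumed scale formula: spread the value range over the integer range.
    scale = max(max_val - min_val, 1e-6) / (max_int - min_int)
    # Mirrors the diff: zeros = min_int - round(min_val / scales), then clamp.
    zero = min_int - round(min_val / scale)
    zero = max(min_int, min(max_int, zero))
    return scale, zero


def quantize(values: List[float], scale: float, zero: int, n_bit: int = 4) -> List[int]:
    """Map floats to unsigned n-bit integers using q = round(v / scale) + zero."""
    max_int = 2**n_bit - 1
    return [max(0, min(max_int, round(v / scale) + zero)) for v in values]


group = [-1.5, 0.0, 1.5, 6.0]
scale, zero = group_qparams(group)
q = quantize(group, scale, zero)
restored = [(qi - zero) * scale for qi in q]
```

Python's `round` rounds halves to even, as `torch.round` does, so the sketch follows the same rounding rule as the tensor code.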

EikanWang · 2025-03-03T07:41:41Z

+    const at::Tensor& zp, // [k/group_size, N]
+    int64_t group_size,
+    Attr attr,
+    const std::vector<sycl::event>& deps = {});


Why does this operation require deps?

Formerly, fs1 required us to add events at the oneDNN integration layer for profiling purposes. For me, it is just intended to keep the API consistent with conv/gemm. Do we need to remove this?

EikanWang · 2025-03-03T07:45:14Z

+    const at::Tensor& scale, // [K/group_size, N]
+    const at::Tensor& zp, // [k/group_size, N]
+    int64_t group_size,
+    Attr attr,


Suggested change

-    Attr attr,
+    std::optional<Attr> attr = std::nullopt,

Will remove attr, as we do not append post-ops currently.

EikanWang · 2025-03-03T07:46:38Z

+    const at::Tensor& zp, // [k/group_size, N]
+    int64_t group_size,
+    Attr attr,
+    const std::vector<sycl::event>& deps = {});


Suggested change

-    const std::vector<sycl::event>& deps = {});
+    const std::optional<std::vector<sycl::event>>& deps = std::nullopt);

EikanWang · 2025-03-03T07:48:23Z

+
+  // qscale:[K/qGroupSize, N]
+  // qzp:[K/qGroupSize, N]
+  woq_matmul_int4(C, A, B, qScale, qZeros, qGroupSize, onednn::Attr());


Is there any case that we need to fuse other operations? What's the motivation here to provide attributes?

EikanWang · 2025-03-03T07:52:49Z

+    const Tensor& A,
+    const Tensor& B,
+    int64_t qGroupSize,
+    const Tensor& qScale,
+    const Tensor& qZeros) {


@ZhiweiYan-96, the code style of Blas.cpp is snake_case; why are these variables in camelCase?

Hi @EikanWang, the naming style here is meant to align with other backends such as CUDA (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/int4mm.cu#L1097) and CPU (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/LinearAlgebra.cpp#L3461).

EikanWang · 2025-03-03T07:55:48Z

+
+  at::Device cur_device = at::Device(at::kXPU, at::xpu::current_device());
+  auto engine = GpuEngineManager::Instance().get_engine(cur_device);
+  auto stream = GpuStreamManager::Instance().get_stream();


@ZhiweiYan-96, may I know where the guard code is that ensures all input tensors are on the same device?

EikanWang · 2025-03-03T08:05:18Z

+  dst_md = dnnl::memory::desc(dst_dims, dst_dt, dst_strides);
+
+  std::unordered_map<int, dnnl::memory> args;
+  dnnl::post_ops po = attr.extract_post_ops(dst);


The po appears to be unused. Has this file been added to the torch linter?

Post-ops are not required at present. We can remove the post-op and add it back when it is necessary.

All files in xpu/detail/*.cpp are in the linter's checking list. I have met this before. The likely cause is that the linter does not check for this kind of unused-variable issue.

[ghstack-poisoned]

guangyey · 2025-03-25T02:07:53Z

+  auto engine = GpuEngineManager::Instance().get_engine(cur_device);
+  auto stream = GpuStreamManager::Instance().get_stream();


Suggested change

-  auto engine = GpuEngineManager::Instance().get_engine(cur_device);
-  auto stream = GpuStreamManager::Instance().get_stream();
+  auto& engine = GpuEngineManager::Instance().get_engine();
+  auto& stream = GpuStreamManager::Instance().get_stream();

Thanks for the information. I have updated the code.

[ghstack-poisoned]

ZhiweiYan-96 · 2025-03-25T08:33:17Z

Update

Remove all usage of post-op code (attr, post-op), because int4 gemm currently has no post-op requirement.
Remove sycl::event-related code. We will add it back when it is really required.

[ghstack-poisoned]

pytorchmergebot · 2025-04-08T01:54:01Z

Rebased gh/ZhiweiYan-96/47/orig onto refs/remotes/origin/viable/strict because #147962 was rebased, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/137566)

[ghstack-poisoned]

pytorchmergebot · 2025-04-08T15:13:20Z

Starting merge as part of PR stack under #147962

pytorchmergebot · 2025-04-08T15:30:03Z

Starting merge as part of PR stack under #147962

…tration (#147962) Pull Request resolved: #147962 Approved by: https://github.com/jerryzh168, https://github.com/guangyey, https://github.com/EikanWang ghstack dependencies: #137566 Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>

Pull Request resolved: pytorch#137566 Approved by: https://github.com/liangan1, https://github.com/guangyey, https://github.com/EikanWang Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>

…tration (pytorch#147962) Pull Request resolved: pytorch#147962 Approved by: https://github.com/jerryzh168, https://github.com/guangyey, https://github.com/EikanWang ghstack dependencies: pytorch#137566 Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>

ghstack-source-id: c2c4f90 Pull Request resolved: pytorch/pytorch#137566

Update

45fdfbb

[ghstack-poisoned]

pytorch-bot Bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 9, 2024

ZhiweiYan-96 added a commit that referenced this pull request Oct 9, 2024

[Intel GPU] int4 WOQ gemm XPU Support

f4126b3

ghstack-source-id: 403047a Pull Request resolved: #137566

ZhiweiYan-96 marked this pull request as draft October 9, 2024 07:26

ZhiweiYan-96 requested a review from EikanWang October 9, 2024 07:28

ZhiweiYan-96 added module: xpu Intel XPU related issues topic: not user facing topic category ciflow/xpu Run XPU CI tasks labels Oct 9, 2024

pytorchbot added the open source label Oct 9, 2024

EikanWang reviewed Oct 10, 2024

liangan1 reviewed Oct 16, 2024

airMeng reviewed Oct 31, 2024

Update

78f1204

[ghstack-poisoned]

ZhiweiYan-96 added a commit that referenced this pull request Nov 2, 2024

[Intel GPU] int4 WOQ gemm XPU Support

35eb713

ghstack-source-id: 451a44c Pull Request resolved: #137566

Update

b3cf1c5

[ghstack-poisoned]

ZhiweiYan-96 added a commit that referenced this pull request Nov 5, 2024

[Intel GPU] int4 WOQ gemm XPU Support

7e579ed

ghstack-source-id: ab34a0e Pull Request resolved: #137566

Update

b7a91c9

[ghstack-poisoned]

ZhiweiYan-96 added a commit that referenced this pull request Nov 24, 2024

[Intel GPU] int4 WOQ gemm XPU Support

1f59bb8

ghstack-source-id: 9863ceb Pull Request resolved: #137566

Update

9ed8263

[ghstack-poisoned]

ZhiweiYan-96 requested a review from zhuyuhua-v November 28, 2024 06:58

Update

9c5d49f

[ghstack-poisoned]

ZhiweiYan-96 added a commit that referenced this pull request Nov 28, 2024

[Intel GPU] int4 WOQ gemm XPU Support

2e0e466

ghstack-source-id: 6919d03 Pull Request resolved: #137566

liangan1 reviewed Nov 29, 2024

airMeng mentioned this pull request Dec 2, 2024

linear_int4_kernel for XPU intel/torch-xpu-ops#1130

Merged

EikanWang reviewed Mar 3, 2025

ZhiweiYan-96 and others added 2 commits March 4, 2025 01:48

Update

f5e5b0e

[ghstack-poisoned]

Update

ed93be5

[ghstack-poisoned]

guangyey approved these changes Mar 11, 2025

Update

dbb110a

[ghstack-poisoned]

guangyey reviewed Mar 25, 2025

Update

18d1f4c

[ghstack-poisoned]

ZhiweiYan-96 requested a review from EikanWang March 25, 2025 08:33

EikanWang approved these changes Apr 7, 2025

EikanWang moved this from Pre-Review Required to Approved in PyTorch Intel Apr 7, 2025

EikanWang marked this pull request as ready for review April 7, 2025 13:17

EikanWang requested a review from gujinghui as a code owner April 7, 2025 13:17

Update

8083f57

[ghstack-poisoned]

Update

8b14d42

[ghstack-poisoned]

pytorchmergebot closed this in da73225 Apr 8, 2025

pytorchmergebot added the Merged label Apr 8, 2025

github-project-automation Bot moved this from Approved to Done in PyTorch Intel Apr 8, 2025

Divigroup-RAP pushed a commit to Divigroup-RAP/PYTORCH that referenced this pull request Apr 22, 2025

[Intel GPU] int4 WOQ gemm XPU Support

d0984cf

ghstack-source-id: c2c4f90 Pull Request resolved: pytorch/pytorch#137566

liangan1 mentioned this pull request May 7, 2025

[RFC][API-Unstable]A16W4 on XPU Device #153019

Closed

5 tasks

github-actions Bot deleted the gh/ZhiweiYan-96/32/head branch May 15, 2025 02:18
