linear_int4_kernel for XPU by sunjiweiswift · Pull Request #1130 · intel/torch-xpu-ops

sunjiweiswift · 2024-11-29T09:29:31Z

Pure SYCL path for int4 GEMM.

Benchmark results on PVC-1100. The remaining gap comes from not using 2D load.

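| M | K | N | SrcT | WeiT | DstT | Bandwidth usage |
|---|-------|-------|---------|---------|---------|-------|
| 1 | 4096 | 4096 | float16 | float16 | float16 | 53.7% |
| 1 | 4096 | 11008 | float16 | float16 | float16 | 57.4% |
| 1 | 4096 | 16384 | float16 | float16 | float16 | 59.7% |
| 1 | 12288 | 4096 | float16 | float16 | float16 | 77.3% |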

Besides PVC, the kernel can achieve:
- 92.7% bandwidth usage on MTL
- 84.7% bandwidth usage on A750
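The PR does not say how the bandwidth usage figures were computed. One common definition is achieved bandwidth (bytes moved divided by kernel time) over the device's peak bandwidth. The sketch below follows that definition under stated assumptions: fp16 activations and output, weights packed as two int4 values per byte, and scales and zero points ignored. The function names are hypothetical, and the peak bandwidth must come from the caller because no value is given in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Bytes moved by one int4 weight-only GEMM call, counting only the three
// tensors (assumption: scales and zero points are ignored here).
inline double int4_gemm_bytes(std::int64_t m, std::int64_t k, std::int64_t n) {
  assert(m > 0 && k > 0 && n > 0 && k % 2 == 0);
  const double act_bytes = static_cast<double>(m) * k * 2.0;  // fp16 activations
  const double wei_bytes = static_cast<double>(k) * n / 2.0;  // two int4 per byte
  const double out_bytes = static_cast<double>(m) * n * 2.0;  // fp16 output
  return act_bytes + wei_bytes + out_bytes;
}

// Achieved bandwidth divided by peak bandwidth, as a fraction.
// `seconds` is the measured kernel time; `peak_bytes_per_s` is device-specific
// and supplied by the caller (no value is assumed here).
inline double bw_usage(double bytes, double seconds, double peak_bytes_per_s) {
  assert(seconds > 0.0 && peak_bytes_per_s > 0.0);
  return (bytes / seconds) / peak_bytes_per_s;
}
```

For M=1 the weight term dominates the byte count, which is why these shapes are a bandwidth benchmark rather than a compute benchmark.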

Reset to bfdbaf4

---------

Co-authored-by: mengfei25 <mengfei.li@Intel.com>
Co-authored-by: LuFengqing <fengqing.lu@intel.com>
Co-authored-by: Ratnam Parikh <114774508+ratnampa@users.noreply.github.com>
Co-authored-by: Feng Yuan <feng1.yuan@intel.com>

mingfeima

The biggest question is: why do we need post-op fusion here? Does PyTorch have it with CUDA?

mingfeima · 2024-12-02T02:11:04Z

@liangan1 CC

mingfeima · 2024-12-02T02:18:00Z

@sunjiweiswift for the perf benchmarking, please include configs other than M=1. These would serve as a reference for the final decision. I expect that large M will have worse perf, but that's fine; we still need to know the numbers.

#### Bugfix
- [add lazy init for empty_xpu](#1115)
- [nan propagation for soft_shrink](https://github.com/intel/torch-xpu-ops/pull/1116/files#diff-b7cb5876d000db957286c8b0e72badb2b7502402c8955334f1cc21c34c98a5b9)

---------

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: ZhiweiYan-96 <zhiwei.yan@intel.com>

Resolve: pytorch/pytorch#142102

mingfeima

generally LGTM. good job:)

mingfeima · 2024-12-23T03:04:38Z

+      for (int i = 0; i < k; i += GroupK * Unroll) {
+#pragma unroll
+        for (int iu = 0; iu < Unroll; iu++) {
+          uint8_t tmps8[TileK / 2];
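The `uint8_t tmps8[TileK / 2]` buffer in the snippet above implies that `TileK` int4 weights occupy `TileK / 2` bytes, two per byte. A minimal scalar sketch of unpacking and dequantizing such a tile follows. The nibble order (low nibble first), the single scale and zero point per tile, and the function name are assumptions for illustration; they are not taken from the PR's kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Dot product of `tile_k` activations with `tile_k` int4 weights stored two
// per byte. Assumptions (not taken from the PR): the low nibble holds the
// even-indexed weight, and one scale / zero point covers the whole tile.
inline float dot_int4_tile(const float* act, const std::uint8_t* packed,
                           std::size_t tile_k, float scale, int zero_point) {
  assert(tile_k % 2 == 0);
  float acc = 0.0f;
  for (std::size_t i = 0; i < tile_k; i += 2) {
    const std::uint8_t byte = packed[i / 2];
    const int lo = byte & 0x0F;         // weight i
    const int hi = (byte >> 4) & 0x0F;  // weight i + 1
    // Dequantize each nibble as (q - zero_point) * scale, then accumulate.
    acc += act[i] * (static_cast<float>(lo - zero_point) * scale);
    acc += act[i + 1] * (static_cast<float>(hi - zero_point) * scale);
  }
  return acc;
}
```

A real SYCL kernel would spread this loop across a sub-group and reduce the partial sums; the sketch only shows the per-byte arithmetic.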


Maybe we can do a little template trick to simplify this piece of logic: have one template that handles all scenarios, and pass the corresponding arguments when calling it.

template <typename scalar_t, int SgSize, int TileK, int Unroll>
void tinygemm_kernel(...)

if (k % (SgSize * 32 * Unroll) == 0) {
  // use tinygemm_kernel<...>
} else {
  // use tinygemm_kernel<...>
}

Not a must-have, just a little trick.
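One way to read the suggestion above is as a single templated body plus a small runtime dispatcher, sketched below. Everything here is hypothetical: the `HasTail` flag, the `_sketch` names, and the use of `TileK` where the comment writes the literal `32` are assumptions, and the body only counts loop steps instead of doing any GEMM work.

```cpp
#include <cassert>

// Hypothetical stand-in for the kernel body: one template covers both cases,
// and the compile-time flag decides whether a tail loop is needed.
template <int SgSize, int TileK, int Unroll, bool HasTail>
int tinygemm_sketch(int k) {
  assert(k >= 0);
  constexpr int kStep = SgSize * TileK * Unroll;
  int iterations = 0;
  int i = 0;
  for (; i + kStep <= k; i += kStep) {
    ++iterations;  // one full, unrolled step
  }
  if (HasTail) {  // compile-time constant, so the dead branch is dropped
    for (; i < k; i += SgSize * TileK) {
      ++iterations;  // remainder handled one tile row at a time
    }
  }
  return iterations;
}

// The runtime check picks the instantiation once, outside the hot loop.
template <int SgSize, int TileK, int Unroll>
int launch_tinygemm_sketch(int k) {
  if (k % (SgSize * TileK * Unroll) == 0) {
    return tinygemm_sketch<SgSize, TileK, Unroll, false>(k);
  }
  return tinygemm_sketch<SgSize, TileK, Unroll, true>(k);
}
```

The benefit is that the divisible and non-divisible paths share one source body, so a fix to the main loop applies to both.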

airMeng · 2024-12-23T03:46:42Z

@xytintel it's not only this PR; the last several CI runs have all failed. Could you check?

2024-12-23T02:49:25.9617774Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp: In function ‘bool sdp::can_use_mem_efficient_attention(sdp::sdp_params, bool)’:
2024-12-23T02:49:25.9624103Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp:37:7: error: ‘array_of’ was not declared in this scope; did you mean ‘c10::array_of’?
2024-12-23T02:49:25.9625568Z    37 |       array_of<bool (*)(sdp_params const&, bool)>(
2024-12-23T02:49:25.9626012Z       |       ^~~~~~~~
2024-12-23T02:49:25.9626381Z       |       c10::array_of
2024-12-23T02:49:25.9627538Z In file included from /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp:2:
2024-12-23T02:49:25.9629144Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/c10/util/Array.h:14:23: note: ‘c10::array_of’ declared here
2024-12-23T02:49:25.9630059Z    14 | inline constexpr auto array_of(T&&... t) -> std::array<V, sizeof...(T)> {
2024-12-23T02:49:25.9630604Z       |                       ^~~~~~~~
2024-12-23T02:49:25.9631919Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp:37:16: error: expected primary-expression before ‘bool’
2024-12-23T02:49:25.9633177Z    37 |       array_of<bool (*)(sdp_params const&, bool)>(
2024-12-23T02:49:25.9633623Z       |                ^~~~
2024-12-23T02:49:25.9635453Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp:50:18: error: expected primary-expression before ‘bool’
2024-12-23T02:49:25.9636852Z    50 |         array_of<bool (*)(sdp_params const&, bool)>(
2024-12-23T02:49:25.9637304Z       |                  ^~~~
2024-12-23T02:49:25.9638691Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp:63:18: error: expected primary-expression before ‘bool’
2024-12-23T02:49:25.9639958Z    63 |         array_of<bool (*)(sdp_params const&, bool)>(
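The first error in the log is a name-lookup failure: `array_of` is called unqualified from inside namespace `sdp`, and the compiler itself suggests `c10::array_of`. The later "expected primary-expression" errors follow from the same unresolved name. A self-contained sketch of the qualified call is shown below. The `c10_sketch` and `sdp_sketch` namespaces, the two check functions, and the struct layout are stand-ins invented for illustration; only the `array_of` signature is copied from the declaration quoted in the log.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

namespace c10_sketch {
// Same signature as the declaration quoted in the log (c10/util/Array.h:14).
template <typename V, typename... T>
inline constexpr auto array_of(T&&... t) -> std::array<V, sizeof...(T)> {
  return {{std::forward<T>(t)...}};
}
}  // namespace c10_sketch

namespace sdp_sketch {
// Hypothetical stand-in for sdp::sdp_params.
struct sdp_params {
  bool ok;
};

inline bool check_a(sdp_params const& p, bool /*debug*/) { return p.ok; }
inline bool check_b(sdp_params const& /*p*/, bool debug) { return !debug; }

inline bool can_use_sketch(sdp_params const& params, bool debug) {
  // An unqualified `array_of<...>(...)` is not visible from this namespace;
  // qualifying the name resolves the lookup.
  const auto constraints =
      c10_sketch::array_of<bool (*)(sdp_params const&, bool)>(check_a, check_b);
  for (const auto& check : constraints) {
    if (!check(params, debug)) {
      return false;
    }
  }
  return true;
}
}  // namespace sdp_sketch
```

Whether the actual fix qualified the call or added a using-declaration is not stated in this thread.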

mingfeima · 2025-01-02T06:39:28Z

@EikanWang @liangan1 thoughts?

Pure SYCL path for int4 GEMM.

Benchmark results on PVC-1100. The remaining gap comes from not using 2D load.

| M | K | N | SrcT | WeiT | DstT | Bandwidth usage |
|---|-------|-------|---------|---------|---------|-------|
| 1 | 4096 | 4096 | float16 | float16 | float16 | 53.7% |
| 1 | 4096 | 11008 | float16 | float16 | float16 | 57.4% |
| 1 | 4096 | 16384 | float16 | float16 | float16 | 59.7% |
| 1 | 12288 | 4096 | float16 | float16 | float16 | 77.3% |

Besides PVC, the kernel can achieve:
- 92.7% bandwidth usage on MTL
- 84.7% bandwidth usage on A750

---------

Co-authored-by: Yutao Xu <yutao.xu@intel.com>
Co-authored-by: mengfei25 <mengfei.li@Intel.com>
Co-authored-by: LuFengqing <fengqing.lu@intel.com>
Co-authored-by: Ratnam Parikh <114774508+ratnampa@users.noreply.github.com>
Co-authored-by: Feng Yuan <feng1.yuan@intel.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: ZhiweiYan-96 <zhiwei.yan@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>

sunjiweiswift changed the title ~~Fp zp~~ linear_int4_kernel for XPU Nov 29, 2024

mingfeima requested changes Dec 2, 2024

toyxu and others added 2 commits December 3, 2024 14:55

[Release-2.6] Capture rrelu_with_noise noise mutation in compile (#1145)

7ecb0b1

Resolve: pytorch/pytorch#142102

sunjiweiswift force-pushed the fp_zp branch 2 times, most recently from faa79b7 to 5a08d2e Compare December 9, 2024 05:25

airMeng and others added 15 commits December 11, 2024 09:07

contiguous layout for sycl int4 kernel

5410f51

push without compile

e9311a3

update linearkernel

e3eaffa

fix some comiple error(not all)

2a664af

add sycl_ker_config_convention

0156ba5

reg kernel for pytorch

a58afec

add yaml for int4mm

f487b20

update yaml file

ce1c894

Modified some review comments

d61b198

modify fun name

d76a0ce

autogen: _weight_int4pack_mm_with_scales_and_zeros.out

870a3b5

param int->int64_t(python int is int64)

a9627f6

use AT_DISPATCH_FLOATING_TYPES_AND

952ead9

Keep the same name as pytorch's _weight_int4pack_mm

93804f9

modify UT for int4

9e50b68

sunjiweiswift force-pushed the fp_zp branch from 2424d54 to 4dfd8bd Compare December 12, 2024 07:13

sync UT with pytoch UT(linalg)

81a72f1

sunjiweiswift force-pushed the fp_zp branch from 4dfd8bd to 81a72f1 Compare December 12, 2024 07:15

sunjiweiswift added 3 commits December 12, 2024 07:23

col-major

a70df0a

UT pass for B ones

c08382c

update gemv

14bb4e0

sunjiweiswift force-pushed the fp_zp branch from 78433cb to d6a2f3a Compare December 18, 2024 09:07

sunjiweiswift and others added 2 commits December 20, 2024 05:29

save

27f18c2

Merge branch 'main' into fp_zp

7f94b9b

airMeng requested a review from mingfeima December 20, 2024 07:09

airMeng reviewed Dec 20, 2024

Comment thread test/xpu/test_linalg_xpu.py Outdated

bugfix for Big Endian

42c18e9

airMeng reviewed Dec 20, 2024

Comment thread src/ATen/native/xpu/sycl/LinearInt4.cpp Outdated

Unify BF16 and FP16 Funtion

d832050

sunjiweiswift force-pushed the fp_zp branch from 6ecfa50 to d832050 Compare December 20, 2024 09:27

sunjiweiswift requested a review from airMeng December 20, 2024 09:33

fix compile warning

8385f7e

airMeng reviewed Dec 20, 2024

Comment thread src/ATen/native/xpu/sycl/LinearInt4.cpp Outdated

airMeng reviewed Dec 20, 2024

Comment thread src/ATen/native/xpu/sycl/LinearInt4.cpp Outdated

modify by review

f44ed70

mingfeima approved these changes Dec 23, 2024

sunjiweiswift added 3 commits December 24, 2024 16:58

Merge branch 'main' into fp_zp

09696b1

Merge branch 'main' into fp_zp

ebe8c7c

Merge branch 'main' into fp_zp

ce6c16b

mingfeima requested review from EikanWang and liangan1 January 2, 2025 06:39

sunjiweiswift added 2 commits January 3, 2025 09:45

Merge branch 'main' into fp_zp

dacf3b9

Merge branch 'main' into fp_zp

8a5c000

sunjiweiswift added this pull request to the merge queue Jan 6, 2025

Merged via the queue into main with commit d4432d0 Jan 6, 2025

sunjiweiswift deleted the fp_zp branch January 6, 2025 08:25

airMeng mentioned this pull request Jan 17, 2025

INT4 XPU enabling pytorch/ao#1577

Merged
