
linear_int4_kernel for XPU#1130

Merged
sunjiweiswift merged 36 commits into main from fp_zp on Jan 6, 2025

Conversation

@sunjiweiswift sunjiweiswift commented Nov 29, 2024

Pure SYCL path for int4 GEMM.

Benchmark results on PVC-1100. The remaining gap comes from not using 2D load.

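| M | K | N | SrcT | WeiT | DstT | Bandwidth usage |
|---|-------|-------|---------|---------|---------|-------|
| 1 | 4096 | 4096 | float16 | float16 | float16 | 53.7% |
| 1 | 4096 | 11008 | float16 | float16 | float16 | 57.4% |
| 1 | 4096 | 16384 | float16 | float16 | float16 | 59.7% |
| 1 | 12288 | 4096 | float16 | float16 | float16 | 77.3% |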

Besides PVC, the kernel can achieve:
- 92.7% bandwidth usage on MTL
- 84.7% bandwidth usage on A750

Reset to
bfdbaf4

---------

Co-authored-by: mengfei25 <mengfei.li@Intel.com>
Co-authored-by: LuFengqing <fengqing.lu@intel.com>
Co-authored-by: Ratnam Parikh <114774508+ratnampa@users.noreply.github.com>
Co-authored-by: Feng Yuan <feng1.yuan@intel.com>
@sunjiweiswift sunjiweiswift changed the title from "Fp zp" to "linear_int4_kernel for XPU" Nov 29, 2024

@mingfeima mingfeima left a comment

The biggest question is why we need post-op fusion here. Does PyTorch have it with CUDA?

Comment thread src/ATen/native/xpu/sycl/LinearInt4.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/LinearInt4.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/LinearInt4.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/LinearInt4.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/LinearInt4.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/LinearInt4.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/LinearInt4.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/LinearInt4.cpp Outdated
Comment thread test/xpu/test_int4_linear.py Outdated
Comment thread test/xpu/test_int4_linear.py Outdated
@mingfeima

@liangan1 CC

@mingfeima

@sunjiweiswift for the perf benchmarking, please include configs other than M=1. These would serve as a reference for the final decision. I expect that large M will have worse perf, but that's fine; we still need to know the numbers.

toyxu and others added 2 commits December 3, 2024 14:55
#### Bugfix

- [add lazy init for empty_xpu](#1115)
- [nan propagation for soft_shrink](https://github.com/intel/torch-xpu-ops/pull/1116/files#diff-b7cb5876d000db957286c8b0e72badb2b7502402c8955334f1cc21c34c98a5b9)

---------

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: ZhiweiYan-96 <zhiwei.yan@intel.com>
@sunjiweiswift sunjiweiswift force-pushed the fp_zp branch 2 times, most recently from faa79b7 to 5a08d2e Compare December 9, 2024 05:25
@airMeng airMeng requested a review from mingfeima December 20, 2024 07:09
Comment thread test/xpu/test_linalg_xpu.py Outdated
Comment thread src/ATen/native/xpu/sycl/LinearInt4.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/LinearInt4.cpp Outdated
Comment thread src/ATen/native/xpu/sycl/LinearInt4.cpp Outdated

@mingfeima mingfeima left a comment

generally LGTM. good job:)

```cpp
for (int i = 0; i < k; i += GroupK * Unroll) {
#pragma unroll
  for (int iu = 0; iu < Unroll; iu++) {
    uint8_t tmps8[TileK / 2];
```

Maybe we can use a little template trick to simplify this piece of logic: have one template that handles all scenarios, then pass the corresponding args when calling it.

```cpp
template <typename scalar_t, int SgSize, int TileK, int Unroll>
void tinygemm_kernel(...)

if (k % (SgSize * 32 * Unroll) == 0) {
  // use tinygemm_kernel<...>
} else {
  // use tinygemm_kernel<...>
}
```

not a must to have, just a little trick.

@airMeng

airMeng commented Dec 23, 2024

@xytintel CI has failed not only for this PR but also for the last several runs. Could you check?

2024-12-23T02:49:25.9617774Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp: In function ‘bool sdp::can_use_mem_efficient_attention(sdp::sdp_params, bool)’:
2024-12-23T02:49:25.9624103Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp:37:7: error: ‘array_of’ was not declared in this scope; did you mean ‘c10::array_of’?
2024-12-23T02:49:25.9625568Z    37 |       array_of<bool (*)(sdp_params const&, bool)>(
2024-12-23T02:49:25.9626012Z       |       ^~~~~~~~
2024-12-23T02:49:25.9626381Z       |       c10::array_of
2024-12-23T02:49:25.9627538Z In file included from /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp:2:
2024-12-23T02:49:25.9629144Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/c10/util/Array.h:14:23: note: ‘c10::array_of’ declared here
2024-12-23T02:49:25.9630059Z    14 | inline constexpr auto array_of(T&&... t) -> std::array<V, sizeof...(T)> {
2024-12-23T02:49:25.9630604Z       |                       ^~~~~~~~
2024-12-23T02:49:25.9631919Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp:37:16: error: expected primary-expression before ‘bool’
2024-12-23T02:49:25.9633177Z    37 |       array_of<bool (*)(sdp_params const&, bool)>(
2024-12-23T02:49:25.9633623Z       |                ^~~~
2024-12-23T02:49:25.9635453Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp:50:18: error: expected primary-expression before ‘bool’
2024-12-23T02:49:25.9636852Z    50 |         array_of<bool (*)(sdp_params const&, bool)>(
2024-12-23T02:49:25.9637304Z       |                  ^~~~
2024-12-23T02:49:25.9638691Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp:63:18: error: expected primary-expression before ‘bool’
2024-12-23T02:49:25.9639958Z    63 |         array_of<bool (*)(sdp_params const&, bool)>(
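
The compiler's note in the log above points at namespace qualification (`c10::array_of`). The sketch below reproduces that pattern in isolation; `demo`, `sdp_demo`, `check_a`, `check_b`, and `can_use` are hypothetical names, and `demo::array_of` only mimics the signature printed in the log. It does not show the fix that was actually applied upstream.

```cpp
#include <array>
#include <cstddef>
#include <utility>

namespace demo {  // hypothetical stand-in for the namespace that owns array_of
template <typename V, typename... T>
inline constexpr auto array_of(T&&... t) -> std::array<V, sizeof...(T)> {
  return {{std::forward<T>(t)...}};
}
}  // namespace demo

namespace sdp_demo {  // hypothetical stand-in for the calling code
struct sdp_params {
  int head_dim;
};

inline bool check_a(sdp_params const& p, bool) { return p.head_dim > 0; }
inline bool check_b(sdp_params const& p, bool) { return p.head_dim % 8 == 0; }

inline bool can_use(sdp_params const& p, bool debug) {
  // An unqualified `array_of<...>(...)` is not visible from this namespace,
  // so the call is written with its namespace, as the compiler note suggests.
  const auto constraints =
      demo::array_of<bool (*)(sdp_params const&, bool)>(check_a, check_b);
  for (auto check : constraints) {
    if (!check(p, debug)) {
      return false;
    }
  }
  return true;
}
}  // namespace sdp_demo
```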

@mingfeima

@EikanWang @liangan1 thoughts?

@sunjiweiswift sunjiweiswift added this pull request to the merge queue Jan 6, 2025
Merged via the queue into main with commit d4432d0 Jan 6, 2025
@sunjiweiswift sunjiweiswift deleted the fp_zp branch January 6, 2025 08:25
ZhiweiYan-96 added a commit that referenced this pull request Jan 16, 2025
Pure SYCL path for int4 GEMM.

Benchmark results on PVC-1100. The remaining gap comes from not using 2D load.

| M | K | N | SrcT | WeiT | DstT | Bandwidth usage |
|---|-------|-------|---------|---------|---------|-------|
| 1 | 4096 | 4096 | float16 | float16 | float16 | 53.7% |
| 1 | 4096 | 11008 | float16 | float16 | float16 | 57.4% |
| 1 | 4096 | 16384 | float16 | float16 | float16 | 59.7% |
| 1 | 12288 | 4096 | float16 | float16 | float16 | 77.3% |

Besides PVC, the kernel can achieve:
- 92.7% bandwidth usage on MTL
- 84.7% bandwidth usage on A750

---------

Co-authored-by: Yutao Xu <yutao.xu@intel.com>
Co-authored-by: mengfei25 <mengfei.li@Intel.com>
Co-authored-by: LuFengqing <fengqing.lu@intel.com>
Co-authored-by: Ratnam Parikh <114774508+ratnampa@users.noreply.github.com>
Co-authored-by: Feng Yuan <feng1.yuan@intel.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: ZhiweiYan-96 <zhiwei.yan@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>