sycl : Implemented reorder Q4_K mmvq#13109
Conversation
NeoZhangJianyu
left a comment
There was a problem hiding this comment.
- Could you share the GPU type of above test result?
- Have you test the PR by local UT?
- Could you check the detailed output of Q4_K LLM?
I guess the output should be different to legacy code.
ggml/src/ggml-sycl/ggml-sycl.cpp
Outdated
There was a problem hiding this comment.
I have same comment and concern:
This code will impact the code path of below. That would lead to the wrong result.
I suggest this PR only optimize the mmvq() function.
You could add another PR to optimize by changing the code path, like by this sentence.
There was a problem hiding this comment.
OK! Let's focus on the review of #12858 firtly.
ggml/src/ggml-sycl/ggml-sycl.cpp
Outdated
There was a problem hiding this comment.
Could you follow the solution of PR #13003?
It fixed base issue of reorder Q4_0.
There was a problem hiding this comment.
I rebased this branch and will rebase it again when #12858 is merged, so these changes should make it into this PR.
There was a problem hiding this comment.
Depending on the discussion WRT opt_for_reorder and how to call the function, this will require another rebase, sorry about that. Will keep it to a single commit so you can cherry pick with no issues.
There was a problem hiding this comment.
OK! Let's focus on the review of #12858 firtly.
|
@sgeor255 We need promote SYCL backend in related cases. :) |
d1f5b2d to
105a01d
Compare
|
I rebased the PR on @Alcpz 's latest changes & updated the description with more performance numbers. |
|
@NeoZhangJianyu to answer your questions:
I updated the PR description with results from more devices.
Unit tests pass locally (if I understood the question correctly?).
I ran the example script with master @ 8936784This PR |
|
@sgeor255 I cannot resolve my comments (because the "resolve conversation " button is just isn't there for me), consider them resolved 👍🏻 |
ggml/src/ggml-sycl/ggml-sycl.cpp
Outdated
There was a problem hiding this comment.
Depending on the discussion WRT opt_for_reorder and how to call the function, this will require another rebase, sorry about that. Will keep it to a single commit so you can cherry pick with no issues.
105a01d to
685e02b
Compare
https://github.com/NeoZhangJianyu there's a small improvement for this model too:
build: 105a01d7 (5223)
build: 105a01d7 (5223) |
685e02b to
d7e5179
Compare
d7e5179 to
bb10b4a
Compare
bb10b4a to
f7e7d2a
Compare
|
This PR is now rebased on master as #12858 was merged. |
f7e7d2a to
2a2aef0
Compare
|
|
||
| static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { | ||
|
|
||
| static bool can_use_dequantize_mul_mat_vec(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { |
There was a problem hiding this comment.
| static bool can_use_dequantize_mul_mat_vec(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { | |
| static bool choose_dequantize_mul_mat_vec(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { |
There was a problem hiding this comment.
It's nit but can_use seems more accurate to me since there are more logic later on to make the final decision on the matmul implementation.
There was a problem hiding this comment.
We should use one function to detect if need call deq_mul_mat_vec(), instead of several.
There was a problem hiding this comment.
What I mean is that the final choice depends on the outputs from can_use_dequantize_mul_mat_vec and can_use_mul_mat_vec_q so it can't all be in a single choose_dequantize_mul_mat_vec currently.
There was a problem hiding this comment.
Yes, I agree with @Rbiessy and that's why it is currently implemented this way.
| src0->ne[0] % GGML_SYCL_DMMV_X == 0 && src1->ne[1] == 1; | ||
| } | ||
|
|
||
| static bool can_use_mul_mat_vec_q(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { |
There was a problem hiding this comment.
| static bool can_use_mul_mat_vec_q(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { | |
| static bool choose_mul_mat_vec_q(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { |
|
Merging now since this PR includes an important fix with the reorder optimization mentioned here: #13109 (comment) |
This PR enables reorder optimization for Q4_K layout similarly to #12858 . This branch is based off of @Alcpz 's and before that is merged the easiest way to review it is looking at the diff for d1f5b2d740970c22de4a1cba7e0df0de0739831e .
Some performance numbers below:
Lunar lake
GGML_SYCL_DISABLE_OPT=0build: 105a01d7 (5223)
GGML_SYCL_DISABLE_OPT=1build: 105a01d7 (5223)
Arc B580 (Battlemage)
GGML_SYCL_DISABLE_OPT=0build: 105a01d7 (5223)
GGML_SYCL_DISABLE_OPT=1build: 105a01d7 (5223)
Arc A770
GGML_SYCL_DISABLE_OPT=0build: 105a01d7 (5223)
GGML_SYCL_DISABLE_OPT=1build: 105a01d7 (5223)