ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel #19132
taronaeo merged 7 commits into ggml-org:master from
Conversation
cc @Alcpz
Regarding the CI failure: when I ran the same command on my system, it built correctly with no issues. Can we check or rerun the CI pipeline? We have not made any changes in CMake or the x86 code. I am attaching the logs.
constexpr int q8_k_blocklen = 4;
const uint8x16_t m4b = vdupq_n_u8(0x0f);
#if defined(__aarch64__) && defined(__ARM_FEATURE_SVE) && defined(__ARM_FEATURE_MATMUL_INT8)
if (svcntb()*8 == 256) {
}
// q8_ptr[b].qs has interleaved Q8 rows (01, 23)
// const int8_t * q8_base = q8_ptr[b].qs + sb * 256;
There is redundant commented-out code. Some comments could be improved a bit as well.
for (int y = 0; y < nr / q8_k_blocklen; y++) {
    const block_q8_Kx4 * GGML_RESTRICT q8_ptr = (const block_q8_Kx4 *) vy + (y * nb);
    const block_q8_Kx4 * GGML_RESTRICT q8_ptr_1 = (const block_q8_Kx4 *) vy + (y * nb);
I don't understand the need for the same variable twice; I don't see it being used in a way that makes this necessary. Either clarify or clean up.
acc_f32_67 = svdup_n_f32(0);

for (int b = 0; b < nb; b++) {
    // bsums pairs belongs to the same q8_k subblock // 64 elemnts loaded and made sum of 0-7 and 8-15 sum || 16-23 and 24 - 31 sum
Suggested change:
- // bsums pairs belongs to the same q8_k subblock // 64 elemnts loaded and made sum of 0-7 and 8-15 sum || 16-23 and 24 - 31 sum
+ // bsums pairs belongs to the same q8_k subblock
+ // 64 elements loaded and made sum of 0-7 and 8-15 sum || 16-23 and 24-31 sum
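For readers unfamiliar with the layout the comment describes: Q8_K keeps one partial sum per 16 quants in `bsums`, and adjacent pairs of those sums belong to the same 32-element subblock. A hedged scalar sketch of that pairing (the names mirror ggml's Q8_K layout, but `pair_bsums` itself is illustrative, not the PR's vectorized code):

```cpp
#include <cstdint>

// Scalar sketch: collapse Q8_K's 16 per-16-element partial sums into one
// sum per 32-element subblock by adding adjacent pairs. The vectorized
// code does the same reduction with pairwise-add instructions.
constexpr int QK_K = 256;                 // super-block size, as in ggml

void pair_bsums(const int16_t bsums[QK_K / 16], int32_t out[QK_K / 32]) {
    for (int i = 0; i < QK_K / 32; ++i) {
        out[i] = (int32_t) bsums[2 * i] + (int32_t) bsums[2 * i + 1];
    }
}
```

These per-subblock sums are what the q4_K mins are multiplied against, which is why the kernel wants them paired up front.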
The server failures are due to changes in the CI. If you rebase on top of master you should get rid of those. I also saw some issues with the `x86 high performance` job failing on other pipelines, but as you say, this is not caused here.
@Alcpz rebase and format-related changes are pushed. Thank you!
Hi @ggerganov and @Alcpz, please review the code. I don't know why @loci-dev is not showing the performance gain; my guess is it might not be utilizing the SVE code. I am glad to help if you need anything else. Thank you.
I've already given my review; unfortunately, I don't have access to an SVE system, so I can't really dig further into it. I'm trusting everything was tested accordingly. Since it's GEMM, perplexity should be enough to detect failures. Sorry I can't be of more help.
This comment was marked as outdated.
Hi @taronaeo, thanks for running the benchmark. Graviton 4 has an SVE vector length of 128 bits, while the current code is written for a 256-bit SVE vector length. So when you run this PR on that machine, it utilizes neither the NEON nor the SVE code and falls back to the generic function, which is slower. I have now modified the code so that if 256-bit SVE is not present, the NEON code is used instead. With these changes, you should see similar performance for both NEON and this PR. I am attaching my runtime results for NEON. I hope this clears your doubt. Thank you.
Great catch, I forgot about that. I can retest it on Graviton 3, where 256-bit SVE is available, and update the benchmarks again :)
I've tested this using an AWS
Merge on green. |
Can you rebase this PR with
…nce slowing the performance. So added code if SVE 256 is not present then use NEON code.
Hi @taronaeo, @ggerganov, could you please assist with the CI failure? The failure appears to be related to riscv64, and I am unable to reproduce the issue locally.
The CI failure is consistent across other PRs as well and unrelated to this PR AFAICT. Will merge now.
* upstream/master: (88 commits)
  ci : bump komac version (ggml-org#19682)
  build : link ws2_32 as PUBLIC on Windows (ggml-org#19666)
  build : cleanup library linking logic (ggml-org#19665)
  convert : add JoyAI-LLM-Flash (ggml-org#19651)
  perplexity: add proper batching (ggml-org#19661)
  common : inline functions (ggml-org#18639)
  ggml : make `ggml_is_view` as API (ggml-org#19539)
  model: Add support for Tiny Aya Models (ggml-org#19611)
  build : rework llama_option_depr to handle LLAMA_CURL (ggml-org#19658)
  Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm veresions (ggml-org#19591)
  models : deduplicate delta-net graphs for Qwen family (ggml-org#19597)
  graph : fix KQ mask, lora, cvec reuse checks (ggml-org#19644)
  ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel (ggml-org#19132)
  sync : ggml
  ggml : bump version to 0.9.7 (ggml/1425)
  ggml : bump version to 0.9.6 (ggml/1423)
  cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (ggml-org#19624)
  docs: update s390x build docs (ggml-org#19643)
  build : remove LLAMA_HTTPLIB option (ggml-org#19623)
  cmake : check if KleidiAI API has been fetched (ggml-org#19640)
  ...
…9132)

* Updated repack.cpp
* Updated repack.cpp
* Updated repack.cpp
* Added if condition to support only vector length 256.
* Changed the format removed comments and duplicate variable
* If SVE 256 not present then was using generic function to compute, hence slowing the performance. So added code if SVE 256 is not present then use NEON code.
* Code format change suggestion

---------

Co-authored-by: Vithule, Prashant <Prashant.Vithule@fujitsu.com>
This PR introduces support for SVE (Scalable Vector Extension) kernels for the q4_K_q8_K gemm using i8mm and vector instructions. ARM NEON support for this kernel was added in PR #16739.
Verifying Feature
----------------------------------------------------------------------------
This PR contains the SVE implementation of the gemm used to compute the Q4_K quantization.
Kernel: ggml_gemm_q4_K_8x8_q8_K()
By running a Q4_K_M quantized model of Llama-3.1-8B, I checked the generation output.
I also verified that the perplexity matches between the NEON and SVE implementations.
This change does not appear to have any impact on accuracy.
The command used to measure the perplexity is
Performance Check
----------------------------------------------------------------------------
This PR improves the Prompt Eval time (TTFT) of LLM inference by 17-20% compared to NEON (PR #16739).
The performance was measured on a Graviton3E with 64 cores.
Performance improves as follows; the values are tokens per second.
The command used to measure the performance is
This work is a contribution of @Vithulep and @abhijain1204fujitsu