ggml-cpu: handle 3d tensors in repack mat_mul #17241
max-krasnyansky merged 7 commits into ggml-org:master
Conversation
@max-krasnyansky can you please give this PR a shot and let me know if the perf is fixed? I've simplified the chunking a lot (essentially left it as it is for 2d tensors and "iterate" over planes).
Yep, looks great! Thanks for the quick follow-up. I'm marking it as ready to merge and approving. BTW, if you have some more time/energy, it'd be great to add chunking to the repacked mul_mat_id for the MoE models.
Thanks @max-krasnyansky. I'll be collaborating every now and then, but I have a couple of implementations of the repacked q4_K to address first. Not sure if you are able to merge; if not, I'll just ping gerganov once CI passes. Thanks again! Edit: Not sure if by marking it "ready to merge" you meant to merge once CI passed.
I meant switching from "Draft" to "Ready" :) |
* ggml-cpu: handle 3d tensors in repack mul_mat
* Removed unnecessary branch, removed need for <algorithm>
* Fixed dst_ptr pointer in chunk + clang-format
* GGML_ASSERT to check wdata within bounds
* Accidental ggml.h inclusion
* Improved GGML_ASSERT on wdata boundaries
* Address performance regression in Qwen and llama.cpp due to chunking
This is a continuation of #17030 after a performance regression was reported.
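The approach discussed above (keeping the existing 2d chunking and iterating over planes for the 3d case) can be sketched as follows. This is a hypothetical standalone illustration, not the actual ggml repack code; all names, sizes, and the naive kernel are invented for the example:

```c
#include <assert.h>

/* Toy dimensions for the sketch: dst is M x N per plane, src0 is M x K,
 * src1 is K x N, with PLANES independent 2d slices (the "3d" batch dim). */
enum { M = 4, N = 3, K = 2, PLANES = 2, CHUNK = 2 };

/* Naive 2d matmul kernel restricted to dst rows [ir0, ir1) — this stands
 * in for the existing 2d chunked path, which is left unchanged. */
static void mul_mat_2d_chunk(const float *a, const float *b, float *dst,
                             int ir0, int ir1) {
    for (int i = ir0; i < ir1; i++) {
        for (int j = 0; j < N; j++) {
            float s = 0.0f;
            for (int k = 0; k < K; k++) {
                s += a[i*K + k] * b[k*N + j];
            }
            dst[i*N + j] = s;
        }
    }
}

/* 3d case: run the unchanged 2d chunking once per plane, offsetting the
 * base pointers by the plane stride instead of changing the chunk logic. */
static void mul_mat_3d(const float *a, const float *b, float *dst) {
    for (int p = 0; p < PLANES; p++) {
        const float *ap  = a   + p*M*K;
        const float *bp  = b   + p*K*N;
        float       *dp  = dst + p*M*N;
        for (int ir0 = 0; ir0 < M; ir0 += CHUNK) {
            int ir1 = ir0 + CHUNK < M ? ir0 + CHUNK : M;
            mul_mat_2d_chunk(ap, bp, dp, ir0, ir1);
        }
    }
}
```

In the real code the per-plane offsets would come from the tensor's strides rather than dense `M*K`-style products, but the structure (outer plane loop, unmodified inner 2d chunk loop) is the idea the comment describes.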
Perplexity Comparison (Repack vs Non-Repack)

Command:

Llama-bench (M4 Max)

build: c77bafd (6967) THIS PR
build: 2776db6 (7047) MASTER