Q4/Q8 Tiled GEMM Optimization #16999
Conversation
This patch implements tiled GEMM for large blocks, where we pack blocks of 64x64 and perform matmul. 30-50% improvement in llama-bench and llama-batched-bench with Meta-Llama3-8B quantized models (Q4_0 and Q8_0). Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
@taronaeo Can you please review this PR?

@ggerganov Can you please review this PR?
```cpp
#include <pthread.h>

typedef vector unsigned char vec_t;
typedef __vector_quad acc_t;

static pthread_key_t t_data_key;

typedef struct {
    vec_t* A_pack;
    vec_t* B_pack;
    int*   comparray;
} thread_scratchpad_t;

void thread_cleanup(void* arg) {
    thread_scratchpad_t* data = (thread_scratchpad_t*)arg;
    if (data) {
        delete[] data->A_pack;
        delete[] data->B_pack;
        delete[] data->comparray;
        delete data;
    }
}

static bool key_created = false;
```
|
|
It would be better to avoid dynamic allocations - none of the code currently uses those. The mechanism for this is to use the wdata from ggml_compute_params to store scratch data. You'll need to reserve the worst-case wsize for your case.
@ggerganov Thank you for the input. I tried to avoid dynamic allocation, but lost performance without the pthread-based code. Below are the code and a performance comparison after integrating the thread-local scratchpad using wdata.
```cpp
void matmul_tiled(const ggml_compute_params* params,
                  int64_t m, int64_t n, int64_t mc, int64_t nc, int64_t kc) {
    char* wdata = (char*) params->wdata;
    constexpr size_t ALIGN = 128;
    auto align_ptr = [&](char* ptr, size_t alignment) {
        return (char*)(((uintptr_t)ptr + alignment - 1) & ~(alignment - 1));
    };
    char* ptr = align_ptr(wdata, ALIGN);
    vec_t* A_pack = (vec_t*)ptr; ptr += sizeof(vec_t) * mc * kc * 2;
    vec_t* B_pack = (vec_t*)ptr; ptr += sizeof(vec_t) * nc * kc * 2;
    int* comparray = (int*)align_ptr(ptr, ALIGN); // integer part aligned too
    ptr += sizeof(int) * mc * kc;
    // rest of the original matmul_tiled() code unchanged
}
```
| Benchmark (llama-bench) | Baseline | pthread-based TLS | ggml wdata-based TLS |
|---|---|---|---|
| pp128 | 69 t/s | 89 t/s | 36 t/s |
| pp256 | 69 t/s | 94 t/s | 36 t/s |
This regression is likely due to:
- Loss of persistent per-thread cache locality — the previous pthread-based version reused buffers effectively across tiles.
- Higher memory initialization or shared buffer contention across threads.
I have also tried static allocation on the stack with just this code, but it suffers from a similar perf loss (38 t/s):

```cpp
vec_t A_pack[mc*kc*2];
vec_t B_pack[nc*kc*2];
int   comparray[mc*kc];
```

Can you please suggest an alternative?
`void matmul_tiled()`
Your wdata pointer has to take into account the thread ID. For example:
llama.cpp/ggml/src/ggml-cpu/ops.cpp, lines 615 to 617 (commit 7f09a68)
From what I can tell from your codeblock, all threads are currently working on the same wdata range and most likely they are data racing. Unless I'm missing something 🤔
@ggerganov Thank you so much for the input. I have implemented the approach you suggested, but we still get the best performance with pthread-based dynamic memory allocation. Here are the code and results.
```cpp
// Member function of the tinyBLAS_Q0_PPC class template: k, A, lda, B, ldb
// and the template parameter TA come from the enclosing class.
void matmul_tiled(const struct ggml_compute_params* params, int64_t m, int64_t n, int64_t mc, int64_t nc, int64_t kc) {
    const int ith = params->ith;
    const int nth = params->nth;
    const int64_t TILE_SIZE = 64;
    const size_t vec_t_sz = 16;
    const size_t int_sz = 4;
    const size_t align = (size_t)GGML_CACHE_LINE_SIZE;
    const size_t A_raw_bytes = (size_t)TILE_SIZE * (size_t)TILE_SIZE * 2u * vec_t_sz;
    const size_t B_raw_bytes = (size_t)TILE_SIZE * (size_t)TILE_SIZE * 2u * vec_t_sz;
    const size_t C_raw_bytes = (size_t)TILE_SIZE * (size_t)TILE_SIZE * int_sz;
    const size_t A_aligned = GGML_PAD(A_raw_bytes, align);
    const size_t B_aligned = GGML_PAD(B_raw_bytes, align);
    const size_t C_aligned = GGML_PAD(C_raw_bytes, align);
    const size_t S_PER_THREAD_MAX = GGML_PAD(A_aligned + B_aligned + C_aligned, align);
    uint8_t* base_u8 = reinterpret_cast<uint8_t*>(params->wdata);
    uint8_t* thread_base_unaligned = base_u8 + (S_PER_THREAD_MAX + (align - 1)) * (size_t)ith;
    uint8_t* p = (uint8_t*)GGML_PAD((uintptr_t)thread_base_unaligned, align);
    vec_t* A_pack = reinterpret_cast<vec_t*>(p);
    p += A_aligned;
    vec_t* B_pack = reinterpret_cast<vec_t*>(p);
    p += B_aligned;
    int* comparray = reinterpret_cast<int*>(p);
    constexpr bool is_Ablock_q4 = std::is_same_v<TA, block_q4_0>;
    int64_t ytiles = m / mc;
    int64_t xtiles = n / nc;
    int64_t tiles = xtiles * ytiles;
    int64_t duty = (tiles + nth - 1) / nth;
    int64_t start = duty * ith;
    int64_t end = start + duty;
    if (end > tiles) {
        end = tiles;
    }
    for (int64_t job = start; job < end; ++job) {
        int64_t ii = (job / xtiles) * mc;
        int64_t jj = (job % xtiles) * nc;
        for (int64_t kk = 0; kk < k; kk += kc) {
            if constexpr (is_Ablock_q4) {
                packNormalInt4_large(A + ii*lda + kk, lda, mc, 4, (int8_t*)A_pack, comparray);
            } else {
                packNormal_large<int8_t, vector signed char>(A + ii*lda + kk, lda, mc, 8, (int8_t*)A_pack, false, comparray);
            }
            packNormal_large<uint8_t, vector unsigned char>(B + jj*ldb + kk, ldb, nc, 8, (uint8_t*)B_pack, true);
            KERNEL_Q0(ii, jj, mc, nc, kc, kk, A_pack, B_pack, comparray);
        }
    }
}
```
Summary of Thread Model Performance Evaluation (Power10)
We compared three builds of llama.cpp on Power10 for the same configuration (Meta-Llama-3-8B Q4_0, 20 threads, prompt length 128, 1 generated token, measured with llama-bench):
| Build Type | pp128 t/s | Cycles (↓ better) | IPC | Elapsed Time (s) |
|---|---|---|---|---|
| Base (upstream) | 68.08 | 841B | 2.56 | 13.34 |
| GGML thread patch | 52.32 | 1076B | 1.59 | 16.60 |
| Pthread-based patch | 84.27 | 625B | 2.46 | 11.04 |
Observations
- The ggml-thread patch shows a ~25% regression vs base (pp128: 68 → 52 t/s) and a ~28% increase in total cycles, indicating higher synchronization or scheduling overhead.
- The pthread-based version outperforms both:
  - +24% faster than base for pp128 (84.27 vs 68.08 t/s),
  - ~34% fewer cycles and ~17% lower elapsed time (11.0s vs 13.3s),
  - IPC and cache behavior remain healthy and consistent.
Given these results:
- The params->wdata approach adds noticeable overhead on Power10.
- The pthread-based implementation provides clear performance benefits and better scaling with available cores.
Hi @ggerganov
I’ve added a note explaining that the params->wdata approach didn’t provide benefits.
When you have some time, could you please take another look at the patch?
Thank you!
Removed dynamic memory allocation and uploaded new code.
@shalinib-ibm For my understanding, is the performance now as good as with the previous approach?
@ggerganov Thank you for your time.
Pthread-based dynamic memory allocation using malloc was giving ~10% better perf. Below are the results.
```shell
./build_base/bin/llama-bench -m /home/shalini/Models/Meta-Llama-3-8B/ggml-model-q4.gguf -p 16,32,64,128,256 -n 1
```
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp16 | 37.81 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp32 | 38.89 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp64 | 39.05 ± 0.18 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp128 | 38.52 ± 0.09 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp256 | 37.86 ± 0.02 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg1 | 13.59 ± 0.02 |
build: 0bcb40b (6833)
```shell
./build_dynamic_alloc/bin/llama-bench -m /home/shalini/Models/Meta-Llama-3-8B/ggml-model-q4.gguf -p 16,32,64,128,256 -n 1
```
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp16 | 37.87 ± 0.03 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp32 | 38.93 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp64 | 53.33 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp128 | 54.35 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp256 | 52.98 ± 0.00 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg1 | 13.60 ± 0.02 |
build: 2e669d22d (6834)
```shell
./build_patch/bin/llama-bench -m /home/shalini/Models/Meta-Llama-3-8B/ggml-model-q4.gguf -p 16,32,64,128,256 -n 1
```
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp16 | 40.88 ± 0.02 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp32 | 44.79 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp64 | 47.14 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp128 | 48.24 ± 0.02 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp256 | 47.15 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg1 | 13.85 ± 0.01 |
build: 962d4e985 (6836)
@ggerganov I have addressed your review comment. This patch does not use dynamic allocation now.

@taronaeo Can you please review this patch?

I can review this tomorrow, and can only review it at a high level. I have no PPC hardware to test this on.
```shell
./build_base/bin/llama-bench -m /home/shalini/Models/granite-4.0-h-micro-Q8_0.gguf -p 16,32,64,128,256 -n 1
```

base build: 0bcb40b (6833), patch build: 962d4e985 (6836)

```shell
./llama-cli -p 'please write a python program to print nth fibonacci number' -n 128 -no-cnv
```
This commit addresses review comments. Also, we have separated out the legacy mnpack path and the matmul_tiled path for the tinyBLAS_Q0_PPC class. 10-30% improvement in PP speed with Q4_0 and Q8_0 models. Tested with Meta-Llama3-8B quantized models using llama-bench and llama-batched-bench. Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
Thank you @ggerganov. Here are the perplexity results run with the Meta-Llama-3-8B Q4 model.

```shell
yes "The quick brown fox jumps over the lazy dog." | head -n 300 > fox.txt
```

BASE: llama_perf_context_print: load time = 469.24 ms
PATCH: llama_perf_context_print: load time = 469.40 ms
small spacing fix Co-authored-by: Aaron Teo <taronaeo@gmail.com>
@taronaeo Test failed with the below error (not related to the patch, I guess). Can you please check when free?

It's unrelated to this PR. Will continue to merge with master :)

This patch implements tiled GEMM for large blocks, where we pack blocks of 64x64 and perform matmul.
10-30% improvement in llama-bench and llama-batched-bench with Meta-Llama3-8B quantized models (Q4_0 and Q8_0).