Q4/Q8 Tiled GEMM Optimization #16999
Conversation
This patch implements tiled GEMM for large blocks, where we pack blocks of 64x64 and perform matmul. 30-50% improvement in llama-bench and llama-batched-bench with Meta-Llama3-8B quantized models (Q4_0 and Q8_0). Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
@taronaeo Can you please review this PR?

@ggerganov Can you please review this PR?
```cpp
#include <pthread.h>

typedef vector unsigned char vec_t;
typedef __vector_quad acc_t;

static pthread_key_t t_data_key;

typedef struct {
    vec_t* A_pack;
    vec_t* B_pack;
    int*   comparray;
} thread_scratchpad_t;

void thread_cleanup(void* arg) {
    thread_scratchpad_t* data = (thread_scratchpad_t*)arg;
    if (data) {
        delete[] data->A_pack;
        delete[] data->B_pack;
        delete[] data->comparray;
        delete data;
    }
}

static bool key_created = false;
```
|
|
It would be better to avoid dynamic allocations - none of the code currently uses those. The mechanism for this is to use the wdata from ggml_compute_params to store scratch data. You'll need to reserve the worst-case wsize for your case.
@ggerganov Thank you for the input. I tried to avoid dynamic allocation, but lost performance without the pthread-based code. Below are the code and a performance comparison after integrating the thread-local scratchpad using wdata.
```cpp
void matmul_tiled(const ggml_compute_params* params,
                  int64_t m, int64_t n, int64_t mc, int64_t nc, int64_t kc) {
    char* wdata = (char*) params->wdata;
    constexpr size_t ALIGN = 128;
    auto align_ptr = [&](char* ptr, size_t alignment) {
        return (char*)(((uintptr_t)ptr + alignment - 1) & ~(alignment - 1));
    };
    char* ptr = align_ptr(wdata, ALIGN);
    vec_t* A_pack = (vec_t*)ptr; ptr += sizeof(vec_t) * mc * kc * 2;
    vec_t* B_pack = (vec_t*)ptr; ptr += sizeof(vec_t) * nc * kc * 2;
    int* comparray = (int*)align_ptr(ptr, ALIGN); // integer part aligned too
    ptr += sizeof(int) * mc * kc;
    // rest of the original matmul_tiled() code unchanged
}
```
| Benchmark (llama-bench) | Baseline | pthread-based TLS | ggml wdata-based TLS |
|---|---|---|---|
| pp128 | 69 t/s | 89 t/s | 36 t/s |
| pp256 | 69 t/s | 94 t/s | 36 t/s |
This regression is likely due to:
- Loss of persistent per-thread cache locality — the previous pthread-based version reused buffers effectively across tiles.
- Higher memory initialization or shared buffer contention across threads.
I have also tried static allocation on the stack with just this code, but it suffers from a similar perf loss (38 t/s):

```cpp
vec_t A_pack[mc*kc*2];
vec_t B_pack[nc*kc*2];
int   comparray[mc*kc];
```

Can you please suggest an alternative?
`void matmul_tiled()`
Your wdata pointer has to take into account the thread ID. For example:
llama.cpp/ggml/src/ggml-cpu/ops.cpp, lines 615 to 617 (commit 7f09a68)
From what I can tell from your codeblock, all threads are currently working on the same wdata range and most likely they are data racing. Unless I'm missing something 🤔
@ggerganov Thank you so much for the input. I have implemented the approach you suggested, but we still get the best performance with pthread-based dynamic memory allocation. Here are the code and results.
```cpp
// Member function of the tinyBLAS_Q0_PPC class template: k, A, lda, B, ldb
// and the template parameter TA come from the enclosing class.
void matmul_tiled(const struct ggml_compute_params* params, int64_t m, int64_t n, int64_t mc, int64_t nc, int64_t kc) {
    const int ith = params->ith;
    const int nth = params->nth;
    const int64_t TILE_SIZE = 64;
    const size_t vec_t_sz = 16;
    const size_t int_sz = 4;
    const size_t align = (size_t)GGML_CACHE_LINE_SIZE;
    const size_t A_raw_bytes = (size_t)TILE_SIZE * (size_t)TILE_SIZE * 2u * vec_t_sz;
    const size_t B_raw_bytes = (size_t)TILE_SIZE * (size_t)TILE_SIZE * 2u * vec_t_sz;
    const size_t C_raw_bytes = (size_t)TILE_SIZE * (size_t)TILE_SIZE * int_sz;
    const size_t A_aligned = GGML_PAD(A_raw_bytes, align);
    const size_t B_aligned = GGML_PAD(B_raw_bytes, align);
    const size_t C_aligned = GGML_PAD(C_raw_bytes, align);
    const size_t S_PER_THREAD_MAX = GGML_PAD(A_aligned + B_aligned + C_aligned, align);
    uint8_t* base_u8 = reinterpret_cast<uint8_t*>(params->wdata);
    uint8_t* thread_base_unaligned = base_u8 + (S_PER_THREAD_MAX + (align - 1)) * (size_t)ith;
    uint8_t* p = (uint8_t*)GGML_PAD((uintptr_t)thread_base_unaligned, align);
    vec_t* A_pack = reinterpret_cast<vec_t*>(p);
    p += A_aligned;
    vec_t* B_pack = reinterpret_cast<vec_t*>(p);
    p += B_aligned;
    int* comparray = reinterpret_cast<int*>(p);
    constexpr bool is_Ablock_q4 = std::is_same_v<TA, block_q4_0>;
    int64_t ytiles = m / mc;
    int64_t xtiles = n / nc;
    int64_t tiles = xtiles * ytiles;
    int64_t duty = (tiles + nth - 1) / nth;
    int64_t start = duty * ith;
    int64_t end = start + duty;
    if (end > tiles) {
        end = tiles;
    }
    for (int64_t job = start; job < end; ++job) {
        int64_t ii = (job / xtiles) * mc;
        int64_t jj = (job % xtiles) * nc;
        for (int64_t kk = 0; kk < k; kk += kc) {
            if constexpr (is_Ablock_q4) {
                packNormalInt4_large(A + ii*lda + kk, lda, mc, 4, (int8_t*)A_pack, comparray);
            } else {
                packNormal_large<int8_t, vector signed char>(A + ii*lda + kk, lda, mc, 8, (int8_t*)A_pack, false, comparray);
            }
            packNormal_large<uint8_t, vector unsigned char>(B + jj*ldb + kk, ldb, nc, 8, (uint8_t*)B_pack, true);
            KERNEL_Q0(ii, jj, mc, nc, kc, kk, A_pack, B_pack, comparray);
        }
    }
}
```
Summary of Thread Model Performance Evaluation (Power10)
We compared three builds of llama.cpp on Power10 for the same configuration (Meta-Llama-3-8B Q4_0, 20 threads, prompt length 128, 1 generated token, measured with llama-bench):
| Build Type | pp128 t/s | Cycles (↓ better) | IPC | Elapsed Time (s) |
|---|---|---|---|---|
| Base (upstream) | 68.08 | 841B | 2.56 | 13.34 |
| GGML thread patch | 52.32 | 1076B | 1.59 | 16.60 |
| Pthread-based patch | 84.27 | 625B | 2.46 | 11.04 |
Observations
- The ggml-thread patch shows a ~25% regression vs base (pp128: 68 → 52 t/s) and a ~28% increase in total cycles, indicating higher synchronization or scheduling overhead.
- The pthread-based version outperforms both:
  - +24% faster than base for pp128 (84.27 vs 68.08 t/s),
  - ~34% fewer cycles and ~17% lower elapsed time (11.0s vs 13.3s),
  - IPC and cache behavior remain healthy and consistent.
Given these results:
- The params->wdata approach adds noticeable overhead on Power10.
- The pthread-based implementation provides clear performance benefits and better scaling with available cores.
Hi @ggerganov
I’ve added a note explaining that the params->wdata approach didn’t provide benefits.
When you have some time, could you please take another look at the patch?
Thank you!
Removed dynamic memory allocation and uploaded new code.
@shalinib-ibm For my understanding, is the performance now as good as with the previous approach?
@ggerganov Thank you for your time.
Pthread-based dynamic memory allocation using malloc was giving ~10% better perf. Below are the results.
```shell
./build_base/bin/llama-bench -m /home/shalini/Models/Meta-Llama-3-8B/ggml-model-q4.gguf -p 16,32,64,128,256 -n 1
```
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp16 | 37.81 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp32 | 38.89 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp64 | 39.05 ± 0.18 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp128 | 38.52 ± 0.09 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp256 | 37.86 ± 0.02 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg1 | 13.59 ± 0.02 |
build: 0bcb40b (6833)
```shell
./build_dynamic_alloc/bin/llama-bench -m /home/shalini/Models/Meta-Llama-3-8B/ggml-model-q4.gguf -p 16,32,64,128,256 -n 1
```
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp16 | 37.87 ± 0.03 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp32 | 38.93 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp64 | 53.33 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp128 | 54.35 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp256 | 52.98 ± 0.00 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg1 | 13.60 ± 0.02 |
build: 2e669d22d (6834)
```shell
./build_patch/bin/llama-bench -m /home/shalini/Models/Meta-Llama-3-8B/ggml-model-q4.gguf -p 16,32,64,128,256 -n 1
```
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp16 | 40.88 ± 0.02 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp32 | 44.79 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp64 | 47.14 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp128 | 48.24 ± 0.02 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp256 | 47.15 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg1 | 13.85 ± 0.01 |
build: 962d4e985 (6836)
@ggerganov I have addressed your review comment. This patch does not use dynamic allocation now.

@taronaeo Can you please review this patch?

I can review this tomorrow, and can only review it at a high level. I have no PPC hardware to test this on.
```shell
./build_base/bin/llama-bench -m /home/shalini/Models/granite-4.0-h-micro-Q8_0.gguf -p 16,32,64,128,256 -n 1
```

base build: 0bcb40b (6833), patch build: 962d4e985 (6836)

```shell
./llama-cli -p 'please write a python program to print nth fibonacci number' -n 128 -no-cnv
```
This commit addresses review comments. Also, we have separated out the legacy mnpack path and the matmul_tiled path for the tinyBLAS_Q0_PPC class. 10-30% improvement in PP speed with Q4_0 and Q8_0 models. Tested with Meta-Llama3-8B quantized models using llama-bench and llama-batched-bench. Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
Thank you @ggerganov. Here are the perplexity results run with the Meta-Llama-3-8B Q4 model.

```shell
yes "The quick brown fox jumps over the lazy dog." | head -n 300 > fox.txt
```

BASE: llama_perf_context_print: load time = 469.24 ms
PATCH: llama_perf_context_print: load time = 469.40 ms
small spacing fix Co-authored-by: Aaron Teo <taronaeo@gmail.com>
@taronaeo Test failed with the below error (not related to the patch, I guess). Can you please check when free?

It's unrelated to this PR. Will continue to merge with master :)

This patch implements tiled GEMM for large blocks, where we pack blocks of 64x64 and perform matmul.
10-30% improvement in llama-bench and llama-batched-bench with Meta-Llama3-8B quantized models (Q4_0 and Q8_0).