
Q4/Q8 Tiled Gemm Optimization.#16999

Merged
taronaeo merged 3 commits into ggml-org:master from shalinib-ibm:q8_q4_opt
Dec 5, 2025

Conversation

@shalinib-ibm
Contributor

@shalinib-ibm shalinib-ibm commented Nov 4, 2025

This patch implements tiled GEMM for large blocks, where we pack blocks of 64x64 and perform matmul.

10 ~ 30% improvement in llama-bench and llama-batched-bench with Meta-Llama3-8B quantized models (Q4_0 and Q8_0).
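As a rough illustration of the tiling idea (a plain float GEMM sketch, not the actual Q4_0/Q8_0 packed kernels; only the 64x64 blocking shape follows the patch description):

```cpp
// Minimal cache-blocked GEMM sketch: C[m x n] += A[m x k] * B[k x n].
// Walks 64x64 output tiles with a 64-deep K panel, the same blocking
// shape described above (the real kernels pack quantized blocks instead).
constexpr int TILE = 64;

void gemm_tiled(const float * A, const float * B, float * C,
                int m, int n, int k) {
    for (int ii = 0; ii < m; ii += TILE) {
        for (int jj = 0; jj < n; jj += TILE) {
            for (int kk = 0; kk < k; kk += TILE) {
                // one 64x64 (or edge-clipped) tile of C
                for (int i = ii; i < ii + TILE && i < m; ++i) {
                    for (int j = jj; j < jj + TILE && j < n; ++j) {
                        float acc = C[i*n + j];
                        for (int p = kk; p < kk + TILE && p < k; ++p) {
                            acc += A[i*k + p] * B[p*n + j];
                        }
                        C[i*n + j] = acc;
                    }
                }
            }
        }
    }
}
```

Keeping one tile of A, B, and C hot in cache while sweeping the K panel is what the packing buffers in the patch exploit.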


This patch implements tiled GEMM for large blocks
where we pack blocks of 64x64 and perform matmul.

30 ~ 50% improvement in llama-bench and llama-batched-bench
with Meta-Llama3-8B quantized models (Q4_0 and Q8_0).

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
@shalinib-ibm
Contributor Author

@taronaeo Can you please review this PR?

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Nov 4, 2025
@shalinib-ibm
Contributor Author

@ggerganov Can you please review this PR?

Comment on lines +122 to +145

#include <pthread.h>

typedef vector unsigned char vec_t;
typedef __vector_quad acc_t;

static pthread_key_t t_data_key;

typedef struct {
    vec_t * A_pack;
    vec_t * B_pack;
    int   * comparray;
} thread_scratchpad_t;

void thread_cleanup(void * arg) {
    thread_scratchpad_t * data = (thread_scratchpad_t *) arg;
    if (data) {
        delete[] data->A_pack;
        delete[] data->B_pack;
        delete[] data->comparray;

        delete data;
    }
}

static bool key_created = false;

Member


It would be better to avoid dynamic allocations - none of the code currently uses those. The mechanism for this is to use the wdata from ggml_compute_params to store scratch data. You'll need to reserve the worst-case wsize for your case.
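For reference, a minimal sketch of what such a worst-case reservation could look like (hypothetical constants and helper names, sized for the 64x64 tiles this PR uses; the real code would compute wsize during graph planning):

```cpp
#include <cstddef>

// Sketch of the wdata scratch convention: reserve worst-case wsize once,
// then let each thread carve an aligned slice instead of heap-allocating.
constexpr size_t TILE      = 64;    // tile edge used by the patch
constexpr size_t CACHELINE = 128;   // hypothetical alignment granule

constexpr size_t pad(size_t x, size_t a) { return (x + a - 1) & ~(a - 1); }

// Worst-case scratch one thread needs: packed A tile, packed B tile,
// and the integer compensation array, each cache-line padded.
constexpr size_t per_thread_scratch() {
    const size_t a_bytes = TILE * TILE * 2 * 16;        // packed A (16-byte vec_t)
    const size_t b_bytes = TILE * TILE * 2 * 16;        // packed B
    const size_t c_bytes = TILE * TILE * sizeof(int);   // comparray
    return pad(a_bytes, CACHELINE) + pad(b_bytes, CACHELINE) + pad(c_bytes, CACHELINE);
}

constexpr size_t worst_case_wsize(int nth) {
    // one slice per thread, plus slack so the base can be aligned up
    return per_thread_scratch() * (size_t) nth + CACHELINE;
}
```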

Contributor Author


@ggerganov Thank you for the input. I tried to avoid dynamic allocation in the code, but lost performance without the pthread-based version. Below is the code and the performance comparison after integrating a thread-local scratchpad using wdata.

void matmul_tiled(const ggml_compute_params * params,
                  int64_t m, int64_t n, int64_t mc, int64_t nc, int64_t kc) {
    char * wdata = (char *) params->wdata;
    constexpr size_t ALIGN = 128;
    auto align_ptr = [&](char * ptr, size_t alignment) {
        return (char *)(((uintptr_t) ptr + alignment - 1) & ~(alignment - 1));
    };
    char * ptr = align_ptr(wdata, ALIGN);
    vec_t * A_pack = (vec_t *) ptr;  ptr += sizeof(vec_t) * mc * kc * 2;
    vec_t * B_pack = (vec_t *) ptr;  ptr += sizeof(vec_t) * nc * kc * 2;
    int * comparray = (int *) align_ptr(ptr, ALIGN);  // integer part aligned too
    ptr += sizeof(int) * mc * kc;
    // rest of the original matmul_tiled() code unchanged
}

| Benchmark (llama-bench) | Baseline | pthread-based TLS | ggml wdata-based TLS |
| --- | --- | --- | --- |
| pp128 | 69 t/s | 89 t/s | 36 t/s |
| pp256 | 69 t/s | 94 t/s | 36 t/s |

This regression is likely due to:

  1. Loss of persistent per-thread cache locality — the previous pthread-based version reused buffers effectively across tiles.
  2. Higher memory initialization or shared buffer contention across threads.

I have also tried static allocation on the stack with just this code, but it suffers from a similar perf loss (38 t/s):
vec_t A_pack[mc*kc*2];
vec_t B_pack[nc*kc*2];
int comparray[mc*kc];

Can you please suggest?

Member


void matmul_tiled()

Your wdata pointer has to take into account the thread ID. For example:

float * wdata = (float *) params->wdata + (ne00 + CACHE_LINE_SIZE_F32) * ith;

From what I can tell from your codeblock, all threads are currently working on the same wdata range and most likely they are data racing. Unless I'm missing something 🤔
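The suggested fix can be sketched as a per-thread carve-out (hypothetical helper name, using the stride-plus-alignment-slack pattern so aligned slices never spill into a neighbour's range):

```cpp
#include <cstddef>
#include <cstdint>

// Each thread must index a disjoint slice of the shared wdata buffer,
// otherwise all threads pack tiles into the same bytes and data-race.
// per_thread is one thread's scratch size; align is the cache-line size.
inline uint8_t * thread_slice(void * wdata, size_t per_thread,
                              size_t align, int ith) {
    // stride includes (align - 1) slack so rounding the slice start up
    // to the next alignment boundary stays inside this thread's range
    uint8_t * base = static_cast<uint8_t *>(wdata)
                   + (per_thread + align - 1) * (size_t) ith;
    uintptr_t p = (reinterpret_cast<uintptr_t>(base) + align - 1)
                & ~(uintptr_t)(align - 1);
    return reinterpret_cast<uint8_t *>(p);
}
```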

Contributor Author

@shalinib-ibm shalinib-ibm Nov 12, 2025


@ggerganov Thank you so much for the input. I have implemented the approach you suggested and found that we get the best performance only with pthread-based dynamic memory allocation. Here is the code and the results.

void matmul_tiled(const struct ggml_compute_params * params, int64_t m, int64_t n, int64_t mc, int64_t nc, int64_t kc) {
    const int ith = params->ith;
    const int nth = params->nth;

    const int64_t TILE_SIZE = 64;
    const size_t vec_t_sz = 16;
    const size_t int_sz   = 4;
    const size_t align    = (size_t) GGML_CACHE_LINE_SIZE;

    const size_t A_raw_bytes = (size_t) TILE_SIZE * (size_t) TILE_SIZE * 2u * vec_t_sz;
    const size_t B_raw_bytes = (size_t) TILE_SIZE * (size_t) TILE_SIZE * 2u * vec_t_sz;
    const size_t C_raw_bytes = (size_t) TILE_SIZE * (size_t) TILE_SIZE * int_sz;

    const size_t A_aligned = GGML_PAD(A_raw_bytes, align);
    const size_t B_aligned = GGML_PAD(B_raw_bytes, align);
    const size_t C_aligned = GGML_PAD(C_raw_bytes, align);

    const size_t S_PER_THREAD_MAX = GGML_PAD(A_aligned + B_aligned + C_aligned, align);

    uint8_t * base_u8 = reinterpret_cast<uint8_t *>(params->wdata);
    uint8_t * thread_base_unaligned = base_u8 + (S_PER_THREAD_MAX + (align - 1)) * (size_t) ith;
    uint8_t * p = (uint8_t *) GGML_PAD((uintptr_t) thread_base_unaligned, align);

    vec_t * A_pack = reinterpret_cast<vec_t *>(p);
    p += A_aligned;
    vec_t * B_pack = reinterpret_cast<vec_t *>(p);
    p += B_aligned;
    int * comparray = reinterpret_cast<int *>(p);

    constexpr bool is_Ablock_q4 = std::is_same_v<TA, block_q4_0>;

    int64_t ytiles = m / mc;
    int64_t xtiles = n / nc;
    int64_t tiles  = xtiles * ytiles;
    int64_t duty   = (tiles + nth - 1) / nth;
    int64_t start  = duty * ith;
    int64_t end    = start + duty;
    if (end > tiles) {
        end = tiles;
    }

    for (int64_t job = start; job < end; ++job) {
        int64_t ii = (job / xtiles) * mc;
        int64_t jj = (job % xtiles) * nc;
        for (int64_t kk = 0; kk < k; kk += kc) {
            if constexpr (is_Ablock_q4) {
                packNormalInt4_large(A + ii*lda + kk, lda, mc, 4, (int8_t *) A_pack, comparray);
            } else {
                packNormal_large<int8_t, vector signed char>(A + ii*lda + kk, lda, mc, 8, (int8_t *) A_pack, false, comparray);
            }
            packNormal_large<uint8_t, vector unsigned char>(B + jj*ldb + kk, ldb, nc, 8, (uint8_t *) B_pack, true);
            KERNEL_Q0(ii, jj, mc, nc, kc, kk, A_pack, B_pack, comparray);
        }
    }
}
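The job partitioning in the code above is a plain ceil-divide of output tiles across threads; extracted for illustration (hypothetical `Range`/`tile_range` names):

```cpp
#include <cstdint>

// Static partition of `tiles` jobs across `nth` threads, as in matmul_tiled:
// thread `ith` handles jobs [start, end). The last thread may get fewer jobs,
// and a thread whose range is empty (start >= end) simply does no work.
struct Range { int64_t start, end; };

inline Range tile_range(int64_t tiles, int nth, int ith) {
    const int64_t duty  = (tiles + nth - 1) / nth;  // ceil(tiles / nth)
    const int64_t start = duty * ith;
    int64_t end = start + duty;
    if (end > tiles) {
        end = tiles;
    }
    return {start, end};
}
```

For example, 10 tiles over 4 threads gives ranges [0,3), [3,6), [6,9), [9,10) — every tile is claimed exactly once with no synchronization needed.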

Summary of Thread Model Performance Evaluation (Power10)

We compared three builds of llama.cpp on Power10 for the same configuration (Meta-Llama-3-8B Q4_0, 20 threads, prompt 128, 1 token with llama-bench):

| Build Type | pp128 t/s | Cycles (↓ better) | IPC | Elapsed Time (s) |
| --- | --- | --- | --- | --- |
| Base (upstream) | 68.08 | 841B | 2.56 | 13.34 |
| GGML thread patch | 52.32 | 1076B | 1.59 | 16.60 |
| Pthread-based patch | 84.27 | 625B | 2.46 | 11.04 |

Observations

  • The ggml-thread patch shows a ~25% regression vs base (pp128: 68 → 52 t/s) and a ~28% increase in total cycles, indicating higher synchronization or scheduling overhead.
  • The pthread-based version outperforms both:
    • +24% faster than base for pp128 (84.27 vs 68.08 t/s),
    • ~34% fewer cycles and ~17% lower elapsed time (11.0s vs 13.3s),
    • IPC and cache behavior remain healthy and consistent.

Given these results:

  • The params->wdata approach adds noticeable overhead on Power10.
  • The pthread-based implementation provides clear performance benefits and better scaling with available cores.

Contributor Author


Hi @ggerganov
I’ve added a note explaining that the params->wdata approach didn’t provide benefits.
When you have some time, could you please take another look at the patch?
Thank you!

Contributor Author


Removed dynamic memory allocation and uploaded new code.

Member


@shalinib-ibm For my understanding, is the performance now as good as with the previous approach?

Contributor Author

@shalinib-ibm shalinib-ibm Dec 3, 2025


@ggerganov Thank you for your time.
Pthread-based dynamic memory allocation using malloc was giving better perf by ~10%. Below are the results.
./build_base/bin/llama-bench -m /home/shalini/Models/Meta-Llama-3-8B/ggml-model-q4.gguf -p 16,32,64,128,256 -n 1

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp16 | 37.81 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp32 | 38.89 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp64 | 39.05 ± 0.18 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp128 | 38.52 ± 0.09 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp256 | 37.86 ± 0.02 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg1 | 13.59 ± 0.02 |

build: 0bcb40b (6833)
./build_dynamic_alloc/bin/llama-bench -m /home/shalini/Models/Meta-Llama-3-8B/ggml-model-q4.gguf -p 16,32,64,128,256 -n 1

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp16 | 37.87 ± 0.03 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp32 | 38.93 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp64 | 53.33 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp128 | 54.35 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp256 | 52.98 ± 0.00 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg1 | 13.60 ± 0.02 |

build: 2e669d22d (6834)

./build_patch/bin/llama-bench -m /home/shalini/Models/Meta-Llama-3-8B/ggml-model-q4.gguf -p 16,32,64,128,256 -n 1

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp16 | 40.88 ± 0.02 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp32 | 44.79 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp64 | 47.14 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp128 | 48.24 ± 0.02 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp256 | 47.15 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg1 | 13.85 ± 0.01 |

build: 962d4e985 (6836)

@shalinib-ibm shalinib-ibm force-pushed the q8_q4_opt branch 4 times, most recently from 0353908 to c33dffb on December 2, 2025 15:01
@shalinib-ibm
Contributor Author

@ggerganov I have addressed your review comment. This patch does not use dynamic allocation now.
Kindly review the patch. Thanks in advance for your time.

@shalinib-ibm
Contributor Author

@taronaeo Can you please review this patch?

@taronaeo
Collaborator

taronaeo commented Dec 3, 2025

@taronaeo Can you please review this patch?

I can review this tomorrow, and can only review it at a high-level. I have no PPC hardware to test this on.

@shalinib-ibm
Contributor Author

@taronaeo Can you please review this patch?

I can review this tomorrow, and can only review it at a high-level. I have no PPC hardware to test this on.
@taronaeo These are llama-bench results for the Meta-Llama-3-8B Q4 model and the granite-4.0-h-micro Q8 model on a ppc64le Linux Power10 box.
Similar gains are observed with llama-batched-bench as well.

./build_base/bin/llama-bench -m /home/shalini/Models/granite-4.0-h-micro-Q8_0.gguf -p 16,32,64,128,256 -n 1

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp16 | 70.53 ± 0.12 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp32 | 74.75 ± 0.83 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp64 | 75.59 ± 0.28 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp128 | 75.43 ± 0.32 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp256 | 75.24 ± 0.18 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | tg1 | 18.80 ± 0.04 |

build: 0bcb40b (6833)
./build_patch_8cc/bin/llama-bench -m /home/shalini/Models/granite-4.0-h-micro-Q8_0.gguf -p 16,32,64,128,256 -n 1

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp16 | 77.21 ± 0.22 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp32 | 86.02 ± 0.09 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp64 | 90.40 ± 0.04 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp128 | 96.08 ± 0.06 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp256 | 98.53 ± 0.01 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | tg1 | 18.81 ± 0.03 |

build: 962d4e985 (6836)
./build_base/bin/llama-bench -m /home/shalini/Models/Meta-Llama-3-8B/ggml-model-q4.gguf -p 16,32,64,128,256 -n 1

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp16 | 37.83 ± 0.02 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp32 | 38.90 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp64 | 39.11 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp128 | 38.83 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp256 | 37.84 ± 0.00 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg1 | 13.61 ± 0.02 |

build: 0bcb40b (6833)
./build_patch_8cc/bin/llama-bench -m /home/shalini/Models/Meta-Llama-3-8B/ggml-model-q4.gguf -p 16,32,64,128,256 -n 1

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp16 | 40.81 ± 0.03 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp32 | 44.73 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp64 | 47.09 ± 0.01 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp128 | 48.20 ± 0.04 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp256 | 47.10 ± 0.00 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg1 | 13.84 ± 0.02 |

build: 962d4e985 (6836)

./llama-cli -p 'please write a python program to print nth fibonacci number' -n 128 -no-cnv
gives 27 t/s for PP on the patch, while base gives 23 t/s for PP.

This commit addresses review comments.
Also, we have separated out the legacy mnpack path
and the matmul_tiled path for the tinyBLAS_Q0_PPC class.

10 ~ 30% improvement in PP speed with Q4_0 and Q8_0 models.
Tested with Meta-Llama3-8B quantized models with llama-bench,
llama-batched-bench.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
Member

@ggerganov ggerganov left a comment


Make sure you ran perplexity calculations to verify the results are good. Wait for @taronaeo's review

@shalinib-ibm
Contributor Author

Make sure you ran perplexity calculations to verify the results are good.

Thank you @ggerganov. Here are the perplexity results run with the Meta-Llama-3-8B Q4 model.

yes "The quick brown fox jumps over the lazy dog." | head -n 300 > fox.txt

BASE
perplexity: tokenizing the input ..
perplexity: tokenization took 1.156 ms
perplexity: calculating perplexity over 5 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 60.34 seconds per pass - ETA 1.25 minutes
[1]1.0092,[2]1.0089,[3]1.0091,[4]1.0092,[5]1.0086,
Final estimate: PPL = 1.0086 +/- 0.00053

llama_perf_context_print: load time = 469.24 ms
llama_perf_context_print: prompt eval time = 75470.10 ms / 2560 tokens ( 29.48 ms per token, 33.92 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 75542.77 ms / 2561 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Host | 4952 = 4437 + 256 + 258 |

PATCH:
perplexity: tokenizing the input ..
perplexity: tokenization took 1.163 ms
perplexity: calculating perplexity over 5 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 49.12 seconds per pass - ETA 1.02 minutes
[1]1.0092,[2]1.0089,[3]1.0090,[4]1.0091,[5]1.0085,
Final estimate: PPL = 1.0085 +/- 0.00053

llama_perf_context_print: load time = 469.40 ms
llama_perf_context_print: prompt eval time = 61442.70 ms / 2560 tokens ( 24.00 ms per token, 41.66 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 61515.70 ms / 2561 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Host | 4952 = 4437 + 256 + 258 |

Collaborator

@taronaeo taronaeo left a comment


LGTM. Just a small spacing fix and we're good to push after CI goes green.

small spacing fix

Co-authored-by: Aaron Teo <taronaeo@gmail.com>
@shalinib-ibm
Contributor Author

@taronaeo Test failed with the below error (not related to the patch, I guess). Can you please check when free?

image

@taronaeo
Collaborator

taronaeo commented Dec 5, 2025

@taronaeo Test failed with below error ( not related to the patch I guess). Can you please check when free ?

image

It's unrelated to this PR. Will continue to merge with master :)

@taronaeo taronaeo merged commit 3a0d105 into ggml-org:master Dec 5, 2025
87 of 91 checks passed
JayZenith pushed a commit to JayZenith/llama.cpp that referenced this pull request Dec 7, 2025
0Marble pushed a commit to 0Marble/llama.cpp that referenced this pull request Dec 18, 2025
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026

Labels

ggml changes relating to the ggml tensor library for machine learning

3 participants