Metal TQ2_0 by dmahurin · Pull Request #12485 · ggml-org/llama.cpp

dmahurin · 2025-03-20T21:51:24Z

Support for TQ2_0 on Metal.

This is a commit by @compilade from last year, re-applied to current master.

Run with:

llama-cli -m "$(huggingface-cli download basavyr/TriLM_3.9B_Unpacked_quantized TriLM_3.9B_Unpacked_quant_TQ2_0.gguf)" -p The

llama-cli -m "$(huggingface-cli download brunopio/Llama3-8B-1.58-100B-tokens-GGUF Llama3-8B-1.58-100B-tokens-TQ2_0.gguf)"

The model runs, and the output seems similar to that of the original commit by @compilade.

However, the output quality is not great compared to 4-bit Llama 8B. Perhaps someone can compare it against the non-Metal result.
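One way to approach such a comparison is a small CPU reference for a ternary dot product, whose output can be checked against what a GPU kernel produces for the same inputs. The sketch below is only an illustration: the helper names pack_ternary and dot_ternary are hypothetical, and the packing (four 2-bit codes per byte, in consecutive element order) is an assumed simplification, not necessarily the element order ggml uses inside block_tq2_0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified 2-bit ternary packing: element i lives in byte i/4 at bit
// position 2*(i%4). Stored codes 0, 1, 2 stand for weights -1, 0, +1.
// ASSUMPTION: this element order is for illustration only; the real
// block_tq2_0 layout may order elements differently inside a block.
std::vector<uint8_t> pack_ternary(const std::vector<int> & w) {
    std::vector<uint8_t> q((w.size() + 3) / 4, 0);
    for (size_t i = 0; i < w.size(); ++i) {
        assert(w[i] >= -1 && w[i] <= 1);
        q[i / 4] |= (uint8_t) ((w[i] + 1) << (2 * (i % 4)));
    }
    return q;
}

// Reference dot product of packed ternary weights (shared scale d)
// with float activations y.
float dot_ternary(const std::vector<uint8_t> & q, float d, const std::vector<float> & y) {
    assert(q.size() * 4 >= y.size());
    float sum = 0.0f;
    for (size_t i = 0; i < y.size(); ++i) {
        const int code = (q[i / 4] >> (2 * (i % 4))) & 3;
        sum += (float) (code - 1) * y[i];
    }
    return d * sum;
}
```

Because the weights are only -1, 0, or +1, the inner loop reduces to additions and subtractions, with the scale applied once at the end.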

Mostly adapted from the IQ2_TN kernels from ikawrakow/ik_llama.cpp#13 which were themselves adapted from the Q2_K kernels.

…nto structs (ggml-org#10238)'

ggerganov · 2025-03-21T09:09:34Z

A few updates, made just by pattern matching against the Q2_K kernel:

diff --git a/ggml/src/ggml-metal/ggml-metal.metal b/ggml/src/ggml-metal/ggml-metal.metal
index 8ac60744..a068e84c 100644
--- a/ggml/src/ggml-metal/ggml-metal.metal
+++ b/ggml/src/ggml-metal/ggml-metal.metal
@@ -5075,15 +5075,15 @@ void kernel_mul_mv_tq2_0_f32_impl(
     const int im = tgpig.z;
 
     const int first_row = (r0 * N_SIMDGROUP + sgitg) * N_DST;
-    const int ib_row = first_row * nb;
 
     const uint i12 = im%args.ne12;
     const uint i13 = im/args.ne12;
 
-    const uint offset0 = (i12/args.r2)*(nb*args.ne01) + (i13/args.r3)*(nb*args.ne01*args.ne02);
+    const uint64_t offset0 = first_row*args.nb01 + (i12/args.r2)*args.nb02 + (i13/args.r3)*args.nb03;
+    const uint64_t offset1 =        r1*args.nb11 + (i12        )*args.nb12 + (i13        )*args.nb13;
 
-    device const block_tq2_0 * x = (device const block_tq2_0 *) src0 + ib_row + offset0;
-    device const float       * y = (device const float       *) src1 + r1*args.ne10 + im*args.ne00*args.ne1;
+    device const block_tq2_0 * x = (device const block_tq2_0 *) (src0 + offset0);
+    device const float       * y = (device const float       *) (src1 + offset1);
 
     float yl[32];
     float sumf[N_DST]={0.f}, all_sum;
@@ -5139,7 +5139,7 @@ void kernel_mul_mv_tq2_0_f32_impl(
 
     device float * dst_f32 = (device float *) dst + (uint64_t)im*args.ne0*args.ne1 + (uint64_t)r1*args.ne0;
 
-    for (int row = 0; row < N_DST; ++row) {
+    for (int row = 0; row < N_DST && first_row + row < args.ne0; ++row) {
         all_sum = simd_sum(sumf[row]);
         if (tiisg == 0) {
             dst_f32[first_row + row] = all_sum;

I haven't run any tests, so please double-check that this makes sense.
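The effect of the added loop guard can be seen in a CPU-side sketch. It assumes the guard exists to avoid writing past the last output row when the row count is not a multiple of N_DST; the function write_rows and its parameters are hypothetical names, not part of the kernel. In the diff, first_row depends only on r0 and sgitg, so the condition is the same for every thread of a simdgroup.

```cpp
#include <cassert>
#include <vector>

// Emulates the guarded write-back loop for one simdgroup on the CPU.
// sums.size() plays the role of N_DST, ne0 is the number of output rows.
// Returns how many rows were actually written to dst.
int write_rows(std::vector<float> & dst, const std::vector<float> & sums,
               int first_row, int ne0) {
    assert((int) dst.size() >= ne0);
    const int n_dst = (int) sums.size();
    int written = 0;
    // Same condition as in the diff: stop at N_DST rows or at the last row.
    for (int row = 0; row < n_dst && first_row + row < ne0; ++row) {
        dst[first_row + row] = sums[row];
        ++written;
    }
    return written;
}
```

With 10 output rows and 4 rows per simdgroup, the group starting at row 8 writes only rows 8 and 9; without the guard it would also touch indices 10 and 11.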

ikawrakow · 2025-03-21T10:56:03Z

> This is a commit by @compilade from last year, re-applied to current.

The Metal implementation actually came from here. See this comment by @compilade.

dmahurin · 2025-03-30T14:16:13Z

Hi @ggerganov

I updated the branch with your changes.
The changes compile and run, but I could not tell whether the result is worse or equivalent.

I understand replacing ib_row with first_row * nb and adding the bounds check,
but the other changes are less obvious (to me). Could you perhaps explain them?

-    const uint64_t offset0 = first_row * nb + (i12/args.r2)*(nb*args.ne01) + (i13/args.r3)*(nb*args.ne01*args.ne02);
-    const uint64_t offset1 = r1*args.ne10 + im*args.ne00*args.ne1;
+    const uint64_t offset0 = first_row*args.nb01 + (i12/args.r2)*args.nb02 + (i13/args.r3)*args.nb03;
+    const uint64_t offset1 =        r1*args.nb11 + (i12        )*args.nb12 + (i13        )*args.nb13;
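One possible reading of the offset change, not confirmed in this thread, is that the old form counts blocks and assumes a densely packed tensor, while the new form uses the byte strides (nb01, nb02, nb03) and is applied to a byte pointer before the cast. The sketch below compares the two on the CPU. The struct Src0, the helper names, and the 66-byte block size are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tensor description: ne* are counts, nb* are byte strides.
struct Src0 {
    uint64_t block_bytes;       // bytes per quantized block (assumed value)
    uint64_t nb;                // blocks per row
    uint64_t ne01, ne02;        // rows per matrix, matrices along dim 2
    uint64_t nb01, nb02, nb03;  // byte strides for dims 1, 2, 3
};

// Fill in the byte strides of a densely packed (contiguous) tensor.
Src0 make_contiguous(uint64_t block_bytes, uint64_t nb, uint64_t ne01, uint64_t ne02) {
    Src0 t{block_bytes, nb, ne01, ne02, 0, 0, 0};
    t.nb01 = nb * block_bytes;
    t.nb02 = t.nb01 * ne01;
    t.nb03 = t.nb02 * ne02;
    return t;
}

// Old form: offset counted in blocks, converted to bytes here.
uint64_t old_offset_bytes(const Src0 & t, uint64_t first_row, uint64_t i02, uint64_t i03) {
    const uint64_t blocks = first_row * t.nb + i02 * (t.nb * t.ne01)
                          + i03 * (t.nb * t.ne01 * t.ne02);
    return blocks * t.block_bytes;
}

// New form: offset computed directly from the byte strides.
uint64_t new_offset_bytes(const Src0 & t, uint64_t first_row, uint64_t i02, uint64_t i03) {
    return first_row * t.nb01 + i02 * t.nb02 + i03 * t.nb03;
}
```

For a contiguous tensor the two forms agree. If a stride carries padding or the tensor is a view, only the stride-based form follows the actual memory layout. The same reasoning would apply to offset1, where the old expression counts float elements of src1 and the new one uses nb11, nb12, and nb13.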

compilade and others added 2 commits March 20, 2025 14:24

metal : support TQ2_0

613a7c5

Mostly adapted from the IQ2_TN kernels from ikawrakow/ik_llama.cpp#13 which were themselves adapted from the Q2_K kernels.

metal: For TQ2_0, Apply changes from: 'metal : refactor kernel args i…

f12f803

…nto structs (ggml-org#10238)'

github-actions bot added the "ggml" (changes relating to the ggml tensor library for machine learning) and "Apple Metal" (https://en.wikipedia.org/wiki/Metal_(API)) labels on Mar 20, 2025

A few updates by just pattern matching with the Q2_K kernel

f90131e

compilade mentioned this pull request Mar 25, 2025

ggml-quants : weighted rounding algorithms with cumulative search #12557

Draft

dmahurin wants to merge 3 commits into ggml-org:master from dmahurin:metal-tq2_0
