feat: CUDA port of TurboQuant3 KV cache — 3.47x compression, 98.5% of F16 decode speed on RTX 5090 by signalnine · Pull Request #3 · TheTom/llama-cpp-turboquant

signalnine · 2026-03-27T00:28:51Z

Summary

This PR ports TurboQuant3 (turbo3) KV cache compression to CUDA, targeting SM 12.0 (RTX 5090 / Blackwell) with near-parity decode performance vs F16.

What's included

CUDA kernel port (set-rows.cu, turbo-quant.cuh, turbo-wht.cu/cuh):

k_set_rows_turbo3: quantises incoming F32 KV tokens into turbo3 blocks on GPU
Fully parallel design: one block per 128-element group, 128 threads/block
WHT via shared-memory butterfly (7 stages), L2 norm via warp reduction
Bit packing with __shfl_sync (qs) and __ballot_sync (signs) — no atomics
Reconstruction norm corrected for quantisation error before writing

Flash attention integration (fattn-common.cuh, fattn-vec.cuh, fattn.cu):

vec_dot_fattn_vec_KQ_turbo3_0: optimised KQ dot product — elem0/elem1 always share the same turbo3 block, so qs/signs loaded once per pair instead of twice
dequantize_V_turbo3_0: ne==4 fast path — single qs byte + single signs byte covers all 4 elements; unrolled float2/half2 output
Routes decode (Q→ne[1] ≤ 2) through VEC flash attention kernel on Ada/Blackwell (CC ≥ 890)

Bug fix — VEC kernel Q/K stride mismatch:

vec_dot_fattn_vec_KQ_turbo3_0 originally stepped the outer K loop by nthreads (=8) but Q registers are loaded in blocks of nthreads * cpy_ne (=32). Thread t at step s accessed K element 16s+2t but Q element 64*(s/4)+8t+2*(s%4) — matching only when t=0, s%4=0. Fixed by matching the f16 kernel: step by nthreads*cpy_ne, add inner k_KQ_1 loop. The MMA kernel (prefill) was unaffected.

Quality gate / auto-enable (llama.cpp, ggml-cuda.cu):

Flash attention auto-enabled when turbo cache types are detected
ggml_context overflow fix for large KV cache allocations

Benchmark results (Qwen3.5 35B A3B Q4_K_M, RTX 5090, tg128)

KV cache	Decode (t/s)	vs F16
F16	95.4	1.00×
q8_0	95.7	1.00×
turbo3	94.0	0.985×

Memory: 3.47× compression vs F16 (3-bit vs 16-bit KV cache)

NIAH results (single-needle, Kamradt/RULER methodology)

Depth positions 0–100% at each context length. Score = correct needle retrievals / 11.

Depth	4K	8K	16K	32K	64K	128K	256K	1024K
0%	✅	✅	✅	✅	✅	✅	✅	✅
10%	✅	✅	✅	✅	✅	✅	✅	✅
20%	✅	✅	✅	✅	✅	❌	✅	✅
30%	✅	✅	✅	✅	✅	✅	✅	✅
40%	✅	✅	✅	✅	✅	✅	✅	✅
50%	✅	✅	✅	✅	✅	✅	✅	✅
60%	✅	✅	✅	✅	✅	✅	✅	✅
70%	✅	✅	✅	✅	✅	✅	✅	✅
80%	✅	✅	✅	✅	✅	✅	✅	✅
90%	✅	✅	✅	✅	✅	✅	✅	✅
100%	✅	✅	✅	✅	✅	✅	✅	✅
Score	11/11	11/11	11/11	11/11	11/11	10/11	11/11	11/11

q8_0 baseline: 11/11 at 4K–64K, 10/11 at 128K, 10/11 at 256K. turbo3 matches or exceeds q8_0 at every length. The single miss at 128K (depth 20%) matches the q8_0 miss — not a turbo3 regression.

1M context tested with YaRN 4× rope scaling (--rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144). q8_0 cannot fit at 1M alongside the model on a 32 GB card; turbo3 KV (~4.5 GB) fits with ~4 GB to spare.

Key design choices

Group size = 128 (one WHT per head-dim for typical 128-dim heads), matching the Python reference
Norm correction: stores grp_norm / recon_norm (not raw grp_norm) in the half-precision norm field so dequant is a single multiply
__launch_bounds__(128) on the quantisation kernel prevents spilling with the large shared memory footprint

🤖 Generated with Claude Code

@short

Ports the Metal turbo3 implementation to NVIDIA CUDA. End-to-end working on RTX 5090 with Qwen3.5 35B A3B Q4_K_M: 3.47x KV compression at 32K context, ~4x max context extension (256K → 1M tokens on 32GB VRAM). New files: - ggml-cuda/turbo-quant.cuh — block_turbo3_0 layout, WHT sign arrays, 3-bit centroid LUT, dequant helpers, quantize kernel (set_rows path) - ggml-cuda/turbo-wht.cu/.cuh — GGML_OP_TURBO_WHT CUDA kernel; 128-thread blocks, in-place butterfly in shared memory, forward + inverse WHT via compile-time template - ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-turbo3_0.cu — VEC flash-attention instance for D=64/128/256 Modified files: - dequantize.cuh — dequantize_turbo3_0 (produces float2 pairs) - convert.cu — all 5 to-fp16/fp32 dispatchers - fattn-common.cuh — vec_dot_fattn_vec_KQ_turbo3_0, dequantize_V_turbo3_0, dispatcher extensions - fattn-vec.cuh — turbo3 treated as unquantized (f16-style nthreads_KQ) - fattn.cu — route turbo3 exclusively to VEC kernel; add dispatch macro - set-rows.cu — k_set_rows_turbo3 kernel: per-128-elem group quantization with WHT rotation and norm correction - ggml-cuda.cu — supports_op + compute dispatch for TURBO_WHT + SET_ROWS - llama-kv-cache.cpp — +2 tensor overhead for rotation matrices Benchmark (RTX 5090, Qwen3.5 35B A3B Q4_K_M, FA on): KV memory @32k: 702 MiB (f16) → 202 MiB (turbo3) = 3.47x compression Max context: ~256K (f16) → ~1M (turbo3) = ~4x extension Decode @short: 233 t/s (q8_0) → 190 t/s (turbo3) = 0.82x Prefill @32k: 6335 t/s (q8_0) → 1215 t/s (turbo3) = 0.19x Note: prefill speed degrades significantly vs Metal (Metal: 0.99x q8_0 at all contexts; CUDA: 0.19x at 32K). Root cause: turbo3 currently uses the VEC flash-attention kernel; q8_0 uses the more efficient TILE/MMA kernels at long context. TILE/MMA support for turbo3 is the next milestone. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove the 5-line early-return that forced turbo3 onto the VEC flash attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1) via the existing dispatch logic, but prefill now goes through the Turing MMA kernel (RTX 5090 is SM 12.0 >> 7.5). launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 — added in the original CUDA port commit — provide that conversion automatically. Stride recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). ✓ Before (VEC only): After (MMA for prefill): 2K prefill: 5032 t/s (0.73× q8_0) 6734 t/s (0.98× q8_0) 8K prefill: 3110 t/s (0.46× q8_0) 6613 t/s (0.98× q8_0) 32K prefill: 1215 t/s (0.19× q8_0) 6168 t/s (0.97× q8_0) Matches Metal M5 Max result (0.99× q8_0 flat across all context sizes). Decode unchanged (VEC, ~0.64-0.82× q8_0 depending on context depth). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…e (71→94 t/s, 98.5% of F16) k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel gave 3.1% GPU utilisation (36.5 µs × 80 calls/token = ~21% of decode budget). Replace with a fully parallel kernel — 1 block per 128-element group, 128 threads per block (one thread per element): • Shared-memory WHT butterfly (7 stages, no atomics) • Warp-reduce L2 norm + inter-warp accumulate via smem • qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync • Reconstruction norm same pattern; one write per sub-block (warp lane 0) Also tighten flash-attention dequant paths (fattn-common.cuh): • vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same turbo3 block — load qs/signs once instead of twice per pair • dequantize_V_turbo3_0: ne==4 fast path — load one qs byte and one signs byte for all 4 elements; unrolled float2 / half2 output pairs Benchmark (Qwen3.5 35B, RTX 5090, tg128): Before: 71.86 t/s (0.75× q8_0) After: 94.04 t/s (0.985× q8_0, within measurement noise of parity) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed TheTom#3 (TURBO_D). TheTom#1 and TheTom#2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Complete experiment log: #1 4-mag LUT: 15.1 at 8K (BEST, +38%) #2 Batched extract: 13.7 (+25%) #3 Inline FA block: 13.5 (I-cache pressure) #4 Deferred norm: 12.9 (loses ILP) #5 2-pair half2: 12.0 (ternary overhead) #6 Select chain: 11.9 (branches kill) #7 Bit-arithmetic: 11.6 (ALU too heavy) #8 FMA branchless: 11.4 (ALU still too heavy) #9 Named-reg ternary: 10.3 (branches worst) #10 Main (8-LUT): 10.95 (baseline) #11 Non-vec FA: 10.2 (wrong kernel) Ceiling: 24.5 (no dequant) Apple8 hardware truth: 1 divergent constant read < 7 ALU ops (even with fma) Branches cost MORE than divergent constant reads Array indexing ALWAYS spills on Metal 4 constant addresses is the sweet spot The 4-mag LUT is the dequant-level ceiling on Apple Silicon. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai

signalnine · 2026-03-27T05:53:23Z

Looks like I've got some quality issues with larger context windows, converting to draft while I work those out.

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by `nthreads` (8), but the Q register file is loaded in blocks of `nthreads*cpy_ne` (32) elements per thread — the same pattern used by the f16/bf16 VEC kernels. This caused thread t>0 to pair K element (16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete index mismatch. Every generated token had garbage attention scores. Fix: match the f16 kernel pattern — step by nthreads*cpy_ne, add an inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as Q_v[k_KQ_0/nthreads + k_KQ_1]. Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp that documented the symptom of this bug. Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M: - PPL (wikitext-2): 6.2023 → 6.2996 (+1.57%, within 5% target) - NIAH: 11/11 at 4K–256K (matches q8_0; was 0/11 before fix) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

signalnine · 2026-03-27T17:08:22Z

Bug fix: VEC flash-attention Q/K stride mismatch (commit `4c91451`)

Root cause

vec_dot_fattn_vec_KQ_turbo3_0 in fattn-common.cuh had the wrong outer loop stride. It stepped k_KQ_0 by nthreads (8), but Q registers are loaded in blocks of nthreads*cpy_ne (32) elements per thread — the same pattern the f16/bf16 VEC kernels use.

This caused thread t > 0 to pair K element 16s + 2t against Q element 64*(s/4) + 8t + 2*(s%4). For example, thread 1 paired K[2] with Q[8] instead of Q[2]. The f16 kernel avoids this by stepping its outer loop by nthreads*cpy_ne and processing cpy_ne K elements per thread per iteration.

This kernel is used for all generation steps (n_tokens ≤ 2), so every generated token had garbage attention scores. Prefill (MMA kernel path) was unaffected.

Fix (3 lines changed)

// Before — wrong stride, Q/K indices misaligned for thread t > 0:
for (int k_KQ_0 = 0; k_KQ_0 < D/2; k_KQ_0 += nthreads) {
    const int k_KQ = k_KQ_0 + (threadIdx.x % nthreads);
    ...
    const float2 qv = ((const float2 *) Q_v)[k_KQ_0/nthreads];

// After — matches f16 kernel pattern:
constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
constexpr int cpy_ne = cpy_nb / 4;
for (int k_KQ_0 = 0; k_KQ_0 < D/2; k_KQ_0 += nthreads*cpy_ne) {
    for (int k_KQ_1 = 0; k_KQ_1 < cpy_ne; ++k_KQ_1) {
        const int k_KQ = k_KQ_0 + (threadIdx.x % nthreads)*cpy_ne + k_KQ_1;
        ...
        const float2 qv = ((const float2 *) Q_v)[k_KQ_0/nthreads + k_KQ_1];

Test results (RTX 5090, Qwen3.5-35B-A3B-Q4_K_M)

Perplexity (wikitext-2, 512 ctx):

	PPL
Baseline (no compression)	6.2023
turbo3 after fix	6.2996 (+1.57%)
turbo3 before fix	~23.5 (garbage)

NIAH single-needle (11 depth points, q8_0 vs turbo3):

	4K	8K	16K	32K	64K	128K	256K
q8_0	11/11	11/11	11/11	10/11	10/11	9/11	10/11
turbo3	11/11	11/11	11/11	10/11	9/11	10/11	11/11

Aggregate score is identical (turbo3 = q8_0). The few single-cell differences are in opposite directions and consistent with model-level retrieval variance, not KV compression degradation.

Speed (llama-bench):

	q8_0	turbo3	ratio
Prefill 4K	6947 t/s	6853 t/s	98.6%
Prefill 32K	6380 t/s	6301 t/s	98.8%
Prefill 128K	4731 t/s	4711 t/s	99.6%
Generation	218 t/s	207 t/s	95.0%

Prefill is at parity. Generation is ~5% slower due to the centroid lookup overhead — this model is compute-bound during decoding (Q4_K_M MoE weights dominate bandwidth), so the 3.47x smaller KV cache doesn't help here. On a weight-fp16 or large-batch workload the bandwidth savings would show.

dan-and · 2026-03-27T21:48:42Z

Thats interesting. I will give it a try. I gave the fork from Madreag/turbo3-cuda a run tonight and it had also issues at large context sizes.

q8

llama-benchy --base-url http://127.0.01:18080 --model llamacpp-model --depth 0 4096 8192 16384 204800  --tg 32 128 --latency-mode generation


| model          |             test |             t/s |     peak t/s |             ttfr (ms) |          est_ppt (ms) |         e2e_ttft (ms) |
|:---------------|-----------------:|----------------:|-------------:|----------------------:|----------------------:|----------------------:|
| llamacpp-model |           pp2048 | 1282.76 ± 39.17 |              |       1593.28 ± 30.46 |       1439.92 ± 30.46 |       1593.33 ± 30.46 |
| llamacpp-model |             tg32 |    63.62 ± 0.46 | 65.71 ± 0.48 |                       |                       |                       |
| llamacpp-model |           pp2048 | 1153.16 ± 50.08 |              |       1752.58 ± 46.67 |       1599.21 ± 46.67 |       1752.62 ± 46.67 |
| llamacpp-model |            tg128 |    62.00 ± 0.28 | 62.67 ± 0.47 |                       |                       |                       |
| llamacpp-model |   pp2048 @ d4096 | 1212.91 ± 52.05 |              |      4744.25 ± 242.83 |      4590.88 ± 242.83 |      4744.29 ± 242.83 |
| llamacpp-model |     tg32 @ d4096 |    60.08 ± 0.20 | 62.05 ± 0.20 |                       |                       |                       |
| llamacpp-model |   pp2048 @ d4096 | 1057.25 ± 18.41 |              |      5444.91 ± 207.05 |      5291.54 ± 207.05 |      5444.95 ± 207.05 |
| llamacpp-model |    tg128 @ d4096 |    58.47 ± 0.52 | 59.67 ± 0.47 |                       |                       |                       |
| llamacpp-model |   pp2048 @ d8192 |  988.52 ± 36.40 |              |      9567.42 ± 275.26 |      9414.05 ± 275.26 |      9567.46 ± 275.26 |
| llamacpp-model |     tg32 @ d8192 |    57.53 ± 0.88 | 59.45 ± 0.90 |                       |                       |                       |
| llamacpp-model |   pp2048 @ d8192 |  878.16 ± 20.97 |              |     10821.39 ± 312.45 |     10668.03 ± 312.45 |     10821.43 ± 312.45 |
| llamacpp-model |    tg128 @ d8192 |    55.67 ± 0.21 | 57.00 ± 0.00 |                       |                       |                       |
| llamacpp-model |  pp2048 @ d16384 |  797.96 ± 49.51 |              |    21042.87 ± 1204.39 |    20889.50 ± 1204.39 |    21042.91 ± 1204.39 |
| llamacpp-model |    tg32 @ d16384 |    53.09 ± 1.42 | 54.86 ± 1.47 |                       |                       |                       |
| llamacpp-model |  pp2048 @ d16384 |  661.82 ± 18.54 |              |     25294.80 ± 747.67 |     25141.44 ± 747.67 |     25294.84 ± 747.67 |
| llamacpp-model |   tg128 @ d16384 |    49.39 ± 0.13 | 51.00 ± 0.00 |                       |                       |                       |
| llamacpp-model | pp2048 @ d204800 |  281.94 ± 66.24 |              | 698075.03 ± 140499.74 | 697921.66 ± 140499.74 | 698075.09 ± 140499.73 |
| llamacpp-model |   tg32 @ d204800 |    29.33 ± 0.67 | 30.00 ± 0.82 |                       |                       |                       |
| llamacpp-model | pp2048 @ d204800 |   230.29 ± 2.67 |              |   814309.55 ± 9528.99 |   814156.18 ± 9528.99 |   814309.59 ± 9528.99 |
| llamacpp-model |  tg128 @ d204800 |    28.13 ± 0.06 | 29.00 ± 0.00 |                       |                       |                       |

llama-benchy (0.3.2.dev1+g17b42667a)
date: 2026-03-27 18:08:57 | latency mode: generation

CUDA_VISIBLE_DEVICES=0,1,2,3 build/bin/llama-server --webui-mcp-proxy --alias llamacpp-model -m ../models/Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --kv-unified -ctk q8_0 -ctv q8_0 --swa-full --presence-penalty 1.5 --repeat-penalty 1.0 --ctx-size 260000 -fa on --no-mmap --jinja --threads -1 --reasoning on --metrics --host 0.0.0.0 --port 18080 --alias llamacpp-model


llama_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 3080)   | 20054 = 3910 + (15352 = 12367 +     615 +    2370) +         792 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 3080)   | 20054 = 6064 + (13196 = 11228 +     868 +    1100) +         794 |
llama_memory_breakdown_print: |   - CUDA2 (RTX 3080)   | 20054 = 6314 + (12947 = 11240 +     606 +    1100) +         793 |
llama_memory_breakdown_print: |   - CUDA3 (RTX 3080)   | 20054 = 6260 + (13001 = 10616 +     859 +    1525) +         793 |
llama_memory_breakdown_print: |   - Host               |                  3010 =   970 +       0 +    2040                |

Madreag/turbo3-cuda turboquant

CUDA_VISIBLE_DEVICES=0,1,2,3 build/bin/llama-server --webui-mcp-proxy --alias llamacpp-model -m ../models/Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --kv-unified -ctk turbo3 -ctv turbo3 --swa-full --presence-penalty 1.5 --repeat-penalty 1.0 --ctx-size 260000 -fa on --no-mmap --jinja --threads -1 --reasoning on --metrics --host 0.0.0.0 --port 18080 --alias llamacpp-model

| model          |             test |             t/s |     peak t/s |             ttfr (ms) |          est_ppt (ms) |         e2e_ttft (ms) |
|:---------------|-----------------:|----------------:|-------------:|----------------------:|----------------------:|----------------------:|
| llamacpp-model |           pp2048 | 1287.54 ± 66.23 |              |       1528.95 ± 58.07 |       1434.45 ± 58.07 |       1528.99 ± 58.07 |
| llamacpp-model |             tg32 |    53.96 ± 1.86 | 55.72 ± 1.91 |                       |                       |                       |
| llamacpp-model |           pp2048 | 1106.94 ± 53.64 |              |       1744.38 ± 48.83 |       1649.88 ± 48.83 |       1744.42 ± 48.82 |
| llamacpp-model |            tg128 |    48.93 ± 0.09 | 50.00 ± 0.00 |                       |                       |                       |
| llamacpp-model |   pp2048 @ d4096 | 1189.36 ± 46.89 |              |      4849.93 ± 249.32 |      4755.43 ± 249.32 |      4849.97 ± 249.32 |
| llamacpp-model |     tg32 @ d4096 |    43.29 ± 1.98 | 44.70 ± 2.04 |                       |                       |                       |
| llamacpp-model |   pp2048 @ d4096 | 1040.50 ± 22.35 |              |       5517.25 ± 61.20 |       5422.75 ± 61.20 |       5517.29 ± 61.20 |
| llamacpp-model |    tg128 @ d4096 |    36.40 ± 0.05 | 38.00 ± 0.00 |                       |                       |                       |
| llamacpp-model |   pp2048 @ d8192 |  981.78 ± 33.00 |              |      9510.98 ± 489.26 |      9416.48 ± 489.26 |      9511.02 ± 489.26 |
| llamacpp-model |     tg32 @ d8192 |    33.75 ± 1.54 | 34.85 ± 1.59 |                       |                       |                       |
| llamacpp-model |   pp2048 @ d8192 |  873.38 ± 19.11 |              |     10607.80 ± 207.11 |     10513.30 ± 207.11 |     10607.85 ± 207.11 |
| llamacpp-model |    tg128 @ d8192 |    28.35 ± 0.55 | 30.33 ± 0.47 |                       |                       |                       |
| llamacpp-model |  pp2048 @ d16384 |  799.01 ± 50.18 |              |    20877.31 ± 1192.24 |    20782.81 ± 1192.24 |    20877.35 ± 1192.24 |
| llamacpp-model |    tg32 @ d16384 |    25.43 ± 1.50 | 26.00 ± 1.63 |                       |                       |                       |
| llamacpp-model |  pp2048 @ d16384 |  663.46 ± 16.78 |              |     25337.12 ± 764.88 |     25242.63 ± 764.88 |     25337.17 ± 764.89 |
| llamacpp-model |   tg128 @ d16384 |    19.75 ± 0.32 | 21.33 ± 0.47 |                       |                       |                       |
| llamacpp-model | pp2048 @ d204800 |  281.80 ± 66.36 |              | 698821.15 ± 142220.41 | 698726.65 ± 142220.41 | 698821.21 ± 142220.39 |
| llamacpp-model |   tg32 @ d204800 |     5.36 ± 0.27 |  6.00 ± 0.00 |                       |                       |                       |
| llamacpp-model | pp2048 @ d204800 |   235.28 ± 2.53 |              |   796568.05 ± 8491.18 |   796473.55 ± 8491.18 |   796568.10 ± 8491.18 |
| llamacpp-model |  tg128 @ d204800 |     5.38 ± 0.23 |  6.00 ± 0.00 |                       |                       |                       |

llama-benchy (0.3.2.dev1+g17b42667a)
date: 2026-03-27 19:55:00 | latency mode: generation

llama_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 3080)   | 20054 = 4218 + (15042 = 12367 +     297 +    2378) +         794 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 3080)   | 20054 = 6540 + (12720 = 11228 +     392 +    1100) +         794 |
llama_memory_breakdown_print: |   - CUDA2 (RTX 3080)   | 20054 = 6630 + (12629 = 11240 +     289 +    1100) +         795 |
llama_memory_breakdown_print: |   - CUDA3 (RTX 3080)   | 20054 = 6736 + (12525 = 10616 +     383 +    1525) +         793 |
llama_memory_breakdown_print: |   - Host               |                  3010 =   970 +       0 +    2040                |

signalnine/llama-cpp-turboquant (this PR)

CUDA_VISIBLE_DEVICES=0,1,2,3 build/bin/llama-server --webui-mcp-proxy --alias llamacpp-model -m ../models/Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --kv-unified -ctk turbo3 -ctv turbo3 --swa-full --presence-penalty 1.5 --repeat-penalty 1.0 --ctx-size 260000 -fa on --no-mmap --jinja --threads -1 --reasoning on --metrics --host 0.0.0.0 --port 18080 --alias llamacpp-model


| model          |             test |             t/s |     peak t/s |             ttfr (ms) |          est_ppt (ms) |         e2e_ttft (ms) |
|:---------------|-----------------:|----------------:|-------------:|----------------------:|----------------------:|----------------------:|
| llamacpp-model |           pp2048 | 1310.20 ± 46.15 |              |      1516.46 ± 107.34 |      1423.41 ± 107.34 |      1516.51 ± 107.34 |
| llamacpp-model |             tg32 |    62.30 ± 1.15 | 64.33 ± 1.19 |                       |                       |                       |
| llamacpp-model |           pp2048 | 1121.88 ± 48.05 |              |       1754.32 ± 36.79 |       1661.27 ± 36.79 |       1754.36 ± 36.79 |
| llamacpp-model |            tg128 |    57.48 ± 0.18 | 58.33 ± 0.47 |                       |                       |                       |
| llamacpp-model |   pp2048 @ d4096 | 1193.61 ± 49.86 |              |      4886.96 ± 163.61 |      4793.91 ± 163.61 |      4887.00 ± 163.61 |
| llamacpp-model |     tg32 @ d4096 |    51.30 ± 1.81 | 52.97 ± 1.87 |                       |                       |                       |
| llamacpp-model |   pp2048 @ d4096 | 1037.91 ± 18.63 |              |       5470.18 ± 99.83 |       5377.13 ± 99.83 |       5470.22 ± 99.83 |
| llamacpp-model |    tg128 @ d4096 |    45.22 ± 0.33 | 47.33 ± 0.47 |                       |                       |                       |
| llamacpp-model |   pp2048 @ d8192 |  984.76 ± 39.95 |              |      9589.74 ± 366.00 |      9496.68 ± 366.00 |      9589.77 ± 366.00 |
| llamacpp-model |     tg32 @ d8192 |    41.97 ± 1.73 | 43.35 ± 1.78 |                       |                       |                       |
| llamacpp-model |   pp2048 @ d8192 |  871.46 ± 19.46 |              |     10633.85 ± 317.92 |     10540.79 ± 317.92 |     10633.89 ± 317.92 |
| llamacpp-model |    tg128 @ d8192 |    37.19 ± 0.20 | 39.33 ± 0.47 |                       |                       |                       |
| llamacpp-model |  pp2048 @ d16384 |  799.20 ± 52.01 |              |    21085.20 ± 1288.85 |    20992.14 ± 1288.85 |    21085.24 ± 1288.84 |
| llamacpp-model |    tg32 @ d16384 |    32.92 ± 1.67 | 34.01 ± 1.72 |                       |                       |                       |
| llamacpp-model |  pp2048 @ d16384 |  657.68 ± 19.21 |              |     25696.87 ± 825.04 |     25603.81 ± 825.04 |     25696.92 ± 825.03 |
| llamacpp-model |   tg128 @ d16384 |    26.94 ± 0.16 | 28.67 ± 0.47 |                       |                       |                       |
| llamacpp-model | pp2048 @ d204800 |  275.69 ± 70.74 |              | 720858.80 ± 156972.64 | 720765.74 ± 156972.64 | 720858.84 ± 156972.65 |
| llamacpp-model |   tg32 @ d204800 |    10.11 ± 0.33 | 11.00 ± 0.00 |                       |                       |                       |
| llamacpp-model | pp2048 @ d204800 |   223.69 ± 0.24 |              |   838401.30 ± 1606.40 |   838308.25 ± 1606.40 |   838401.34 ± 1606.40 |
| llamacpp-model |  tg128 @ d204800 |     9.69 ± 0.03 | 10.67 ± 0.47 |                       |                       |                       |

llama-benchy (0.3.2.dev1+g17b42667a)
date: 2026-03-27 22:16:22 | latency mode: generation


llama_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 3080)   | 20054 = 4218 + (15042 = 12367 +     297 +    2378) +         794 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 3080)   | 20054 = 6540 + (12720 = 11228 +     392 +    1100) +         794 |
llama_memory_breakdown_print: |   - CUDA2 (RTX 3080)   | 20054 = 6630 + (12629 = 11240 +     289 +    1100) +         795 |
llama_memory_breakdown_print: |   - CUDA3 (RTX 3080)   | 20054 = 6736 + (12525 = 10616 +     383 +    1525) +         793 |
llama_memory_breakdown_print: |   - Host               |                  3010 =   970 +       0 +    2040                |

TG Performance is better on large kv-caches, but still not on par with q8 quants

Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed spiritbuun#3 (TURBO_D). spiritbuun#1 and spiritbuun#2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Optimizations from RotorQuant PR analysis: - TheTom#2: Flat no-WHT dequant kernel (256 threads, 4 elems/thread, no shmem, no syncthreads) — 32x fewer kernel launches than per-block version - TheTom#3: Shared memory centroid LUT in flat dequant kernel - Fused V*attn CUDA kernel (ready for future custom op integration) - Fixed ggml_turbo_wht assert: ne[0] % 32 (was 128) for WHT32 blocks Attempted direct MMVQ for V (no dequant) — incorrect because WHT rotation on attention weights corrupts the V dot product. WHT orthogonality only holds when both sides use matching block structure. Reverted to working WHT linearity approach (cast + post-matmul inv WHT). Final benchmarks on Qwen3.5-35B-A3B (RTX 5090): Prefill Decode vs f16 f16 baseline: 6860 187 100% turbo3 FA: 6832 188 100% ← parity! turbo3 MMVQ: ~60 139 74% Compression: 4.6x KV cache Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

dan-and · 2026-03-28T15:17:21Z

@signalnine , would you be so kind and take a look at #13 ?

TheTom#13) Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) — e.g. GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 — previously hard-crashed with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu. Fix: - supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0, so llama.cpp falls back to an unquantised KV path instead of asserting - get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when K head dim is not 128-aligned (VEC kernel only instantiated for D∈{64,128,256}) Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned heads silently use f16 KV while 128-aligned heads continue to use turbo3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…heTom#13) Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash, DeepSeek2 MLA with head_dim=192/576) previously crashed with GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention. Changes: - llama-kv-cache.cpp: detect incompatible head_dim at KV cache init, log a warning, and auto-fall back to q8_0 instead of crashing - ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0 - fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D - ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float) so CPU flash attention can handle turbo3 K/V for models where CUDA FA returns NONE (e.g. D=192 which has no CUDA FA kernel for any type) - set-rows.cu: add tail kernel for non-128 remainder (future use) - turbo-wht.cu: head-dim-aware processing with tail pass-through - ops.cpp: CPU WHT head-dim-aware with tail identity copy Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3 and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128) continues to use turbo3 at full speed (223 t/s). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT groups instead of crashing or falling back to q8_0. Key changes: **64-element WHT groups:** - turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64 - set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size from SET_ROWS op_params (set by llama-kv-cache based on head_dim) - turbo-wht.cu: templated on group_size, head-dim-aware dispatch - ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size parameter (0=auto) to handle MLA where output dim differs from K head dim - ops.cpp: CPU WHT parameterized on group_size **MLA fix — missing Q rotation in K-only build_attn:** - build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K> = garbage. Now all three build_attn variants apply Q rotation. - Inverse WHT moved inside build_attn_mha (before v_mla projection) and also added to the non-FA attention path. **Guards + fallback:** - supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64 - llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0 - CPU type_traits_cpu turbo3 entry for CPU FA vec_dot Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s, Qwen3.5 (128) 223 t/s — all producing correct output with turbo3. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…e versa) Users can now mix turbo3 and q8_0 independently for K and V caches: -ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality -ctk q8_0 -ctv turbo3 # opposite tradeoff Changes: - New VEC FA template instances for turbo3/q8_0 cross-type pairs (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu) - fattn-vec.cuh: extern declarations for mixed instances - fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard - llama-graph.cpp: inverse WHT guard now checks V type (not K type) — when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse WHT must not fire. For MLA, V is a view of K so v->type correctly reflects K's turbo type. Tested: - 16/16 coherence tests pass (4 models × 4 KV combos) - NIAH at 4K+32K: all combos within baseline variance (10-11/11) q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11, turbo3/q8_0: 9/11 (1 extra miss, normal variance) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and zeroed out qs/signs — any CPU-only path (-ngl 0) silently produced garbage output. Now implements full PolarQuant: L2 norm → normalize → forward WHT rotation (signs1 → butterfly → signs2) → 3-bit centroid quantize → pack qs/signs → corrected norm (grp_norm / recon_norm). WHT group size (128 or 64) is communicated from the CPU SET_ROWS handler via a global variable, read from the op_params that llama-kv-cache.cpp sets based on head_dim. This handles models with different K and V head dims (e.g. DeepSeek2 K=192→64, V=128→128). Tested CPU-only (-ngl 0): - Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s - DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s - GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value, 6.4x compression vs f16. Same architecture as turbo3 but with a 4-centroid Lloyd-Max codebook instead of 8. Block format: block_turbo2_0 = 10 bytes per 32 values - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm) - qs (8B): 2-bit centroid indices, 4 per byte Usage: -ctk turbo2 -ctv turbo2 Full stack: - ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct - ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize - turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant - set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT) - dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion - fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2 - fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances - Mixed types: turbo2/q8_0 cross-type FA instances - common/arg.cpp: CLI --cache-type-k turbo2 - llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration NIAH (Qwen3.5 35B, RTX 5090, 4K-512K): 4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11 Coherence: 31/31 across 4 models x all KV combos (GPU + CPU) Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The mixed KV types commit changed the inverse WHT guard from checking k->type to v->type. For MLA, V is a view of K's first 512 elements (v->ne[0]=512, 512%128=0 → group=128). But K was quantized with 64-group WHT (k->ne[0]=576, 576%128≠0 → group=64). The mismatch produced garbage output on GLM-4.7 Flash. Fix: derive group_size from K when K is turbo (since K determines the rotation used during quantization), from V only when K is not turbo. Tested: 16/16 coherence (4 models × 4 KV combos), including GLM-4.7. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

InnerQ equalizes K channel variances before WHT rotation to reduce quantization error on models with anisotropic K distributions. Enabled via TURBO_INNERQ=N env var (N = calibration token count). Pipeline: calibrate per-channel K² stats → compute scales → apply scale to K before WHT (SET_ROWS) → apply 1/scale to Q and V output via ggml_turbo_wht op (scale passed as src[1]). Math: <Q/s, s*K> = <Q, K> preserves dot products. Key design: - d_innerq_active starts at 0 (CUDA zero-init) — kernel never multiplies by uninitialized scales - Auto-disables for 64-group models (group_size < 128) where channel mapping is not 1:1 across WHT groups - Auto-disables when max channel ratio < 1.2 (already balanced) - Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv tensor updates between SET_ROWS and graph-side WHT - Scale_inv tensor stored in KV cache, initialized to identity Also: turbo2 (2-bit) now requires head_dim divisible by 128. At 64-group WHT, the 4-centroid codebook doesn't have enough resolution — DeepSeek (head_dim=192) produced garbage. turbo2 auto-falls back to q8_0 for non-128-aligned heads with a warning. Tested: 18/18 coherence (4 models × 4+ KV combos), InnerQ active on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

In the VEC flash-attention kernel, skip V dequantization for KV positions where ALL attention weights are below 1e-6. At long context most positions contribute noise, not signal — skipping their V dequant saves compute with zero quality impact. Both half2 and f32 paths are covered. The threshold (1e-6) matches the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below this are flushed to zero. Benefit scales with context length — at short context nearly every position contributes so no skip occurs. At 512K+ most positions are skipped during generation. Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence. Generation at 512K: 25 t/s, prefill: 2535 t/s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads) causes catastrophic quality loss on some models: - DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline - GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%) The 64-group WHT passed NIAH/coherence but the weaker decorrelation (6 stages, 3 groups per 192-dim head) doesn't preserve statistical quality. Straight PolarQuant (no WHT) is even worse. All turbo types now require head_dim divisible by 128. Models with non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to q8_0 with a warning. The 64-group WHT code remains for future investigation but is not used. PPL summary (Qwen3.5, head_dim=128, wikitext-2): f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

TheTom · 2026-03-29T02:26:48Z

Testing in progress

TheTom · 2026-03-29T02:35:21Z

Metal Regression Test Report

Branch: test/pr3-merge (PR #3 merged into feature/turboquant-kv-cache)
Build: Metal compiled clean — cmake -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON
Hardware: Apple M5 Max 128GB
Merge: 1 conflict in ggml-turbo-quant.c (resolved — kept both 4-bit centroids + CPU WHT sign arrays)
ISWA fix: Updated to new 5-arg ggml_turbo_wht signature, removed redundant V inverse WHT from ISWA overload (now handled inside build_attn_mha)

PPL — Zero Regression

All 9 PPL values match pre-merge baselines to 4 decimal places.

Model	Type	q8_0	turbo3	turbo4	turbo3 vs q8_0	turbo4 vs q8_0
Qwen3.5-35B-A3B Q8_0	MoE	6.1109	6.1756	6.1250	+1.06%	+0.23%
Gemma 2 27B IT Q4_K_M	ISWA	7.0590	7.1489	7.1794	+1.27%	+1.71%
Qwen3.5-27B Q8_0	Dense	6.8884	7.0066	6.9378	+1.72%	+0.72%

NIAH (8K, 3 depth positions)

Model	q8_0	turbo3	turbo4
Qwen 35B MoE	2/3	3/3 ✅	2/3
Gemma 2 27B ISWA	3/3 ✅	3/3 ✅	3/3 ✅
Qwen 27B Dense	3/3 ✅	3/3 ✅	3/3 ✅

MoE 2/3 on q8_0 and turbo4 is a known baseline variance at this context length, not a regression.

Decode Speed (tg128)

Model	q8_0	turbo3	turbo4	Note
Qwen 35B MoE	85.44 t/s	75.60 t/s	76.68 t/s	Clean run ✅
Gemma 2 27B ISWA	27.58 t/s	23.96 t/s	24.54 t/s	Clean run ✅
Qwen 27B Dense	5.55 t/s	7.73 t/s	16.76 t/s	⚠️ GPU contention — ran simultaneously with long-context bench

Codex Code Review (gpt-5.3-codex)

Reviewed all shared-risk files (llama-graph.cpp, llama-kv-cache.cpp, ggml-turbo-quant.c, ggml.c, ggml.h). Findings:

Thread-safety (medium): turbo3_cpu_wht_group_size global in CPU quant path could race under concurrent quantization. Not Metal-relevant (Metal uses GPU SET_ROWS kernel).
Metal WHT kernel gap (low for now): Metal kernel_turbo_wht doesn't read new group_size from op_params[4] or src[1] (InnerQ scale). Safe for all hd128 models since group_size defaults to 128 and InnerQ scale is identity (1.0). Will need updating for non-128 head_dim support.
head_dim fallback (low): Fallback to q8_0 checks n_embd_head_k only, not n_embd_head_v. Not an issue for any model we tested (K and V head dims match on all tested architectures).
Signature migration: All ggml_turbo_wht callers updated to new 5-arg signature. Old signature would fail to compile.
ISWA V inverse: Confirmed correct — build_attn_mha now handles inverse WHT for both FA and non-FA paths. No double-rotation risk.

Merge Resolution Details

ggml-turbo-quant.c: Kept our nearest_centroid_4bit() (16 centroids for turbo4 4-bit PolarQuant) + signalnine's turbo_cpu_s1/s2 WHT sign arrays and turbo_cpu_fwht() CPU rotation function
llama-graph.cpp ISWA overload: Updated Q rotation to ggml_turbo_wht(ctx0, q, 0, 0, innerq_scale) (new signature), added GGML_TYPE_TURBO2_0 to type checks, removed V inverse WHT (now in build_attn_mha)

Verdict

No Metal regressions. All PPL, NIAH, and decode results match or exceed pre-merge baselines across MoE, Dense, and ISWA architectures. Safe to merge.

Tested via scripts/turbo-quick-bench.sh (PPL wikitext-2 c=512 8-chunk, decode llama-bench tg128, NIAH 3 positions at 8K).

…back) Instead of falling back to q8_0 for models with non-128-aligned heads, pad each head to the next multiple of 128 before WHT rotation. The padded elements are zero and don't affect dot products since WHT preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>. This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and other non-128 head_dim models. Padding overhead is wasted bits on the zero-padded quantized elements: head_dim=192 → 256 (33% overhead, still 3.5x compression) head_dim=576 → 640 (11% overhead, still 4.1x compression) Changes: - llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad k_cur/v_cur per-head via ggml_pad before set_rows, return padded views from get_k/get_v - llama-graph.cpp: pad Q per-head before turbo_wht, extract original V head_dim from attention output after inverse WHT. All three build_attn variants (KV, K-only/MLA, ISWA) updated. PPL (wikitext-2, 512 context, 8 chunks): DeepSeek (192→256): 11.61 vs 9.90 baseline (+17%) Previously: 344,304 with 64-group WHT (catastrophic) GLM-4.7 (576→640): 9.11 vs 14.97 baseline Qwen3.5 (128): 6.31 unchanged Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Resolved conflict in ggml-turbo-quant.c (kept both 4-bit centroids and CPU WHT). Updated ISWA build_attn to use new ggml_turbo_wht 5-arg signature. Removed redundant V inverse WHT from ISWA overload (now handled in build_attn_mha). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai

TheTom · 2026-03-29T02:37:36Z

Merged manually via command line — conflict in ggml-turbo-quant.c and llama-graph.cpp (ISWA signature update) resolved locally. All changes are now in feature/turboquant-kv-cache at commit 172fc85. See Metal regression test report above. Thanks @signalnine!

TheTom · 2026-03-29T02:38:44Z

To clarify the merge process: all 14 of @signalnine's commits are in feature/turboquant-kv-cache with full author attribution preserved (Gabe Ortiz gabe@signalnine.net).

We merged locally because GitHub couldn't auto-merge due to conflicts in ggml-turbo-quant.c and src/llama-graph.cpp. The resolution:

ggml-turbo-quant.c — signalnine's branch added CPU WHT rotation (sign arrays + turbo_cpu_fwht). Our branch had nearest_centroid_4bit() for turbo4 4-bit PolarQuant. Kept both.
src/llama-graph.cpp — our ISWA build_attn overload (Gemma 2 fix) needed updating to match the new 5-arg ggml_turbo_wht(ctx, tensor, direction, group_size, scale) signature and the V inverse WHT move into build_attn_mha.

The PR shows as "Closed" rather than "Merged" because we pushed the merge commit directly to the base branch. git log shows all commits with correct authorship. No work was lost or re-attributed.

TheTom · 2026-03-29T02:39:28Z

Thank you for the contributions!

dan-and · 2026-03-29T02:45:42Z

Awesome. Thanks for your hard work.

Co-Authored-By: Will Hampson <Whamp@users.noreply.github.com>

…urbo4) Per TheTom's correction: - Norm correction: shared effort (spiritbuun=turbo4, TheTom=turbo3) - turbo4 resurrection + asymmetric K/V: TheTom's work - CUDA port: signalnine (PR TheTom#3), not spiritbuun - Block-128 ≠ AmesianX block-256 (storage vs rotation group)

Full attribution for all contributors: - Madreag: 10 CUDA kernel optimizations, 36 K×V combos, 15 LA modes, 3-GPU validation (1,351+ iterations), NIAH quality testing - TheTom: Metal impl, turbo4 resurrection, asymmetric K/V, turbo3 norm correction, block-128 research, sparse V concept - signalnine: original CUDA port (PR TheTom#3), InnerQ equalization - spiritbuun: turbo4 norm correction, inverse FWHT prefill - HyperionMS2040: block-128 SET_ROWS fix (7cb6edb)

Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed #3 (TURBO_D). #1 and #2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Complete experiment log: #1 4-mag LUT: 15.1 at 8K (BEST, +38%) #2 Batched extract: 13.7 (+25%) #3 Inline FA block: 13.5 (I-cache pressure) #4 Deferred norm: 12.9 (loses ILP) #5 2-pair half2: 12.0 (ternary overhead) #6 Select chain: 11.9 (branches kill) #7 Bit-arithmetic: 11.6 (ALU too heavy) #8 FMA branchless: 11.4 (ALU still too heavy) #9 Named-reg ternary: 10.3 (branches worst) #10 Main (8-LUT): 10.95 (baseline) #11 Non-vec FA: 10.2 (wrong kernel) Ceiling: 24.5 (no dequant) Apple8 hardware truth: 1 divergent constant read < 7 ALU ops (even with fma) Branches cost MORE than divergent constant reads Array indexing ALWAYS spills on Metal 4 constant addresses is the sweet spot The 4-mag LUT is the dequant-level ceiling on Apple Silicon. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai

Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed #3 (TURBO_D). #1 and #2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Complete experiment log: #1 4-mag LUT: 15.1 at 8K (BEST, +38%) #2 Batched extract: 13.7 (+25%) #3 Inline FA block: 13.5 (I-cache pressure) #4 Deferred norm: 12.9 (loses ILP) #5 2-pair half2: 12.0 (ternary overhead) #6 Select chain: 11.9 (branches kill) #7 Bit-arithmetic: 11.6 (ALU too heavy) #8 FMA branchless: 11.4 (ALU still too heavy) #9 Named-reg ternary: 10.3 (branches worst) #10 Main (8-LUT): 10.95 (baseline) #11 Non-vec FA: 10.2 (wrong kernel) Ceiling: 24.5 (no dequant) Apple8 hardware truth: 1 divergent constant read < 7 ALU ops (even with fma) Branches cost MORE than divergent constant reads Array indexing ALWAYS spills on Metal 4 constant addresses is the sweet spot The 4-mag LUT is the dequant-level ceiling on Apple Silicon. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai

signalnine and others added 3 commits March 26, 2026 15:46

signalnine changed the base branch from master to feature/turboquant-kv-cache March 27, 2026 00:34

signalnine marked this pull request as draft March 27, 2026 05:50

signalnine marked this pull request as ready for review March 27, 2026 17:40

signalnine closed this Mar 27, 2026

signalnine reopened this Mar 27, 2026

github-actions bot added Nvidia GPU ggml labels Mar 28, 2026

github-actions bot added examples server labels Mar 28, 2026

signalnine and others added 7 commits March 28, 2026 10:01

TheTom closed this Mar 29, 2026

aminya pushed a commit to aminya/llama-cpp-turboquant that referenced this pull request Mar 29, 2026

merge: multi-GPU long-context decode benchmark data (TheTom#3)

62907c1

Co-Authored-By: Will Hampson <Whamp@users.noreply.github.com>

seanrasch mentioned this pull request Mar 29, 2026

fix: stack overflow in turbo_init_rotation + TURBO_D static_assert #23

Merged

3 tasks

signalnine mentioned this pull request Mar 29, 2026

CUDA: OOM at 64K context, SET_ROWS init failure, segfaults on RTX 3090 #14

Open

5 tasks

TheTom mentioned this pull request Mar 29, 2026

TurboQuant quality degradation on Q4_K_M base weight models + turbo4-V anomaly #27

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: CUDA port of TurboQuant3 KV cache — 3.47x compression, 98.5% of F16 decode speed on RTX 5090#3

feat: CUDA port of TurboQuant3 KV cache — 3.47x compression, 98.5% of F16 decode speed on RTX 5090#3
signalnine wants to merge 15 commits intoTheTom:feature/turboquant-kv-cachefrom
signalnine:feature/turboquant-kv-cache

signalnine commented Mar 27, 2026 •

edited

Loading

Uh oh!

signalnine commented Mar 27, 2026

Uh oh!

signalnine commented Mar 27, 2026

Uh oh!

dan-and commented Mar 27, 2026 •

edited

Loading

Uh oh!

dan-and commented Mar 28, 2026

Uh oh!

TheTom commented Mar 29, 2026

Uh oh!

TheTom commented Mar 29, 2026

Uh oh!

TheTom commented Mar 29, 2026

Uh oh!

TheTom commented Mar 29, 2026

Uh oh!

TheTom commented Mar 29, 2026

Uh oh!

dan-and commented Mar 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

signalnine commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's included

Benchmark results (Qwen3.5 35B A3B Q4_K_M, RTX 5090, tg128)

NIAH results (single-needle, Kamradt/RULER methodology)

Key design choices

Uh oh!

signalnine commented Mar 27, 2026

Uh oh!

signalnine commented Mar 27, 2026

Bug fix: VEC flash-attention Q/K stride mismatch (commit 4c91451)

Root cause

Fix (3 lines changed)

Test results (RTX 5090, Qwen3.5-35B-A3B-Q4_K_M)

Uh oh!

dan-and commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dan-and commented Mar 28, 2026

Uh oh!

TheTom commented Mar 29, 2026

Uh oh!

TheTom commented Mar 29, 2026

Metal Regression Test Report

PPL — Zero Regression

NIAH (8K, 3 depth positions)

Decode Speed (tg128)

Codex Code Review (gpt-5.3-codex)

Merge Resolution Details

Verdict

Uh oh!

TheTom commented Mar 29, 2026

Uh oh!

TheTom commented Mar 29, 2026

Uh oh!

TheTom commented Mar 29, 2026

Uh oh!

dan-and commented Mar 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

signalnine commented Mar 27, 2026 •

edited

Loading

Bug fix: VEC flash-attention Q/K stride mismatch (commit `4c91451`)

dan-and commented Mar 27, 2026 •

edited

Loading