TurboQuant - Extreme KV Cache Quantization #20969

kth8 · 2026-03-25T02:54:48Z

kth8
Mar 25, 2026

Google Research just posted a blog and paper about a new algorithm that allows quantizing the KV cache down to under 3 bits with close to 0 accuracy loss.

Blog: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/

Paper: https://arxiv.org/pdf/2504.19874

This could be huge if their claims are true and MLX developers are already jumping on this

https://x.com/Prince_Canuma/status/2036611007523512397

Thought I'd share the news here to see if llama.cpp developers would be interested in adding this feature.

ZombieWorm · 2026-03-25T12:23:28Z

ZombieWorm
Mar 25, 2026

It is also something other vendors out there are championing such as nvidia (KTVC):

Article: https://venturebeat.com/orchestration/nvidia-shrinks-llm-memory-20x-without-changing-model-weights

dentity007 Mar 31, 2026

EDIT (April 1): The data below has been corrected. Original claims about a 92.5% prompt collapse and memory paradox were wrong (RSS measurement, silent request failures). See my correction reply below for accurate numbers.

Corrected DGX Spark GB10 baseline data (llama.cpp build 8399, Nemotron-3-Nano-30B-A3B Q4_K_XL, 128K context, measured via nvidia-smi + llama.cpp internal KV buffer reporting):

Memory:

Cache KV Buffer Total GPU Savings
f16 768 MiB 23,092 MiB baseline
q8_0 408 MiB 22,732 MiB -360 MiB (-47%)
q4_0 216 MiB 22,540 MiB -552 MiB (-72%)
Prompt throughput (tok/s): unaffected by cache type

Context f16 q8_0 q4_0
~6K 1,211 1,207 1,206
~24K 1,153 1,149 1,152
~110K 815 810 813
Generation throughput (tok/s): degrades at long context due to per-token dequantization

Context f16 q8_0 q4_0 q4_0 delta
~6K 44.7 44.9 45.0 +0.7%
~24K 44.6 39.7 39.3 -11.9%
~110K 38.0 25.0 24.0 -36.8%
The generation decode overhead at 110K (37% slower with q4_0) is the bottleneck TurboQuant eliminates by enabling direct computation on quantized values.

Still planning to build from @TheTom's fork and benchmark turbo3/turbo4 on this hardware. GB10 (sm_121) would be the first Blackwell-class validation of the CUDA path.

Full data + methodology comparison: https://github.com/Memoriant/dgx-spark-kv-cache-benchmark

TheTom Mar 31, 2026

@spiritbuun

HyperionMS2040 Mar 31, 2026

@dentity007 The 92.5% prompt processing collapse at 64K is a clear demonstration of why TurboQuant matters. The unified memory result (q4_0 costing more than fp16) is also valuable -- scale/zero-point metadata overhead exceeding compression on shared-memory architectures is worth documenting.

One note on build source: @TheTom's repo (TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache) has everything integrated -- all of spiritbuun's CUDA work (merged via PRs), block_size=128 optimization (turbo3 compression 4.57x -> 5.12x), HIP/ROCm support, InnerQ, turbo4 prefill optimizations. The block_size=128 change initially broke CUDA (OOB write in set-rows.cu) but that fix is already merged (PR #32). Building from TheTom's HEAD gives you a working CUDA path with the improved compression ratios.

Your SM 121 would be the first Blackwell-class validation of the block_size=128 CUDA path. @AmesianX has Blackwell results on a separate implementation but not on this codebase. Would be interesting to see how the WHT rotation performs on that hardware.

TheTom Mar 31, 2026

Small correction on attribution. The CUDA work in my repo is from @signalnine (CUDA port merged as PR #3, plus InnerQ per-channel equalization). spiritbuun has their own separate CUDA fork with different optimizations (which is what i believedentity007 is using?). I've been collaborating with spiritbuun and he's been doing amazing work on the CUDA side. Both are great contributions, just want to make sure credit goes to the right people.

Also a few clarifications:

Block size 128 is a storage block size change (1 norm per 128-element rotation group instead of 4 identical copies). The rotation group itself is still 128 elements... this is different from AmesianX's block_size=256 which changes the rotation group size
turbo4 resurrection (7 bugs, PPL 679 → 6.125) and asymmetric K/V discovery were my work.... there's a full paper for each in my repo
Norm correction was a shared effort: spiritbuun did turbo4, I did turbo3 (different repos)

Appreciate the detailed summary though. and thanks again for the block_size=128 CUDA fix (PR #32), that was a fast catch.

dentity007 Apr 1, 2026

Correction to my post above: the benchmark data I shared was flawed. u/audioen on r/LocalLLaMA caught the methodology error and they were right.

What was wrong:

"92.5% prompt throughput collapse at 64K" -- Wrong. I measured throughput from requests that failed silently. Prompt throughput is identical across all cache types at all context lengths.
"q4_0 uses MORE memory than f16" -- Wrong. I measured RSS, which does not capture GPU memory on unified memory. Actual nvidia-smi + llama.cpp internal reporting shows q4_0 saves 552 MiB (72% KV reduction).
Corrected data (nvidia-smi + llama.cpp KV buffer, same hardware/model):

Memory:

f16: 768 MiB KV buffer, 23,092 MiB total GPU
q8_0: 408 MiB, 22,732 MiB (-47% KV)
q4_0: 216 MiB, 22,540 MiB (-72% KV)
Prompt throughput (tok/s): no difference across cache types (815/810/813 at 110K)

Generation throughput (tok/s): degrades at long context

~6K: f16 44.7 / q4_0 45.0 (+0.7%)
~24K: f16 44.6 / q4_0 39.3 (-11.9%)
~110K: f16 38.0 / q4_0 24.0 (-36.8%)
The real finding: generation decode speed degrades 37% at 110K with q4_0 due to per-token dequantization overhead. This is the bottleneck TurboQuant addresses, but it is a 37% generation penalty, not a 92.5% prompt collapse.

@HyperionMS2040 apologies for the misleading data in my earlier post. The "unified memory paradox" was a measurement artifact.

Still planning to build from TheTom's fork and benchmark turbo3/turbo4 on GB10 (sm_121). Will report corrected numbers.

Full corrected data + methodology comparison: https://github.com/Memoriant/dgx-spark-kv-cache-benchmark

unixsysdev · 2026-03-25T13:08:24Z

unixsysdev
Mar 25, 2026

I've got something going here: unixsysdev/llama-turboquant@16e93d5
Builds and works on Strix Halo - details in the README - https://github.com/unixsysdev/llama-turboquant/blob/main/README.md

PS: Closer to optimal.

1 reply

CISC Mar 25, 2026
Collaborator

Be sure to read the CONTRIBUTING.md Pull requests section, esp. the part about adding new data types.

veritatisquaesitoressumus · 2026-03-25T14:28:08Z

veritatisquaesitoressumus
Mar 25, 2026

Working TurboQuant Implementation Available
I have a working implementation of TurboQuant (Zandieh et al., ICLR 2026) ready for review and integration.
What's built:
CPU implementation (C, no dependencies): quantize, dequantize, rotation matrix generation, bit-packing. 18/18 tests passing, MSE matching the paper within 1%.
CUDA kernels: GPU quantize/dequant + fused attention dot product. Written, awaiting GPU validation.
Integration spec for llama.cpp: 6-phase plan covering GGML type registration, KV cache write/read paths, flash attention integration, CLI flags.
Pre-computed Lloyd-Max codebooks for d=128 (JSON, ready to embed as compile-time constants).
Validated results:

TQ3 (3-bit): MSE = 0.034 (paper: 0.034), 4.9x compression vs FP16
TQ4 (4-bit): MSE = 0.009 (paper: 0.009), 3.8x compression vs FP16

Memory layout: block_tq3 = 52 bytes per 128-value vector (4 byte norm + 48 bytes packed indices). Maps cleanly to attention head dimension.
Practical impact:
On a 70B Q4_K_M model with 34GB free VRAM for KV cache:
FP16 KV: ~109K token context
Q8_0 KV: ~218K token context
TQ3 KV: ~536K token context
Full 262K native context windows fit entirely in VRAM on consumer multi-GPU setups (e.g., 3x RTX 3090).
Integration approach:
The initial path (Phase 4a) is non-fused: dequantize TQ3 blocks back to FP16 before flash attention runs. Zero changes to existing FA kernels, zero risk to correctness. The fused path (Phase 4b) is an optimization that computes Q dot dequant(K) directly using pre-rotated queries.
I posted the full implementation, test results, and detailed integration spec here. Full source: https://gist.github.com/veritatisquaesitoressumus/6aa5973955007ffd858889c76aa60408
The implementation follows Algorithm 1 (TurboQuant_mse) from the paper. Algorithm 2 (QJL error correction) is omitted as the paper shows MSE-optimal quantization alone is sufficient for KV cache compression without the extra bit cost.
Build and test: gcc -O2 -o tq_test ggml_turboquant.c tq_test.c -lm && ./tq_test

0 replies

TheTom · 2026-03-25T14:32:50Z

TheTom
Mar 25, 2026

I have a working implementation of TurboQuant as native KV cache types in llama.cpp with Metal GPU support.

Repo: https://github.com/TheTom/turboquant_plus

What's working:

Two new types: turbo3 (3.25 bits/val, 4.9x compression) and turbo4 (4.25 bits/val, 3.8x compression)
--cache-type-k turbo3 --cache-type-v turbo3 works end-to-end on Apple Silicon
Metal kernels for SET_ROWS, dequantize, and flash attention
Coherent text generation on Qwen 3.5 35B-A3B and 27B Dense
Python prototype with 141 tests validating the paper's math

Benchmarks (M5 Max 128GB):

Model	q8_0	turbo3	Gap	Compression
Qwen 3.5 35B-A3B MoE	85.5 tok/s	10.7 tok/s	8x	4.9x
Qwopus 27B Dense	17.6 tok/s	5.3 tok/s	3.3x	4.9x

Compression target is met. Speed gap is from the unoptimized WHT rotation (O(d^2) per block). Working on Hadamard rotation (O(d log d)) and fused flash attention dequant next.

Gotcha for anyone else implementing this: Metal JIT silently falls back to CPU if you #include custom headers in ggml-metal.metal. Inline everything or the embedded metallib won't pick it up. Lost hours to this before realizing all my "Metal optimizations" were benchmarking the CPU path.

Happy to collaborate with anyone else working on this.

17 replies

tarruda Mar 26, 2026

Here is llama-bench I had run a few weeks ago:

% llama-bench -m ~/ml-models/huggingface/ubergarm/Qwen3.5-397B-A17B-GGUF/smol-IQ2_XS/Qwen3.5-397B-A17B-smol-IQ2_XS-00001-of-00004.gguf -fa 1 -t 1 -ngl 99 -b 2048 -ub 2048 -d 0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,150000,200000,250000
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.008 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 134217.73 MB
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -------: | -: | --------------: | -------------------: |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |           pp512 |        189.67 ± 1.98 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |           tg128 |         19.98 ± 0.01 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d10000 |        168.92 ± 0.55 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d10000 |         18.93 ± 0.02 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d20000 |        152.42 ± 0.22 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d20000 |         17.87 ± 0.01 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d30000 |        139.37 ± 0.28 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d30000 |         17.12 ± 0.01 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d40000 |        128.38 ± 0.33 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d40000 |         16.38 ± 0.00 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d50000 |        118.07 ± 0.55 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d50000 |         15.66 ± 0.00 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d60000 |        108.44 ± 0.38 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d60000 |         14.98 ± 0.01 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d70000 |         98.85 ± 0.18 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d70000 |         14.36 ± 0.00 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d80000 |         91.39 ± 0.49 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d80000 |         13.84 ± 0.00 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d90000 |         85.76 ± 0.24 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d90000 |         13.30 ± 0.00 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d100000 |         80.19 ± 0.83 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d100000 |         12.82 ± 0.00 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d150000 |         54.46 ± 0.33 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d150000 |         10.17 ± 0.09 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d200000 |         47.05 ± 0.15 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d200000 |          9.04 ± 0.02 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d250000 |         40.71 ± 0.26 |
| qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d250000 |          8.01 ± 0.02 |

build: d28961d81 (8299)

TheTom Mar 26, 2026

UPDATE: Context scaling regression FIXED.

Turns out the root cause wasn't the graph-side rotation matmul. The custom WHT op (O(d log d) butterfly) gave identical performance to the dense matmul, meaning the matmul was never the bottleneck.

The actual bottleneck: the Metal dequant was re-reading shared qs and signs bytes per element instead of batching them. Unrolling the byte extraction and eliminating the per-element loop fixed it.

Results after fix (M5 Max, Qwen3.5-35B-A3B):

Context	turbo3/q8_0
2K	0.987x
4K	0.989x
8K	0.995x
16K	0.989x
32K	0.995x

Flat at 98.7-99.5% through 32K. No degradation trend. PPL +1.1% (5.471 vs 5.414).

The fix is on experiment/context-scaling-fix branch in both repos, merging to main now.

Full investigation documenting every failed approach (custom WHT op, group-32 rotation) before finding the real fix:
https://github.com/TheTom/turboquant_plus/blob/main/docs/context-scaling-deep-dive.md

@tarruda would love to see your 397B numbers with this fix. The dequant optimization should help significantly on your M1 Ultra since the per-position cost was the main issue, and you have 100+ layers compounding it.

@Aaryan-Kapoor the dequant unroll pattern might help your CPU implementation too. Same principle: batch the byte reads instead of extracting per element.

tarruda Mar 26, 2026

I just built commit @9cd043108d61edcd34a8dddf3378d8de6856dff6 and didn't see any difference from the previous build. That is: speed still degrades very severely with increased context.

I tried to run llama-bench but it appears to fail when -ctk/-ctv parameters are passed to it

TheTom Mar 27, 2026

I just built commit @9cd043108d61edcd34a8dddf3378d8de6856dff6 and didn't see any difference from the previous build. That is: speed still degrades very severely with increased context.

I tried to run llama-bench but it appears to fail when -ctk/-ctv parameters are passed to it

@tarruda a few things since your last test:

ctk/ctv issue: llama-bench works fine with -ctk turbo3 -ctv turbo3 on my build (commit 7d1bd95). which commit/branch are you building from? if you're on upstream llama.cpp, the turbo types won't be registered. you need to build from my fork: https://github.com/TheTom/llama-cpp-turboquant (branch: feature/turboquant-kv-cache)
pre-M5 improvements since your last test: two optimizations landed that specifically help M1/M2/M3/M4 hardware:
- 4-mag LUT: auto-detected on pre-M5 devices, reduces constant memory addresses from 8 to 4 during K dequant. gave +38% decode on M2 Pro at 8K context.
- sparse V dequant: skips V dequantization for positions where softmax attention weight < 1e-6. at long context 90%+ of weights are negligible. gave +22.8% decode at 32K on M5 Max, and +5% even on plain q8_0. both are auto-enabled on your M1 Ultra, no flags needed.
M2 Pro combined results (4-mag + sparse V):
- short decode: 0.73x of q8_0 (up from 0.45x before these fixes)
- blended pp8192+tg128: 0.90x of q8_0

your 397B with 100+ layers will compound differently but these should close the gap significantly vs what you saw before. would love to see your numbers on the latest build.

if you can run this it would help me diagnose any remaining issues on M1 Ultra:

./build/bin/llama-bench -m your_model.gguf -ctk turbo3 -ctv turbo3 -fa 1 -ngl 99 -p 0 -n 128 -pg 8192,128 -pg 16384,128 -pg 32768,128

tarruda Mar 27, 2026

Hi @TheTom I will test it later today. Thanks!

Aaryan-Kapoor · 2026-03-25T21:00:26Z

Aaryan-Kapoor
Mar 25, 2026

Couldn't wait, so I spun something up; hopefully, it helps the final implementation. Feel free to cherry-pick :)

Missed a lot of things- ex. stuff to be fused, but it should offer a good starting point!

Working TurboQuant TQ3_0 implementation (CPU, both K+V cache)

Branch: https://github.com/Aaryan-Kapoor/llama.cpp/tree/turboquant-tq3_0

Implements Algorithm 1 (TurboQuant_mse) from the paper as GGML_TYPE_TQ3_0:

Algorithm: Randomized Hadamard Transform (WHT + deterministic sign flips) → 3-bit Lloyd-Max optimal scalar quantization per coordinate
Block: 32 values → 14 bytes (2B fp16 scale + 12B packed 3-bit indices) = 3.5 bpw
Scope: CPU-only, both K and V cache supported via vec_dot (enables flash attention)
Constants: Lloyd-Max centroids computed via iterative convergence (178 iterations), sign pattern from golden ratio hash, all verified numerically

Benchmarks (Qwen3.5-35B-A3B Q4_K_M, CPU, 4 threads):

KV Cache	Prompt (t/s)	Gen (t/s)	Context MiB	Compression
f16	19.3	10.6	5,182	1.0x
q8_0	19.9	10.4	~2,591	2.0x
q4_0	19.5	12.5	~1,440	3.6x
tq3_0	20.1	11.4	1,182	4.4x

Output is identical to f16 baseline on the 35B model at temperature 0. Quality degrades on very small models (0.6B) as expected - the paper's claims hold for reasonably-sized models.

Usage: --cache-type-k tq3_0 --cache-type-v tq3_0

Used Opus 4.6 for the implementation, definitely NOT PR READY! Surprised it worked tbh :)

Known limitations:

No GPU kernels (CUDA/Metal/Vulkan) - CPU only
vec_dot does full dequant per block (not fused)
Block size 32 (paper uses 128); a 128-block variant would achieve 3.25 bpw
QJL error correction (Algorithm 2) not implemented - Algorithm 1 alone is sufficient per the paper

4 replies

TheTom Mar 25, 2026

great work Aaryan. a few things I noticed that line up with my implementation (github.com/TheTom/turboquant_plus):

you also dropped QJL (Algorithm 2). I independently found the same thing. all bits to Lloyd-Max centroids is faster, simpler, and perplexity matches. the paper's two-stage approach adds overhead without practical benefit at 3-bit.
your CPU numbers are interesting. basically zero speed penalty (20.1 vs 19.3 prompt). on Metal GPU I'm at 0.78x of q8_0 prefill (2095 vs 2694 tok/s) after moving the WHT rotation from the dequant into the ggml graph. the GPU path is more sensitive to the rotation overhead because the baseline is so much faster.
your block size 32 is what I ended up with too. started at 128 (matching head_dim) but 32 gives better flash attention parallelism.
"output identical to f16 at temperature 0" matches my perplexity results (PPL 6.20 vs 6.19 baseline).

one thing to watch: I hit a nasty bug where the graph-side WHT rotation gave PPL 23.5 instead of 6.2. turned out to be ggml column-major storage silently transposing the rotation matrix. documented the full investigation at github.com/TheTom/turboquant_plus/blob/main/docs/pre-rotate-queries-investigation.md in case you go down the GPU path.

happy to compare notes. nice to see multiple implementations converging on the same findings.

Dampfinchen Mar 26, 2026

Output is identical to f16 baseline on the 35B model at temperature 0. Quality degrades on very small models (0.6B) as expected - the paper's claims hold for reasonably-sized models.

At what context length is the output identical? That is an important question as degradation from KV cache quanting can often only be seen as minor calculation errors accumulate over the context length. So I would suggest trying some complex tasks near the max context of the model and then look if the output is still identical.

Also, is the output still the same as bf16 with a q8_0 and q4_0 kv cache? If not, then that would mean the accuracy is indeed higher even at long context tasks than the current kv cache quants.

Rotatingxenomorph Mar 26, 2026

Sonnet 4.6 thinks the qjl part is important: "For attention scores specifically, unbiasedness matters because errors accumulate across all keys in softmax. A small systematic bias in every dot product can skew the softmax distribution noticeably, whereas zero-mean noise tends to cancel."

Arclabs001 Mar 28, 2026

great work Aaryan. a few things I noticed that line up with my implementation (github.com/TheTom/turboquant_plus):

you also dropped QJL (Algorithm 2). I independently found the same thing. all bits to Lloyd-Max centroids is faster, simpler, and perplexity matches. the paper's two-stage approach adds overhead without practical benefit at 3-bit.

your CPU numbers are interesting. basically zero speed penalty (20.1 vs 19.3 prompt). on Metal GPU I'm at 0.78x of q8_0 prefill (2095 vs 2694 tok/s) after moving the WHT rotation from the dequant into the ggml graph. the GPU path is more sensitive to the rotation overhead because the baseline is so much faster.

your block size 32 is what I ended up with too. started at 128 (matching head_dim) but 32 gives better flash attention parallelism.

"output identical to f16 at temperature 0" matches my perplexity results (PPL 6.20 vs 6.19 baseline).

one thing to watch: I hit a nasty bug where the graph-side WHT rotation gave PPL 23.5 instead of 6.2. turned out to be ggml column-major storage silently transposing the rotation matrix. documented the full investigation at github.com/TheTom/turboquant_plus/blob/main/docs/pre-rotate-queries-investigation.md in case you go down the GPU path.

happy to compare notes. nice to see multiple implementations converging on the same findings.

Same results on me. I found all bits on MSE (Lloyd-max) is always better in top-1 and top-5 token consistency rate, KL-divergence, etc. https://github.com/arclabs001/YATQ . QJL eliminates bias but increases variance, and:

MSE's small bias is tolerated by softmax
QJL's increased variance disrupts Top-1 ranking
MSE-only achieves better Top-K matching at the same bit budget

Madreag · 2026-03-26T22:55:41Z

Madreag
Mar 26, 2026

Got CUDA + Flash Attention turbo3 working on RTX 5090.

Ported @TheTom's Metal turbo3 kernels to CUDA with full Flash Attention support for both K and V.
Fork: https://github.com/Madreag/turbo3-cuda

Hardware: RTX 5090 32GB, CUDA 12.8, sm_120, WSL2 Ubuntu 24.04
Results (Qwen3.5-27B Q6_K, ~21GB weights):

700k had memory pressure.

NIAH: 6/6 exact retrieval
Math, factual, code gen: all passing
KV per token: ~14 KB (turbo3 K+V) vs ~64 KB (fp16) — 4.6× compression

Qwen3.5-27B is a hybrid architecture — only 16 of 64 layers have KV cache (the GatedAttention layers). 16 layers × 4 KV heads × 256 head_dim.

What's implemented (15 files, 4 new + 11 modified):
turbo3 dequantize kernels (fp16/fp32/bf16, contiguous + non-contiguous)
turbo3 quantize with WHT rotation (128-element groups, 4 blocks of 32)
FWHT CUDA kernel for GGML_OP_TURBO_WHT
Flash Attention vec_dot_KQ + dequantize_V templates for turbo3

All dispatch paths: convert, set-rows, get-rows, cpy, MUL_MAT routing (turbo3 excluded from mmvq/mmq, routed through dequant-then-cublas for MUL_MAT)

Build:
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_CUDA_FORCE_CUBLAS=OFF
Do not use CUDA 13.1 — MMQ kernel segfaults.

Known limitations:
turbo4 not yet ported to CUDA
FA required for turbo3 V (non-FA path materializes O(n²) attention matrix)
Only Tested on sm_120

5 replies

TheTom Mar 26, 2026

This is awesome, thanks for porting and testing! Great to see turbo3 running on 5090 with full FA support.

The dequant-then-cublas routing for MUL_MAT is a smart approach. @spiritbuun has a separate CUDA fork (RTX 3090) that just hit 98.8% of q8_0 prefill speed using a similar dequant-then-MMA path — bulk dequant to fp16 temp buffer during prefill, vec_dot for decode. Their fork also has a norm correction that makes turbo3 PPL beat q8_0 on CUDA. Worth comparing notes: https://github.com/spiritbuun/llama-cpp-turboquant-cuda/tree/feature/turboquant-kv-cache

A few things on the roadmap that might interest you:

Fused compressed attention for decode — precompute Q·centroid partial scores to eliminate constant memory lookups in the FA inner loop. Should flatten the decode speed curve across context depths.
Norm correction — store original_norm / ||reconstruction|| instead of raw norm. Zero decode cost, improves PPL. Already merged on the Metal side.
Auto-enable flash attention — turbo3 silently fails without FA on some hardware. Just pushed a fix that auto-enables FA when turbo cache types are detected (commit 929b8ba).

Good catch on the CUDA 13.1 MMQ segfault and the hybrid architecture detail (16/64 layers with KV cache). Will add both to our docs.

The 700K context result on 32GB is great data. Would you be willing to run our diagnostic script? It standardizes benchmarking across hardware for me to diagnose issues: https://github.com/TheTom/turboquant_plus/blob/main/scripts/README.md

Madreag Mar 26, 2026

The Metal implementation was clean to port from, really solid work!

I will check out @spiritbuun's fork, especially the norm correction — That's a nice optimization.
I'll run the diagnostics and get results up this weekend.

Ezzz-dev Mar 28, 2026

It's impossible for me to compile your branch, for some reason it's reporting me error C2065: 'M_PI' is undeclared, I am following your compilation guide. I have no issues to compile llama.cpp on my Blackwell card, but yours for some reason is bringing me this issue.

Madreag Mar 28, 2026

It's impossible for me to compile your branch, for some reason it's reporting me error C2065: 'M_PI' is undeclared, I am following your compilation guide. I have no issues to compile llama.cpp on my Blackwell card, but yours for some reason is bringing me this issue.

Fixed in the latest commit (1545941). The issue was a missing #define _USE_MATH_DEFINES before the includes in
ggml/src/ggml-turbo-quant.c — MSVC doesn't define M_PI in <math.h> without it.

Pull the latest from main and rebuild:

git pull
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build -j%NUMBER_OF_PROCESSORS%

Let me know if you hit anything else.

Ezzz-dev Mar 28, 2026

Yeah, it's giving another set of errors now:

error C2375: 'quantize_row_turbo3_0_ref':
nueva definición; vinculación distinta [C:\Users\soyal\Desktop\Proyectos\turbo3-cuda\build\ggml\src\ggml-cpu.vcxproj]
(compilando archivo de origen "../../../ggml/src/ggml-cpu/amx/mmq.cpp")
C:\Users\soyal\Desktop\Proyectos\turbo3-cuda\ggml\src\ggml-cpu\quants.h(37,6):
vea la declaración de 'quantize_row_turbo3_0_ref'

  basically means "new definition, different linking"

khimaros · 2026-03-27T01:18:02Z

khimaros
Mar 27, 2026

anyone working on Vulkan backend?

11 replies

stragulus Mar 31, 2026

@jesusmb1995 That fails for me with

home/user/git/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:6242: GGML_ASSERT(pipeline->parameter_count == descriptor_buffer_infos.size()) failed

Confirmed it runs without the cache key params

jesusmb1995 Mar 31, 2026

@jesusmb1995 That fails for me with
home/user/git/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:6242: GGML_ASSERT(pipeline->parameter_count == descriptor_buffer_infos.size()) failed
Confirmed it runs without the cache key params

@StrangeBytesDev Recently fixed a crash a AMD integrated GPU. So far for two GPU's I tested Nvidia 1080Ti and RADV GDX1150. Can you re-build with latest commit and see if it also did fix it for you?

stragulus Mar 31, 2026

@jesusmb1995 Can confirm it now passes with flying colors, and now GPU is the bottleneck as expected. I'll go play around some more, thanks for the update!

jesusmb1995 Mar 31, 2026

Great, added support for mixed K-V types (only a subset of them is supported, tbq pq q8 f16) with flash attention kernel on few recent commits. The original case that was failing for you (--cache-type-k pq4_0 --cache-type-v tbq4_0 -fa on), should work but please report if you still find bugs.

(Note: theses results below are from a different laptop, worse speed in general, disclaimer rather a short run. On this machine pq3/pq4 for both had the best speed)

==========================================
 Mixed K/V Results Summary (not exhaustive, examples)
==========================================
  K type       V type              PPL       vs f16       Time
  ------       ------              ---       ------       ----
  tbq3_0       pq3_0            7.0303        1.91%    56.88s (1.8x)
  tbq4_0       pq4_0            6.8159       -1.19%    53.49s (1.7x)
  tbq3_0       q8_0             7.0515        2.22%    52.63s (1.6x)
  tbq4_0       f16              6.7608       -1.99%    58.00s (1.8x)
  q8_0         pq3_0            6.9426        0.64%    31.20s (1.0x)
  f16          pq4_0            6.8901       -0.12%    31.41s (1.0x)
==========================================

stragulus Mar 31, 2026

==========================================
 Mixed K/V Results Summary
==========================================
  K type       V type              PPL       vs f16       Time
  ------       ------              ---       ------       ----
  tbq3_0       pq3_0            7.0171        1.74%     3.60s (1.2x)
  tbq4_0       pq4_0            6.7869       -1.60%     3.60s (1.2x)
  tbq3_0       q8_0             7.0181        1.75%     3.60s (1.2x)
  tbq4_0       f16              6.7470       -2.18%     3.60s (1.2x)
  q8_0         pq3_0            6.9226        0.37%     3.09s (1.0x)
  f16          pq4_0            6.8837       -0.19%     3.09s (1.0x)
==========================================

from your tests/test-kv-cache-quantization.sh on a radeon 7900XTX

EDIT, and non-mixed results

  Type              PPL       vs f16       Time
  ----              ---       ------       ----
  tbq3_0         7.0171        1.73%     3.60s (1.2x)
  q4_0           6.7860       -1.62%     3.10s (1.0x)
  pq3_0          7.0713        2.52%     3.10s (1.0x)
  tbq4_0         6.7869       -1.60%     3.60s (1.2x)
  f16            6.8976   (baseline)      3.09s
  q8_0           6.8717       -0.38%     3.10s (1.0x)
  pq4_0          6.7561       -2.05%     3.08s (1.0x)

Dampfinchen · 2026-03-27T09:44:02Z

Dampfinchen
Mar 27, 2026

https://github.com/spiritbuun/llama-cpp-turboquant-cuda

This is a fork of Tom's implementation with CUDA support. Results look promising.

As per their twitter account spiritbuun.

12 replies

ubergarm Mar 27, 2026

@Dampfinchen

I tested and turbo3 and turbo4 look much worse than even q4_0 in my testing: ikawrakow/ik_llama.cpp#1509 (comment)

I'm running llama-perplexity over entire wiki.test.raw. command and method shown in the linked details.

XZVB12 Mar 27, 2026

Can you try again with latest?

seems after Commit c99c230 the model loads properly if the model is fully placed in vram, Thank You
but can’t use -ngl also does not seem to be a significant reduction in memory consumption

iMilnb Mar 28, 2026

I was having good results with this fork until commit 1010625, since then I can't load the model I used anymore (OOM):

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 19905.15 MiB on device 0: cudaMalloc failed: out of memory

FWIW

llama-server -hf mradermacher/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q4_K_M -c 262144 --fit on -np 1 -fa on -t 10 --jinja -ctk turbo3 -ctv turbo3 --temp 0.6 --top_p 0.95 --top_k 20, --min_p 0.0 --presence_penalty 0.0 --repeat_penalty 1.0 --port 8001 --host 0.0.0.0

Running at 70 tp/s on a 16GB RTX 5080 (before said commit) vs ~65 with -ctx 8_0 -ctv 8_0.

spiritbuun Mar 28, 2026

I was having good results with this fork until commit 1010625, since then I can't load the model I used anymore (OOM):

Sorry about that. This is fixed now. Please pull and retry. Thanks for testing!!

iMilnb Mar 29, 2026

I was having good results with this fork until commit 1010625, since then I can't load the model I used anymore (OOM):
Sorry about that. This is fixed now. Please pull and retry.

Confirmed working, thanks!

Naster17 · 2026-03-27T10:42:35Z

Naster17
Mar 27, 2026

So it's already in the main repo of llama.cpp?

1 reply

XZVB12 Mar 27, 2026

no

Xcc313r4n7 · 2026-03-27T20:47:50Z

Xcc313r4n7
Mar 27, 2026

Is no one else seeing the obvious here?
The radix economy dictates that base 3 is the most efficient number base.
Microsoft's bitnet, which is open source proved that ternary emulated weights are more power efficient with a higher capability/parameter ratio.
No one is explicitly saying it, but turboquant is ternary as well... Everyone always points out how energy hungry modern AI is and how the human brain only uses 20 watts of power. But neurons are also ternary, having three states, resting state (0) excitatory state (+1) and inhibitory state (-1). Evolution selected for this because it selects for energy efficiency. But this isn't surprising either. Reality itself optimizes for efficiency. The proton the neutron and the electron. The fact that there are three isn't coincidental, the radix economy proves that three is the optimum efficiency. The minimum stable structure, the triangle, three dimensions of space, three aspects of time (past, present and future). Even boolean logic requires a minimum of three gates AND,OR, NOT.

5 replies

JamesFlare1212 Mar 28, 2026

The most efficient is the e-bit, which can be derived, approximately 2.718. Since 3 is closer to e than 2, it is more efficient. The Soviet Union once promoted ternary computers, but they were not continued for various reasons.

hwpoison Mar 28, 2026

The minimum stable structure, the triangle, three dimensions of space, three aspects of time (past, present and future). Even boolean logic requires a minimum of three states AND,OR, NOT.

AND, OR and NOT are operations, not states. States are 0 or 1.

Xcc313r4n7 Mar 28, 2026

The minimum stable structure, the triangle, three dimensions of space, three aspects of time (past, present and future). Even boolean logic requires a minimum of three states AND,OR, NOT.

AND, OR and NOT are operations, not states. States are 0 or 1.

true, I mistyped. STT transcribed "gates" as "states"

Xcc313r4n7 Mar 28, 2026

The most efficient is the e-bit, which can be derived, approximately 2.718. Since 3 is closer to e than 2, it is more efficient. The Soviet Union once promoted ternary computers, but they were not continued for various reasons.

It was technological limitation at the time, difficult to manufacturer, but a South Korean research team created a ternary logic chip using existing binary chip manufacturing techniques a few years ago, so it's viable now.

Xcc313r4n7 Mar 28, 2026

The most efficient is the e-bit, which can be derived, approximately 2.718

I mean most efficient integer

scos-lab · 2026-03-28T03:46:29Z

scos-lab
Mar 28, 2026

Engineering Findings from 8-Model TurboQuant Benchmark

We independently implemented TurboQuant from scratch (Python/NumPy, 49 tests, distortion matches paper ±15%) and ran systematic benchmarks across 8 models from GPT-2 (124M) to Qwen2.5-7B (7.6B). Sharing findings that may be useful for the llama.cpp integration:

Finding 1: K/V Norm Disparity

The paper does not discuss this. Modern LLMs have dramatically different Key vs Value vector magnitudes:

Model	K mean norm	V mean norm	Ratio
GPT-2 (124M)	11.8	2.0	6x
Phi-2 (2.8B)	13.1	3.0	4x
Qwen2.5-3B	172.1	3.3	52x
Qwen2.5-7B	274.0	2.6	106x
Qwen2.5-1.5B	778.6	4.3	182x

Since quantization error scales with norm squared, K needs far more bits than V. The K/V ratio predicts the optimal bit budget:

K/V < 10x   → 3-bit uniform works     (GPT-2 family)
K/V 10-60x  → 4.5-5 bit asymmetric    (Phi-2, Qwen-3B)
K/V > 100x  → 5.5+ bit or mixed prec. (Qwen-1.5B, 7B)

Finding 2: MSE > Prod for Attention

The paper recommends TurboQuantProd (QJL residual) for Keys. Our tests show MSE for both K and V works better in practice:

GPT-2 b=3	MSE (both)	Paper (Prod keys)
PPL change	+7.6%	+300%

QJL adds variance that softmax amplifies. Low variance (MSE) beats unbiasedness (Prod).

Finding 3: Outlier-Aware Mixed Precision

~5-20% of K channels (especially Layer 0) have 10-100x larger RMS than median. Storing outlier channels at 8-bit, rest at 3-bit:

Method	Avg bits	PPL change (Qwen2.5-1.5B)
Uniform K=6, V=3	4.5	+78.1%
Mixed K=3, V=3	3.6	+2.1%

Finding 4: Compressed Storage Verified

Actual memory savings: GPT-2 89% reduction, 9x compression, zero PPL impact.

Repo

Full implementation, benchmarks, and data: https://github.com/scos-lab/turboquant

~2,500 LOC Python, 49 tests, MIT license. Hope these findings help with the llama.cpp integration.

3 replies

Rotatingxenomorph Mar 28, 2026

Are you saying they missed the 300% PPL increase that you're seeing? Or did they not do what you think they did?

scos-lab Mar 28, 2026

@Rotatingxenomorph Good question. To clarify: the 300% PPL increase was specifically from using TurboQuantProd (the QJL residual correction method from Theorem 2 of the paper) for Key quantization at b=3 on GPT-2 (head_dim=64).

The paper's actual deployment likely uses:

Mixed precision — their "3.5-bit" is an average (some channels get more bits, some fewer), not uniform 3-bit
Higher head_dim — they test on Llama-3.1-8B (head_dim=128), where QJL variance is halved (scales as 1/d)
PolarQuant — a companion technique for polar coordinate decomposition that we did not implement

Our finding is that for practical attention computation, the QJL correction's added variance hurts more than its unbiasedness helps — especially at small head_dim and low bit-width. Using MSE quantization for both K and V gives much better results (the same distortion bound applies, just without the unbiasedness guarantee).

So it's not that the paper is wrong — their theoretical analysis is correct. It's that the practical implementation benefits from MSE over Prod in the regimes we tested (head_dim=64, b=2-4). At head_dim=128+ with their full system (PolarQuant + mixed precision + outlier handling), Prod may perform better.

The K/V norm disparity data is the more actionable finding for llama.cpp — it directly affects how you allocate bits between K and V quantization.

Rotatingxenomorph Mar 28, 2026

I must just be misunderstanding the whole thing. I thought they applied mse to both k and v and then QJL to the inner product estimator which also applies to both.

animehacker · 2026-03-28T05:31:52Z

animehacker
Mar 28, 2026

I've been working on extending unixsysdev's tq3_0 implementation with V cache support and flash attention. Repo here: https://github.com/animehacker/llama-turboquant

What this adds on top of unixsysdev's work:

Normalization fix (1/32 → 1/√32 for the asymmetric K-side WHT)
V cache compression (non-transposed storage + graph-side dequant to work around GGML's element-scatter path)
Flash attention with tq3_0 (dequant tq3_0 → F32 → F16 in the attention graph, then use existing FA kernel)
CPU backend F32 dequant path for pipeline parallelism
Tested on Llama-3.3-70B-Instruct-Q4_K_M, 2x RTX 3090:

72K context with tq3_0 K+V (4.57x compression)
WikiText-2 PPL: 4.40 vs 4.09 baseline (+7.6%)
To be clear: this implements PolarQuant (Stage 1) only — WHT rotation + 3-bit Lloyd-Max. QJL residual correction is not included.

Paper with implementation details: https://oliverchurch.com/turboquant-for-ggml-achieving-4.57x-kv-cache-compression-in-llama.cpp.html

0 replies

zcattacz · 2026-03-28T07:01:42Z

zcattacz
Mar 28, 2026

Seems like this tq3 quantization works well. When could it be used on model weights to replace the useless -q3- models?

2 replies

animehacker Mar 28, 2026

Seems like this tq3 quantization works well. When could it be used on model weights to replace the useless -q3- models?

Thanks! To clarify, tq3_0 is currently KV cache only- not model weights. It works well for KV because the quantization happens online during inference (no calibration data needed), and the WHT rotation makes any input distribution compressible with a fixed codebook.

For weights, the challenge is different. Weights are static so you can afford offline calibration, and formats like q4_K_M already exploit that with per-block scales and importance matrices. Whether WHT + Lloyd-Max beats the existing weight quants at 3-bit is an interesting open question though.

The rotation trick might help since it reduces outlier sensitivity, which is exactly what kills q3_K quality.

The great thing about this is from my testing (on MY specific setup, 2x3090s), the prompt evaluation runs at many hundreds of tokens per second so even though output is only 3-5 TPS, the input being so fast makes it great for high context situations!

brahh85 Mar 28, 2026

This is the first effort I've seen to apply turboquant to weights
https://github.com/cksac/turboquant-model

Arclabs001 · 2026-03-28T13:18:01Z

Arclabs001
Mar 28, 2026

Update Mar 30th 2026: WHT + QJL + MSE is the solution!

In @AmesianX 's implementation, PPL decreased after introducing QJL. At first I thought this is due to @AmesianX comments, i.e., The fix was using independent sign patterns for MSE WHT and QJL SRHT.
However, my implementation have already used different param for MSE and QJL:

# In class TurboQuantMSE
self.rotation = RandomRotation(dim, seed)         # Algorithm 1, matrix Π ∈ R d×d

# In class TurboQuantProd
torch.manual_seed(seed + 10000)
self.S = torch.randn(dim, dim)               # Algorithm 2, QJL projection matrix S ∈ R^{d×d}

Since the only difference is WHT (Walsh-Hadamard Transform), I implemented another version replace random rotation with WHT (https://github.com/Arclabs001/YATQ/blob/main/turboquant_wht.py)

Test Setup

Model: Qwen3-1.7B (28 layers, 8 KV heads, 128 head_dim)
Context Length: 4124 tokens for attention metrics, 1584 tokens for PPL
Baseline FP16 PPL: 4.6562
Fair Bit Allocation: Both WHT and Random Rotation use (bits-1) for MSE + 1 bit for QJL

Perplexity Comparison (Random Rotation vs WHT)

Config	Random PPL	WHT PPL	Random Δ	WHT Δ	Compression
2b MSE	9792	4080	+9787	+4075	8x
2b QJL	16128	2800	+16123	+2795	8x
3b MSE	2048	624	+2043	+619	5.33x
3b QJL	3376	2048	+3371	+2043	5.33x
4b MSE	604	10.12	+599	+5.47	4x
4b QJL	1408	93	+1403	+88	4x
6b MSE	4.78	4.62	+0.12	-0.03	2.67x
6b QJL	4.72	4.66	+0.06	+0.00	2.67x
8b MSE	4.66	4.66	+0.00	+0.00	2x
8b QJL	4.62	4.62	-0.03	-0.03	2x

Attention Score Metrics Comparison

Config	Method	CosSim	Top1%	Top5%	Variance
2b MSE	Random	0.9975	63.8	88.8	320387
2b MSE	WHT	0.9973	62.1	88.4	369370
2b QJL	Random	0.9909	50.0	74.6	383961
2b QJL	WHT	0.9964	67.9	90.6	32637
3b MSE	Random	0.9992	72.8	94.6	48691
3b MSE	WHT	0.9993	73.2	96.0	126877
3b QJL	Random	0.9967	61.2	82.6	61805
3b QJL	WHT	0.9988	64.7	91.1	21797
4b MSE	Random	0.9998	79.9	99.6	7048
4b MSE	WHT	0.9998	83.0	98.2	6021
4b QJL	Random	0.9990	69.6	94.2	29223
4b QJL	WHT	0.9996	78.6	97.8	5455
6b MSE	Random	1.0000	93.3	99.6	758
6b MSE	WHT	1.0000	96.4	99.6	255
6b QJL	Random	0.9999	89.7	99.6	1804
6b QJL	WHT	1.0000	91.5	99.6	503
8b MSE	Random	1.0000	92.9	99.6	150
8b MSE	WHT	1.0000	99.1	100.0	18
8b QJL	Random	1.0000	94.6	100.0	181
8b QJL	WHT	1.0000	98.7	100.0	37

Observations

WHT significantly outperforms Random Rotation at lower bits (2-4 bits):
- 4-bit MSE: WHT PPL 10.12 vs Random PPL 604 (59.65x better)
- 4-bit QJL: WHT PPL 93 vs Random PPL 1408 (15.14x better)
At higher bits (6-8), both methods converge to baseline PPL:
- Performance difference becomes negligible
WHT QJL variance dramatically lower:
- 2b QJL: WHT variance 32637 vs Random 383961 (11.7x lower)
- 4b QJL: WHT variance 5455 vs Random 29223 (5.4x lower)
QJL hurts Random Rotation but benefits WHT:
- Random: 4b QJL PPL 1408 > 4b MSE PPL 604 (QJL makes it worse)
- WHT: 4b QJL PPL 93 is reasonable with additional compression info

Finally, why "random rotation" + QJL makes it worse but WHT + QJL makes it better is still a mystery to me. As in the paper, the author says they used random rotation.

# The paper, Algorithm 1, line 2
Generate a **random rotation** matrix Π ∈ R d×d

(this is infer from claude, maybe explain something)
WHT+QJL vs Random Rotation+QJL:

- WHT is a deterministic orthogonal transform that spreads energy uniformly across dimensions, reducing variance in 
the data structure before QJL estimation.                                                                           
- Random Rotation introduces additional randomness on top of QJL's existing randomness, compounding the variance.   
- QJL relies on Johnson-Lindenstrauss random projections for inner product estimation. Its accuracy depends on input
 variance.                                                                                                          
                                                                                                                    
Result: WHT pre-conditioning reduces variance → QJL estimator becomes more accurate. Random rotation adds variance →
 QJL estimator degrades. The key insight is that deterministic WHT and stochastic QJL complement each other, while  
two stochastic operations (random rotation + QJL) amplify errors.

Mar 28th

Hey everyone! I just finished reproducing TurboQuant (ICLR 2026) purely in torch. This repo supports real QJL by rewriting whole attention and forward process for QWen3 models. And I found the result independantly: In the same bits budget, k-bit MSE is better than k-1 bit MSE + 1 bit QJL

Repo link: https://github.com/arclabs001/YATQ

Background

TurboQuant proposes a clever way to quantize KV caches:

Random rotation + Lloyd-Max scalar quantization (MSE-optimal)
Optional: Add 1-bit QJL on residuals for unbiased inner product estimation

The paper claims QJL eliminates quantization bias, which sounds great in theory. So I implemented both stages and ran extensive tests.

The Surprising Part

QJL actually hurts performance in practice.

Here's what I found on Qwen3-1.7B (4K context), top-1 token consistency rate drops:

Bits	MSE-only Top1%	+QJL Top1%	Difference
2	65.6%	50.0%	-15.6%
3	71.0%	61.2%	-9.8%
4	80.4%	69.6%	-10.7%
6	94.2%	89.7%	-4.5%
8	96.9%	94.6%	-2.2%

MSE-only consistently wins on Top-1 token matching. The gap is huge at low bits and still noticeable at 8-bit.

What's Going On?

The theory says QJL = no bias. That's true! But here's the trade-off:

Bits	Bias Improvement	Variance Increase
2	-11.65%	+17.2%
3	-2.83%	+30.7%
4	-0.88%	+329.8%
6	-0.02%	+368.6%

QJL eliminates bias but explodes variance. And for attention, variance is worse than bias!

Why? Softmax is tolerant to uniform bias:

softmax(scores + constant_bias) ≈ softmax(scores)

But variance randomly perturbs each score, which messes up Top-K ranking:

argmax(scores + noise) ≠ argmax(scores)

So you get "unbiased" estimates that give you the wrong Top-1 token more often.

Another Thing: Both Keys and Values Don't Need QJL

I also tested whether V should use QJL. Short answer: nope.

Bits	MSE only V reconstruction Error	with QJL V reconstruction Error	V-MSE Difference
3	2.45	8.37	3x worse
4	0.67	2.45	3x worse

Values only do weighted sum, so softmax naturally averages out per-vector errors. QJL wastes 1 bit on useless residual info.

My Takeaway

For KV cache quantization:

MSE-only is the way to go at any bit budget
QJL's bias elimination comes at too high a cost (variance explosion)
At ≥3 bits, MSE's bias is already <3% anyway

The implementation is open source if anyone wants to dig deeper or challenge these findings: https://github.com/arclabs001/YATQ

Would love to hear thoughts from the community! Did I miss something? Are there scenarios where QJL actually shines?

0 replies

karambaso · 2026-03-28T18:24:39Z

karambaso
Mar 28, 2026

Why not to compress the weights?

For small quants there are very few values per 4/3 bits (16 or 8), it means there are a lot of equal values. Very simple encoding with bit strings easily reduce model's size twice or even more. It requires some computation to uncompress, but it is done in cache and takes a small time when inference is not compute bound, but memory throughput, so there is an extra time for decompression. Prompt processing will be a bit slower, but token generation increases twice or more. A big leap to ignore it.

0 replies

Madreag · 2026-04-01T07:06:51Z

Madreag
Apr 1, 2026

TurboQuant CUDA Optimized & Benched

Summary

Full optimized CUDA implementation of TurboQuant KV cache compression for llama.cpp, targeting NVIDIA GPUs (SM86+). All 4 turbo types implemented with native flash attention, parallel SET_ROWS encoding, and aggressive kernel optimizations. Validated across 4 GPUs spanning 3 architecture generations (Ampere, Ada, Blackwell).

Repo: Madreag/turbo3-cuda branch release/cuda-optimized

Forked from TheTom/llama-cpp-turboquant (feature/turboquant-kv-cache branch), which provides the Metal implementation, turbo4 resurrection, asymmetric K/V discovery, norm correction, block-128 storage research, and sparse V concept. This work builds on @TheTom's foundation with CUDA-specific kernel optimizations for llama.cpp KV cache compression.

Key results (RTX 5090, Qwen 3.5 27B Q6_K):

turbo2 beats q8_0 by 5.4% at 32K (58.61 vs 55.60 tok/s) at 7.5× compression
+13-69% decode improvement at 32K from kernel optimizations (verified on 4 GPUs)
turbo2 at 256K: 42.57 tok/s on consumer 5090 with Q4_K_M weights — context where q8_0/f16 OOM
Zero PPL impact from sparse V skip optimization (proven by control test)
NIAH: q8_0/turbo3/turbo2 100% on 5090; q8_0/turbo3 100% on 3090 & 4090M; all types 92% on 3090 Ti
Cross-GPU validated: 4 GPUs (SM86×2/SM89/SM120), 1,351+ iterations

1. Decode Performance (RTX 5090, 27B Q6_K)

Type	bpv	Compression	Short (tg128)	32K (d=32768)	vs q8_0 32K
q8_0	8.5	1.88×	63.40	55.60	baseline
turbo4	4.25	3.76×	63.70	56.73	+2.0%
turbo3	3.125	5.12×	63.55	55.84	+0.4%
turbo2	2.125	7.53×	65.50	58.61	+5.4%
turbo1.5	2.00	8.0×	63.13	55.16	-0.8%

Measured with llama-bench -d 32768 (tg128 @ depth), ±0.3% variance. turbo2 is the 32K champion at +5.4% vs q8_0 despite 7.53× compression. All types match q8_0 at short context.

2. Prefill Context Scaling (tok/s)

Context	q8_0	turbo4	turbo3	turbo2	turbo1.5
pp512	3,512	3,548	3,547	3,649	3,577
pp4096	3,457	3,494	3,495	3,452	3,467
pp8192	3,390	3,390	3,414	3,394	3,394
pp16384	3,347	3,304	3,304	3,304	3,304
pp32768	2,839	2,815	2,801	2,805	2,808

Prefill auto-dequants turbo→fp16 and uses MMA/TILE kernels. All types track q8_0 with negligible overhead.

3. Quality: Perplexity (wikitext-103, 50 chunks)

Type	ctx=512	ctx=2048	ctx=8192	ctx=32K
q8_0	6.1825	6.4595	7.5932	8.9819
turbo4	6.2299 (+0.77%)	6.4940 (+0.53%)	7.6858 (+1.22%)	9.1930 (+2.35%)
turbo3	6.2300 (+0.77%)	6.5491 (+1.39%)	7.7308 (+1.81%)	9.2373 (+2.84%)
turbo2	6.3908 (+3.37%)	6.7675 (+4.77%)	8.2055 (+8.06%)	10.0536 (+11.93%)
turbo1.5	6.5745 (+6.34%)	7.0872 (+9.72%)	8.6123 (+13.42%)	11.1275 (+23.88%)

turbo3/turbo4 stay within 3% of q8_0 even at 32K. turbo2/turbo1.5 delta grows with context — inherent to 2-bit quantization (proven by threshold control test: PPL is bit-identical at 1e-2 and 1e-6 thresholds).

Note: PPL baseline differs between wikitext-103 (used here, q8_0=6.1825) and wikitext-2 (used in README, q8_0=6.759) — this is expected from different evaluation corpora. Relative ranking is consistent (turbo4 < turbo3 < turbo2 < turbo1.5), though absolute deltas differ between corpora.

4. Quality: KL Divergence vs f16 (100 prompts, 27B Q6_K)

Type	KLD	Top-1 Agreement	Delta-p RMS
q8_0	0.000408	100.0%	0.0153
turbo4	0.006485	99.0%	0.0488
turbo3	0.012495	93.0%	0.0664
turbo2	0.032700	91.0%	0.1146
turbo1.5	0.062681	88.0%	0.1502

5. Sparse V Skip Optimization

Skips V dequantization for positions with negligible attention weights. Control test proves zero quality impact:

Metric	Sparse V ON	Sparse V OFF	Delta
turbo3 PPL ctx=512	6.7251	6.7251	0.000
turbo3 32K speed	+4.6%	baseline	+4.6%

Skip rates (Qwen3-1.7B, direct attention weight measurement):

Threshold 1e-6: 9.1% at 512, 20.7% at 2K, 28.4% at 4K
Threshold 5e-3 (turbo3/4): 96.8% at 512, 99.3% at 2K
Threshold 1e-2 (turbo2/1.5): 98.4% at 512, 99.7% at 2K

6. Norm Correction (TheTom turbo3 + spiritbuun turbo4)

Rescales reconstructed vector norm to match original magnitude. Impact:

Metric	With Correction	Without	Improvement
turbo3 PPL ctx=512	6.7251	6.7494	-0.36%

7. Asymmetric K/V Quality Matrix (PPL ctx=512)

K \ V	q8_0	turbo4	turbo3	turbo2
q8_0	6.6395	6.6935	6.6885	6.8630
turbo4	6.6580	6.7102	6.7088	6.8821
turbo3	6.6698	6.7259	6.7251	6.8849
turbo2	6.8168	6.8687	6.8429	7.0396

V type dominates PPL (columns vary more than rows). K=turbo3/V=turbo3 ≈ K=q8_0/V=turbo3.

8. Impact of CUDA Kernel Optimizations

Measured by comparing the base TurboQuant implementation (feature/turboquant-kv-cache branch) against the optimized fork on the same GPU, same model, back-to-back. All speed with -d flag (tg128 @ depth).

RTX 5090 (27B Q6_K)

Type	Before	After	Improvement
Short (all types)	63-65	63-65	~tie
turbo4 32K	38.88	56.73	+45.9%
turbo3 32K	46.62	55.84	+19.8%
turbo2 32K	51.69	58.61	+13.4%

RTX 3090 (9B Q8_0)

Type	Before	After	Improvement
q8_0 32K	56.91	61.0	+7.2%
turbo4 32K	35.63	60.28	+69%
turbo3 32K	44.79	56.82	+27%
turbo2 32K	53.21	63.12	+19%
turbo3 64K	33.43	49.27	+47%
turbo2 64K	42.45	56.91	+34%

RTX 4090M (9B Q8_0)

Type	Before	After	Improvement
Short (all types)	55-56	55-56	~tie
q8_0 32K	48.2	52.0	+8%
turbo4 32K	34.5	52.4	+52%
turbo3 32K	40.3	49.0	+22%
turbo2 32K	44.9	52.7	+17%

Pattern: Short context is identical (weight-loading bound). Optimizations show at 32K+ where KV bandwidth dominates: +13-69% improvement across 4 GPUs. Advantage grows with context depth (64K: +34-47%). turbo4 benefits most because its larger KV amplifies the unoptimized dequant cost.

PPL (3090): q8_0 identical (9.3731). turbo4/turbo3 within noise. Optimized turbo2 PPL = 9.61 vs base 9.84 (2.3% better).

9. Per-GPU Performance Details

RTX 3090 Ti (SM86, OC +2200 mem, Qwen 3.5 9B Q8_0)

Decode speed (-d flag, tg128, ±0.3% variance):

Type	Short	32K	64K
q8_0	91.01	77.44	OOM
turbo4	90.03	75.55	OOM
turbo3	90.35	75.01	61.47
turbo2	90.75	81.58	72.79
turbo1.5	90.13	74.85	63.44

turbo2 beats q8_0 by 5.3% at 32K (81.58 vs 77.44). turbo2 64K = 72.79 where q8_0 OOMs.

PPL (base vs optimized): q8_0 identical (8.5249). turbo3 within noise (+0.2%).

NIAH (25 tests, 4K-64K, max_tokens=4000): q8_0=turbo3=turbo2=92%, turbo1.5=100%.

4090M (SM89, 16 GB, Qwen 3.5 9B Q8_0)

Speed in Section 8 above. PPL (base vs optimized): q8_0 identical (9.3737). turbo3/turbo4 within noise.

NIAH (max_tokens=4000): q8_0=turbo3=100%, turbo2=95%, turbo1.5=50%.

10. Cross-GPU Validation (1,351+ iterations, zero failures)

GPU	SM	VRAM	Total Iterations	PPL Checks
RTX 5090	SM120	32 GB	340+	Continuous
RTX 3090 Ti (OC)	SM86	24 GB	486+	48/48 bit-exact
RTX 3090	SM86	24 GB	100+	PPL bit-exact
RTX 4090M	SM89	16 GB	425+	14+ bit-exact
Total			1,351+	62+ bit-exact

11. NIAH (Needle-in-a-Haystack)

RTX 5090 (Qwen 3.5 9B Q8_0, max_tokens=4000):

Type	4K	8K	16K	32K	Overall
q8_0	100%	100%	100%	100%	100%
turbo3	100%	100%	100%	100%	100%
turbo2	100%	100%	100%	100%	100%

Earlier testing data (q8_0=85%, turbo3=70%) was caused by Qwen 3.5 thinking model exhausting max_tokens=1000. Raising to 4000 resolves all token exhaustion failures.

RTX 3090 Ti (25 tests 4K-64K, max_tokens=4000): q8_0=turbo3=turbo2=92%, turbo1.5=100%. With sufficient token budget, all types converge — remaining failures at 32K/64K depth 10% are model-specific, not turbo degradation.

RTX 3090 (20 tests 4K-32K, max_tokens=4000): q8_0=100%, turbo3=100%, turbo2=95% (1 failure at 32K/10%), turbo1.5=60%.

RTX 4090M (20 tests 4K-32K, max_tokens=4000): q8_0=100%, turbo3=100%, turbo2=95%, turbo1.5=50%. Previous 90% scores at max_tokens=2000 were token exhaustion — resolved with 4000. turbo1.5 improved from 35% to 50% but remains degraded (real 2-bit quality issue).

12. Optimizations Applied

Optimization	Impact
q8_1 Q + LUT scoring (8-wide turbo3/turbo2)	turbo3 32K +7%, turbo2 +13%
nthreads_KQ=8 for all types	turbo4 +17.7% at 32K
`__expf` fast-math softmax	+3.69% at 32K
Constexpr centroids	Compiler register-allocates
Sparse V skip (5e-3 / 1e-2)	+4.6% at 32K, zero PPL cost
Block-128 storage	5.12× turbo3, 7.53× turbo2
`__launch_bounds__(128,3)`	+7-13% at 32K
Norm correction (TheTom + spiritbuun)	-0.36% PPL

13. Limitations

D∈{64, 128, 256} only: VEC kernel requires D % 64 == 0. D=96 falls back to non-FA mul_mat (slower but correct)
SM120 D=256 LUT disabled: NVIDIA codegen bug (NVBUG 5218000/5288270). LUT works on SM86/SM89
2-bit types at long context: turbo2/turbo1.5 PPL delta grows with context length (inherent to quantization, not threshold)
VEC kernel ceiling: 168 registers, 98.4% utilization. 37 failed optimization attempts confirm no further code-level gains possible

14. Configuration Recommendations

Use Case	Recommended	Why
Best quality	turbo4	3.76× compression, lowest KLD (+0.77% PPL)
Best balance	turbo3	< 3% PPL delta at 32K, 5.12× compression
Max compression	turbo2	7.53× compression, fastest at 32K
Extreme compression	turbo1.5	8× compression, usable at short context
Long-context turbo2	turbo2 + Boundary V	Auto-enables mode 12 (first4+last4 q8_0-V)

15. Attribution

TurboQuant paper: Google Research (ICLR 2026, arXiv 2504.19874)
TheTom: Metal implementation, turbo4 resurrection (7 bugs), asymmetric K/V discovery, turbo3 norm correction, block-128 storage research, sparse V concept, quality validation
signalnine: Original CUDA port (PR Add missing headers for memcpy and assert #3 to TheTom's repo), InnerQ equalization, pre-rotate-queries architecture
spiritbuun: turbo4 norm correction (separate CUDA fork), inverse FWHT prefill
HyperionMS2040: Block-128 SET_ROWS fix
Madreag: CUDA kernel optimizations (LUT, sparse V, __expf, nthreads_KQ=8, constexpr centroids, block-128 validation, D=256 workaround, Q_reg fix, auto Boundary V), cross-GPU validation (1,351+ iterations)

Repo: Madreag/turbo3-cuda branch release/cuda-optimized

2 replies

TheTom Apr 1, 2026

Congrats on the release, this is incredible work. +69% on the 3090 at 32K is great. The asymmetric K/V matrix independently confirming V dominance from CUDA is exactly the kind of cross-validation we need.

Thank you for the attribution, really appreciate it. Let's keep it up team!

signalnine Apr 1, 2026

Was able to validate the optimizations on my 5090 rig. Great work!

yajo · 2026-04-01T08:07:59Z

yajo
Apr 1, 2026

Amazing.

As we move into the experimental phase with these kernels, I’ve been reflecting on how TurboQuant could serve as the 'structural glue' for an even broader efficiency stack. Would this combination make sense in the future?

BitNet (1.58b) at the training phase: Ternary-native models using int sums instead of float multiplications.
TurboQuant at the fine-tuning phase: Ensuring that data (especially the KV Cache) flows between memory and compute at sub-3bit precision without bottlenecks.
LFM (Liquid Foundation Models) as the 'Brain Logic': Moving away from static Transformer blocks toward a continuous, fluid state. This could theoretically allow for near-infinite context with sub-linear memory growth.

Each of these pieces multiply a lot models performance. If we eventually combine all those, we might be looking at a significant leap in local LLM feasibility. Would this potentially let us fit 70B+ models into commodity hardware with let's say 16GB VRAM, without losing much speed or intelligence?

Quick AI-assisted calculations lead me to 2000x performance improvements. Seems a bit mind-blowing. Am I oversimplifying that synergy? Would llama.cpp be able to run all this? Has anyone else explored this 'Full-Stack Efficiency' horizon?

2 replies

zcattacz Apr 1, 2026

Speaking of bitnet, their claim alone sounds drastic enough to make commodity hardware viable.

By that measure, 1-bit Bonsai 8B achieves an intelligence density score of 1.06/GB. Among nearby models by parameter-count, the closest, Qwen3 8B scores 0.10/GB.
https://prismml.com/news/bonsai-8b

Xcc313r4n7 Apr 1, 2026

Amazing.

As we move into the experimental phase with these kernels, I’ve been reflecting on how TurboQuant could serve as the 'structural glue' for an even broader efficiency stack. Would this combination make sense in the future?
* **BitNet (1.58b)** at the training phase: Ternary-native models using int sums instead of float multiplications.

* **TurboQuant** at the fine-tuning phase: Ensuring that data (especially the KV Cache) flows between memory and compute at sub-3bit precision without bottlenecks.

* **LFM** (Liquid Foundation Models) as the 'Brain Logic': Moving away from static Transformer blocks toward a continuous, fluid state. This could theoretically allow for near-infinite context with sub-linear memory growth.
Each of these pieces multiply a lot models performance. If we eventually combine all those, we might be looking at a significant leap in local LLM feasibility. Would this potentially let us fit 70B+ models into commodity hardware with let's say 16GB VRAM, without losing much speed or intelligence?

Quick AI-assisted calculations lead me to 2000x performance improvements. Seems a bit mind-blowing. Am I oversimplifying that synergy? Would llama.cpp be able to run all this? Has anyone else explored this 'Full-Stack Efficiency' horizon?

You might find this interesting as well from the Kimi team:

https://arxiv.org/abs/2603.15031

lloyal-research · 2026-04-01T08:24:02Z

lloyal-research
Apr 1, 2026

Except, recurrent memory comes at a qualitative cost, I don't think moving away from transformers is necessary. We already support SSM/Hymba and you can run LFM style models today. Besides this can continuous context can also be emulated externally via continuous tree batching + pruning outside the model boundary - for instance, see https://docs.lloyal.ai/learn/introduction - uses kv seqIds for exactly that for static transformers.

…

On Wed, 1 Apr 2026 at 7:08 pm, Jairo Llopis ***@***.***> wrote: Amazing. As we move into the experimental phase with these kernels, I’ve been reflecting on how TurboQuant could serve as the 'structural glue' for an even broader efficiency stack. Would this combination make sense in the future? - *BitNet (1.58b)* at the training phase: Ternary-native models using int sums instead of float multiplications. - *TurboQuant* at the fine-tuning phase: Ensuring that data (especially the KV Cache) flows between memory and compute at sub-3bit precision without bottlenecks. - *LFM* (Liquid Foundation Models) as the 'Brain Logic': Moving away from static Transformer blocks toward a continuous, fluid state. This could theoretically allow for near-infinite context with sub-linear memory growth. Each of these pieces multiply a lot models performance. If we eventually combine all those, we might be looking at a significant leap in local LLM feasibility. Would this potentially let us fit 70B+ models into commodity hardware with let's say 16GB VRAM, without losing much speed or intelligence? Quick AI-assisted calculations lead me to 2000x performance improvements. Seems a bit mind-blowing. Am I oversimplifying that synergy? Would llama.cpp be able to run all this? Has anyone else explored this 'Full-Stack Efficiency' horizon? — Reply to this email directly, view it on GitHub <#20969?email_source=notifications&email_token=BZQNGKHSJFX3FF4D3NTJJTD4TTFAZA5CNFSNUABIM5UWIORPF5TWS5BNNB2WEL2ENFZWG5LTONUW63SDN5WW2ZLOOQXTCNRUGA2TGNBQUZZGKYLTN5XKM3LBNZ2WC3FFMV3GK3TUVRTG633UMVZF6Y3MNFRWW#discussioncomment-16405340>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/BZQNGKGGC3V5AFXBGTYH4DL4TTFAZAVCNFSM6AAAAACW6IA4XKVHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTMNBQGUZTIMA> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.*** com>

0 replies

AmesianX · 2026-04-01T11:50:31Z

AmesianX
Apr 1, 2026

TurboQuant v1.3.0 — Bulletproof head_dim Detection + All Reported Issues Fixed

Hi all, v1.3.0 addresses every issue reported in this thread. > Build Notice: Previous v1.3.0 binaries had incorrect CUDA architecture (SM 52 default) and missing CUDA runtime DLLs on Windows. All binaries have been pulled and are being rebuilt with SM 75/80/86/89/90/120 support. New binaries uploading soon. Apologies for the inconvenience.

Release: https://github.com/AmesianX/TurboQuant/releases/tag/v1.3.0

Issues Resolved

Issue	Reporter	Status
Phi-4, DeepSeek auto head_dim detection failure	@fritolays	Fixed — P1→P5 priority cascade
turbo4-K PPL explosion (18,202 on Cydonia-24B)	@TheTom	Not reproduced — see note below
GLM head_dim=576, Qwen3-4B head_dim=80	@fritolays, @sztlink	Fixed — pow2 check + q8_0 fallback
"////" output on Qwen3.5-27B-UD	@modderBUG	Not reproducible on v1.3.0
llama-bench TBQ types not accepted	@sztlink	Fixed — 16 types + 4 shorthands
Windows OpenSSL DLL dependency	@sztlink	Fixed — standalone builds (no external DLLs)

Key Change: head_dim Detection Completely Redesigned

The old code only read {arch}.attention.key_length from GGUF metadata. Most models don't store this key, so TurboQuant silently disabled itself on Phi-4, DeepSeek, and many others.

New detection uses a 5-level priority cascade with cross-validation:

P1: attention.key_length (100% — GGUF authoritative)
P2: attention.key_length_mla (100% — MLA models like DeepSeek V2)
P3: attention.key_length_swa (100% — SWA models like Gemma 2/3)
P4: attention.value_length (95% — cross-check)
P5: n_embd / n_head (70% — fallback, unreliable for MoE)

All signals are logged for diagnostics:

TurboQuant head_dim signals — key=128 val=128 computed=64 mla_k=0 mla_v=0 swa_k=0
[P1✓ P5✗] key_length=128 but n_embd/n_head=64 — using P1

Critical Discovery: n_embd/n_head is WRONG for many models

Model	n_embd/n_head	Actual head_dim	Consequence without P1
Qwen3-30B-A3B (MoE)	64	128	Wrong WHT block size → garbage
Qwen3.5-27B	213	256	TurboQuant disabled entirely

turbo4-K PPL explosion was not reproduced in our fork. Our turbo4-K gives PPL 6.73 (+7.5% vs F16 6.26) on Qwen3-30B-A3B — completely normal. After reviewing @TheTom's turbo4-resurrection paper, the PPL explosion in his fork was caused by 7 kernel-level bugs (SET_ROWS turbo3/turbo4 packing mismatch, missing QJL steps, etc.), not head_dim misdetection. Our implementation was built independently and does not share these bugs.

Benchmark (v1.3.0, Qwen3-30B-A3B Q4_K_M, DGX Spark GB10)

Config	PPL	vs F16
f16/f16	6.26	baseline
tbq4/tbq4	6.73	+7.5%
tbq3/tbq3	8.49	+35.6%

@fritolays @TheTom @sztlink @modderBUG — would appreciate retesting with v1.3.0 on your hardware once builds are up. The standalone Windows build (-standalone.zip) no longer requires OpenSSL DLLs.

Build Notice: Previous v1.3.0 binaries had incorrect CUDA architecture (SM 52 default) and missing CUDA runtime DLLs on Windows. All binaries have been pulled and are being rebuilt with SM 75/80/86/89/90/120 support. New binaries uploading soon. Apologies for the inconvenience.

Release: https://github.com/AmesianX/TurboQuant/releases/tag/v1.3.0

4 replies

dagbdagb Apr 1, 2026

New detection uses 6 signals with strict priority cascade:

[proceeds to list 5 signals]

Is the comment incorrect, or the list short of one signal?

AmesianX Apr 1, 2026

@dagbdagb Good catch — fixed. It's a 5-level priority cascade (P1→P5), each cross-validated against other available signals. Updated the comment. Thanks!

AmesianX Apr 1, 2026

@sztlink Good news — v1.3.0 handles this. head_dim=80 (Qwen3-4B) now gets caught by the power-of-2 validation and falls back to q8_0 with a clear warning instead of silently failing. The WHT kernel still requires pow2 dimensions, but at least users get an explicit message and the model runs normally with standard quantization.

Also, your RTX 4090 benchmark on Qwen3-30B-A3B is great data — thank you. With v1.3.0 the head_dim detection is much more robust (P1→P5 cascade caught that n_embd/n_head=64 is wrong for this model, actual head_dim=128). Would be great to see your numbers with the new build when you get a chance.

AmesianX Apr 1, 2026

@yajo Interesting idea. TurboQuant operates on the KV cache (inference-time activation compression), while BitNet operates on model weights (training-time). They are orthogonal — in principle you could have BitNet weights + TurboQuant KV cache. The challenge would be that BitNet models already have very low-precision activations, so the WHT rotation noise budget is tighter. Worth exploring but no data yet.

AmesianX · 2026-04-01T12:25:09Z

AmesianX
Apr 1, 2026

v1.3.0 Benchmark: Qwen3.5-27B Distilled Model (Claude 4.6 Opus Reasoning)

@TheTom reported turbo4-K PPL 8.22 on "Qwen3.5-27B distill" — here is our retest with v1.3.0 (P1→P5 head_dim fix applied).

Model: Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled Q4_K_M
Hardware: DGX Spark GB10 (NVIDIA GB10, 128GB unified memory)
head_dim: 256 (TBQP uses QJL correction at this dimension)

PPL + KV Memory + Speed

Config	PPL	vs F16	KV Memory	Compress	Prompt t/s	Gen t/s
f16/f16	6.16	baseline	8,192 MiB	1.0x	94.6	11.6
tbqp4/tbq4	6.20	+0.6%	2,096 MiB	3.9x	81.7	11.1
tbqp3/tbq3	6.23	+1.1%	1,584 MiB	5.2x	79.9	11.2
tbq4/tbq4	7.01	+13.8%	—	—	—	—
tbq3/tbq3	7.11	+15.4%	—	—	—	—

Key finding: tbqp3/tbq3 at 5.2x compression with only +1.1% PPL — essentially lossless at head_dim=256 with QJL. This confirms turbo4-K works correctly in our independent implementation. The PPL explosion @TheTom reported in his fork was traced to 7 kernel-level bugs (see his turbo4-resurrection paper), not head_dim misdetection. Our fork does not share these bugs.

Scientist Name Transliteration Test ("Pauli Test")

Korean has an official national standard (외래어 표기법) for transliterating foreign names — exactly ONE correct answer per name. This makes it an extremely sensitive test for attention precision.

Config: tbqp3/tbq3 (5.2x compression)

Scientist	Korean Output	Standard	Result
Wolfgang Pauli	볼프강 파울리	볼프강 파울리	✅
Niels Bohr	니엘스 보어	닐스 보어	⚠️
Erwin Schrödinger	에르빈 슈뢰딩거	에르빈 슈뢰딩거	✅
Werner Heisenberg	베르너 하이젠베르크	베르너 하이젠베르크	✅
Max Planck	막스 플랑크	막스 플랑크	✅
Enrico Fermi	에리코 페르미	엔리코 페르미	⚠️

4/6 exact, 2/6 minor variant. The two ⚠️ cases (니엘스→닐스, 에리코→엔리코) are commonly used alternative transliterations but not the official Korean standard (외래어 표기법). This is a base model characteristic (Qwen3.5 is a Chinese-origin LLM with less exposure to Korean linguistic standards), not a TurboQuant degradation — the F16 baseline produces identical output. Sharp attention peaks are fully preserved at 5.2x compression.

Release: https://github.com/AmesianX/TurboQuant/releases/tag/v1.3.0

3 replies

slashedstar Apr 1, 2026

Is the ""////" output on Qwen3.5-27B-UD" about the model getting stuck in a repetitive loop? Because I get that too on Win 11 RTX 4090 with

.\llama-server.exe -m "D:\models\llm\Qwen3.5-27B-IQ4_XS.gguf" --cache-type-k tbqp3 --cache-type-v tbq3 --no-mmap

Sometimes its "!!!!!!!!!!!!!!!!" but mostly "???????????????", same with OpenReasoning-Nemotron-32B-IQ4_XS; Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL oddly stops mid generation, shisa-v2.1-unphi4-14b_Q8_0 and EXAONE-4.0-32B.i1-IQ4_XS.gguf seemed to work fine,

AmesianX Apr 1, 2026

@slashedstar Could you try adding --flash-attn on to your command and see if the issue persists? TurboQuant KV cache dequantization runs inside the flash attention kernel, so it needs to be enabled. Example:

.\llama-server.exe -m model.gguf --cache-type-k tbqp3 --cache-type-v tbq3 --flash-attn on --no-mmap

Let us know if the repetitive output still happens with flash attention on.

slashedstar Apr 1, 2026

@slashedstar Could you try adding --flash-attn on to your command and see if the issue persists? TurboQuant KV cache dequantization runs inside the flash attention kernel, so it needs to be enabled. Example:
.\llama-server.exe -m model.gguf --cache-type-k tbqp3 --cache-type-v tbq3 --flash-attn on --no-mmap
Let us know if the repetitive output still happens with flash attention on.

Yeah same behaviour with flash-attn on, even without the flag I believe it was auto-detecting it and enabling:
llama_context: flash_attn = auto
sched_reserve: Flash Attention was auto, set to enabled
with it explicitly set to off it doesn't even work, with K quantization only it exits with no error during the empty run warmup and with V quantization it says flash is required and fails to create context.

sjoerdmaessen · 2026-04-01T17:23:34Z

sjoerdmaessen
Apr 1, 2026

Test on Ada Lovelace (sm_89) + Large MoE (122B) Results

Tested TurboQuant on hardware and model size not yet covered by the community it seems. Built from @TheTom's fork (feature/turboquant-kv-cache branch) with -DCMAKE_CUDA_ARCHITECTURES=89.

Model: Qwen3.5-122B-A10B Q5_K_S (Unsloth imatrix, 86.4GB, 3 split GGUFs)
Hardware: 2x NVIDIA L40S 48GB (sm_89, Ada Lovelace), AMD EPYC 9354P, 512GB DDR5
head_dim: 256 (TBQP uses QJL - relevant for head_dim validation)
Architecture: Hybrid MoE - 12 attention layers (GQA) + 36 recurrent layers (Gated DeltaNet), 256 experts, 10B active/token

1. KV Cache Size (1x82K context, 12 attn layers)

Type	Total KV	K	V	vs q8_0
q8_0	1,045.50 MiB	522.75	522.75	baseline
turbo4	522.75 MiB	261.38	261.38	2.00x
turbo3	384.38 MiB	192.19	192.19	2.72x

Only 12 of 48 layers use KV cache (the rest are recurrent), so absolute sizes are smaller than a pure transformer, but compression ratios hold per-layer.

2. Decode Performance (2x L40S, 82K ctx)

Config	PP (tok/s)	TG (tok/s)	vs q8_0 TG
q8_0, 1x82K	258.5	61.1	baseline
turbo3, 1x82K	310.8	57.2	-6.4%
turbo4, 1x82K	216.5	56.9	-6.9%
turbo3, 2x82K, single req	219.5	57.3	-6.2%
turbo3, 2x82K, 2 concurrent	160-165 each	44.0 each (88 total)	+44% throughput

I used the exact same prompt for all tests.
Settings used: max_tokens=512, temp=0.6. It seemst the TG penalty is around 6% for turbo3 and turbo4. PP is faster with turbo3 (less cache data). Concurrent dual-slot has 1.44x aggregate throughput.

3. Dual-Slot 2x82K — The Unlock

Our production config (q8_0, 1x82K + mmproj) left only 186 MiB free the first GPU. turbo3 changes this:

Config	KV Cache	RS State	GPU0 Free	GPU1 Free	Status
q8_0, 1x82K + mmproj	1,045 MiB	149 MiB	266 MiB	3,576 MiB	production
turbo3, 2x82K, no mmproj	769 MiB	298 MiB	658 MiB	2,872 MiB	works
turbo3, 2x82K + mmproj	769 MiB	298 MiB	—	—	OOM (mmproj needs 867 MiB on GPU0)

TurboQuant rotation matrices initialized correctly at 128x128. Both slots verified with concurrent requests, no cross-contamination.

4. VRAM Summary

Config	GPU0 Used/Total	GPU1 Used/Total
q8_0, 1x82K + mmproj	45,193 / 46,068	41,882 / 46,068
turbo3, 2x82K	44,801 / 46,068	42,586 / 46,068
turbo4, 1x82K	44,599 / 46,068	42,392 / 46,068

Notes

sm_89 (Ada Lovelace): confirmed working - turbo2/3/4 all available, no arch-specific issues
122B MoE with 256 experts: my main model tested with TurboQuant
head_dim=256: rotation matrices init correctly, no issues observed
Correctness spot-checked (factual queries, counting, primes) - all correct
Did not run llama-perplexity - I couldnt run this, the model is too large for quick PPL runs alongside production workload
Cross-GPU validated: KV buffers split correctly across both L40S GPUs (384.50 + 384.38 MiB for turbo3 2x82K)
Special note on multimodal: dual-slot 2x82K + mmproj (vision projector, ~867 mb) fails with OOM on the first GPU, only 658 MiB free after KV+RS allocation. The mmproj always allocates on GPU0. Our production baseline (q8_0, 1x82K + mmproj) works fine. The OOM only occurs when combining dual-slot turbo3 with mmproj.

Happy to run additional tests if useful, jsut let me know, for example specific context lengths, turbo2 results or other stuff.

9 replies

TheTom Apr 1, 2026

This is exactly what we needed. V compression free all the way down to 2-bit, K-side is the only source of decode penalty. Clean confirmation across turbo2/3/4. The whole family is here!

You are right that the 29 MiB gap between turbo2 and turbo3 is small on this model because only 12/48 layers use KV cache. On a pure transformer (all layers have KV), the gap scales linearly with layer count. A 27B dense model at 82K context would show a much larger absolute savings from turbo2 vs turbo3.

Thank you for running all of these. This is some of the most valuable production data the project has received. Hope you enjoy the extra headroom. More to come.

sjoerdmaessen Apr 1, 2026

Just one more follow-up: MTMD_BACKEND_DEVICE=CUDA1 was the missing ingredient for my setup.

After your turbo2 suggestion I dug into the mmproj OOM and discovered this env var (in PR #14236). Moving the vision part from GPU0 to GPU1 frees ~870 MiB on GPU0 which removed the entire bottleneck.

Result: 2x128K dual-slot + mmproj + vision

Config	Ctx/slot	Slots	Vision	TG (single)	TG (concurrent)	GPU0 free	GPU1 free
Previous (q8_0/q8_0)	82K	1	yes	61.1	—	266 MiB	3,576 MiB
q8_0-K/turbo3-V + CUDA1 mmproj	128K	2	yes	62.2	45+45=90	298 MiB	1,612 MiB

From 1x82K single-slot to 2x128K dual-slot. Full model native context on both slots. Vision works, no speed penalty. This is now my new production config.

Context scaling with MTMD_BACKEND_DEVICE=CUDA1

Ctx/slot	GPU0 free	GPU1 free	Status
2x80K	1,150 MiB	2,222 MiB	comfortable
2x92K	934 MiB	2,068 MiB	comfortable
2x104K	722 MiB	1,916 MiB	fine
2x120K	438 MiB	1,712 MiB	fine
2x128K	298 MiB	1,612 MiB	production (comparable to previous 1x82K headroom)

Asymmetric q8_0/turbo2 (your earlier request)

Config	KV Size	PP (t/s)	TG (t/s)	vs q8_0 TG
q8_0 / q8_0	1,046 MiB	258.5	61.1	—
q8_0 / turbo3	715 MiB	294.5	61.1	0.0%
q8_0 / turbo2	686 MiB	295.9	61.3	+0.3%

I went with q8_0-K/turbo3-V for production (slightly more conservative) but turbo2-V would work equally well it seems.

Key insight

The real unlock was this in combination with the asymmetric KV and that MTMD_BACKEND_DEVICE=CUDA1 env setting. Together this now took me from 1x82K to 2x128K with no compromises.

TheTom Apr 1, 2026

Good to see. From 1x82K single-slot to 2x128K dual-slot with vision, no speed penalty, no quality compromise. Full native context on both slots.

The MTMD_BACKEND_DEVICE=CUDA1 find is a great catch. That 19 MiB miss from the previous test was bugging me and you found the real fix. Moving mmproj to GPU1 is the kind of system-level thinking that turns benchmarks into production configs.

Thank you for documenting the full context scaling table. Seeing the headroom at each step from 2x80K to 2x128K gives real confidence in the stability of this setup.

Going to reference this in our asymmetric K/V paper as the strongest production validation to date. 122B, dual L40S, 2x128K, vision, zero decode penalty.

Thanks for the finding I'll update my documents.

sjoerdmaessen Apr 2, 2026

Correction: asymmetric q8_0-K / turbo3-V produces corrupt output

When deploying the asymmetric config to production (Charles, our Matrix chatbot), we noticed empty/garbled responses. Investigation revealed the model outputs literal U+003F (?) characters, not a rendering issue, actual ASCII question marks. This affects both direct llama-server requests and requests through LiteLLM.

The speed measurements I reported earlier are accurate (61.1 t/s, zero TG penalty) the model generates tokens at full speed, they're just wrong tokens. I only caught this when checking actual content in production because my latest benchmarks were speed-only and maybe a bit rushed because of excitement.

Symmetric turbo3/turbo3 works correctly. Same fork, same binary, switching from -ctk q8_0 -ctv turbo3 to -ctk turbo3 -ctv turbo3 immediately produces coherent output. Verified with factual queries, counting, reasoning in Dutch, all correct actual content.

This suggests the K and V turbo3 quantization paths need to be paired, mixing q8_0 K with turbo3 V breaks the attention computation, possibly because the rotation matrices are only applied to turbo3-quantized tensors.

What we're running now (production, verified working)

Config	Ctx/slot	Slots	Vision	TG	Content
~~q8_0-K / turbo3-V~~	—	—	—	61.1	CORRUPT
~~q8_0-K / turbo2-V~~	—	—	—	61.3	not verified, likely same issue
turbo3-K / turbo3-V	104K	2	yes	58	correct

The asymmetric config also has a much larger compute buffer on GPU0 for turbo3-K (2074 MiB vs 858 MiB for q8_0-K), which limits max context with symmetric turbo3 compared to asymmetric. The max / sweet spot now is 2x104K with MTMD_BACKEND_DEVICE=CUDA1 for the mmproj.

Updated production config

--cache-type-k turbo3 --cache-type-v turbo3
--ctx-size 212992 (2x104K)
--parallel 2
MTMD_BACKEND_DEVICE=CUDA1

2x104K dual-slot, vision on GPU1, cron jobs verified, interactive chat verified. 58 t/s single, ~6% slower than q8_0 baseline. Very happy with this as a production config nonetheless.

Sorry for the incorrect asymmetric data, lesson learned to always verify output content, not just generation speed.

TheTom Apr 2, 2026

Thank you for catching this.

I have been investigating this tonight and opened issue #47 to track it. After tracing through the code paths, I found two concrete bugs that affect asymmetric configs:

1. Rotation matrix initialization (llama-kv-cache.cpp:332) — The turbo rotation matrices and InnerQ scale tensors were only allocated when K is a turbo type. With -ctk q8_0 -ctv turbo3, these tensors were never created. Fixed to check both K and V types.

2. V unpad gated on K type (llama-graph.cpp, 3 instances) — Three V unpad blocks that extract the original head_dim after inverse WHT were checking k->type instead of v->type. With asymmetric q8_0-K, these blocks were skipped even when V is turbo and needs processing. This affects MHA, MLA, and ISWA attention paths.

Honest assessment: your model has head_dim=256 which is already 128-aligned, so the unpad is a no-op and bug #2 alone would not explain the corruption. Bug #1 ensures the rotation infrastructure exists, which may be the missing piece. Every other successful asymmetric test to date has been on head_dim=128 models, so yours is the first to exercise this combination. There may be additional issues I have not found yet.

I also added diagnostic logging to the fix so we can see exactly what code paths are being taken. When you run with the fix branch you should see lines like:

[DIAG #47] turbo rotation matrices allocated (K=q8_0, V=turbo3_0, asymmetric=YES)
[DIAG #47] build_attn il=0: K=q8_0 V=turbo3_0 asymmetric=YES Q_prerot=SKIP innerq_scale=allocated head_dim_v=256

Fix branch: fix/asymmetric-rotation-init

git fetch origin fix/asymmetric-rotation-init
git checkout fix/asymmetric-rotation-init

If it does not resolve the corruption, please paste back the [DIAG #47] log lines and any other startup output — that will tell me which paths are active and help narrow down where the issue is.

It is late here so I may not respond until morning, but wanted to get this in your hands as soon as possible.

nenkoru · 2026-04-01T18:37:44Z

nenkoru
Apr 1, 2026

Tested turboquant fork from @Madreag at 1766c91 on two vGPUs from Nvidia V100. GRID V100DX-32Q & GRID V100DX-8Q
In the following gist llama-bench output from unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL.
https://gist.github.com/nenkoru/890ccb088d62b4f59cc9c95b297fe69b

Tested the following combos:

turbo4/turbo4
turbo3/turbo3
turbo2/turbo2
turbo1.5/turbo1.5
turbo3/turbo4

0 replies

fritolays · 2026-04-01T23:26:57Z

fritolays
Apr 1, 2026

TurboQuant v1.3.0 - Model Sweep

@AmesianX here is the new set of tests, thanks again for the fast updates.
I've added a few more obscure models this time around.
Though they are already at max context for the most part.
But they do work.

Environment

OS: Windows 11 x64
GPU: NVIDIA RTX 4070 Ti 16GB, CUDA 12.8
CPU: AMD Ryzen 5900x (12c/24t), 128GB DDR4
Build: TurboQuant v1.3.0
Benchmark: 2048 token prompt, 512 token generation cutoff, averages of 3 runs
llama-cli args: --parallel 1 -t 12 -tb 24 -ngl 99 --verbosity 3 --simple-io --single-turn

Findings

v1.3.0 Head_dim Detection

The new 5-level priority cascade (P1→P5) resolves all previous auto-detection failures. Phi-4 and DeepSeek, which failed in v1.2.0, now work correctly. GLM-4.7-Flash (head_dim=576) is correctly detected but unsupported, falling back to f16.

Qwen3.5 Turbo V Quality Issue

Qwen3.5 models output only question marks when using turbo types on both K and V. Tested on 27B Q3_K_M and confirmed on 9B Q8_0 — issue is independent of weight quant level. Turbo K + q8_0 V works correctly. Other Qwen models (Qwen2.5-14B) are unaffected.

Context vs Speed Tradeoff

Analysed across models where TurboQuant provides meaningful context gains beyond q4_0 (DeepSeek-R1-14B, Gemma-3-27B, Magistral-Small-2509, Magistry-24B, Qwen3.5-27B, Llama3.2-24B-A3B):

Best context: tbq3_0/tbq3_0 delivers ~3.8–7.9x context over f16 baseline. Prompt processing drops ~50–75%, generation drops ~5–15%.

Best balance: tbq4_0/q8_0 delivers ~1.9–3.6x context with the least speed impact. Generation stays at 95–99% of baseline (or faster on memory-bandwidth-bottlenecked models), prompt processing retains 63–89% of baseline.

Architecture Summary

Model	Architecture	head_dim	Max Context	f16 Context	Best Turbo Context	Best Multiplier	Status
Models with meaningful TurboQuant gains
Gemma-3-27b-it-heretic-v1.2.i1-IQ3_M	Gemma	128	131,072	12,032	95,488	~7.94x	✅ Working
DeepSeek-R1-14B-Finetuned.i1-Q6_K	DeepSeek	128	131,072	14,592	74,240	~5.09x	✅ Working
Qwen3.5-27B-heretic-v3.i1-Q3_K_M	Qwen	256	262,144	17,408	84,736	~4.87x	⚠️ Turbo V outputs question marks — use turbo K + q8_0 V only
Llama3.2-24B-A3B-II-Dark-Champion.i1-Q4_K_M	Llama (MoE)	128	131,072	29,184	131,072	~4.49x	✅ Working — hits model max
Magistral-Small-2509-Heretic-v1.2.i1-Q4_K_M	Mistral	128	131,072	4,096	15,616	~3.81x	✅ Working
Magistry-24B-v1.1.i1-Q4_K_M	Mistral	128	131,072	4,096	15,616	~3.81x	✅ Working
Models capped at native max context (q4_0 already reaches cap)
Phi-4-reasoning-heretic.i1-Q6_K	Phi	128	32,768	14,080	32,768	~2.33x	✅ Capped — q4_0 also caps
Qwen2.5-14B-Instruct-Heretic.i1-Q6_K	Qwen	128	32,768	14,592	32,768	~2.25x	✅ Capped — q4_0 also caps
Gemma-3-12b-it-heretic.i1-Q6_K	Gemma	128	131,072	61,696	131,072	~2.12x	✅ Capped — q4_0 also caps
Llama-3.1-8B-Instruct-Heretic.i1-Q6_K	Llama	128	131,072	62,720	131,072	~2.09x	✅ Capped — q4_0 also caps
gpt-oss-20b-heretic-ara-v3.i1-MXFP4_MOE	gpt-oss	128	131,072	114,432	131,072	~1.15x	✅ Marginal — nearly at max with f16
Models already at max context with f16 — no benefit from TurboQuant
Falcon3-7B-Instruct-Heretic-v2.i1-Q6_K	Falcon	128	32,768	32,768	32,768	~1x	⬚ No headroom
Yi-1.5-9B-Chat-16K-abliterated.i1-Q6_K	Yi	128	16,384	16,384	16,384	~1x	⬚ No headroom
c4ai-command-r7b-12-2024-abliterated.i1-Q6_K	Command-R	256	8,192	8,192	8,192	~1x	⬚ No headroom
Unsupported
GLM-4.7-Flash-heretic-ayun.i1-IQ3_M	GLM/gpt-oss	576	202,752	24,320	24,320	~1x	❌ head_dim=576 unsupported — falls back to f16

Test Results

Click to view, caution its extensive...

Model (Notes)	Cache Type	Context	Context Multiple	Avg Prompt t/s	Avg Generation t/s
c4ai-command-r7b-12-2024-abliterated.i1-Q6_K	-ctk f16 -ctv f16	8192	Baseline	3300.4	77.7
	-ctk q8_0 -ctv q8_0	8192	~1x	3261.6	78.3
	-ctk q4_0 -ctv q4_0	8192	~1x	3291.2	77.5
	-ctk tbqp4_0 -ctv tbq4_0	8192	~1x	1139.3	65.9
	-ctk tbqp4_0 -ctv tbq3_0	8192	~1x	1126.4	65.2
	-ctk tbqp4_0 -ctv q8_0	8192	~1x	1415.3	70.4
	-ctk tbq4_0 -ctv tbq4_0	8192	~1x	1496.3	69.7
	-ctk tbq4_0 -ctv tbq3_0	8192	~1x	1299.8	69.8
	-ctk tbq4_0 -ctv q8_0	8192	~1x	2115.7	73.8
	-ctk tbqp3_0 -ctv tbq3_0	8192	~1x	1332.1	66.5
	-ctk tbqp3_0 -ctv tbq4_0	8192	~1x	1404.2	67.9
	-ctk tbqp3_0 -ctv q8_0	8192	~1x	1675.5	72.2
	-ctk tbq3_0 -ctv tbq3_0	8192	~1x	1062	67.7
	-ctk tbq3_0 -ctv tbq4_0	8192	~1x	1462	69.6
	-ctk tbq3_0 -ctv q8_0	8192	~1x	1649.3	73.3
DeepSeek-R1-14B-Finetuned.i1-Q6_K	-ctk f16 -ctv f16	14592	Baseline	1861.9	43.5
	-ctk q8_0 -ctv q8_0	27392	~1.88x	1832.9	45.1
	-ctk q4_0 -ctv q4_0	51712	~3.54x	1804.3	44.7
	-ctk tbqp4_0 -ctv tbq4_0	55552	~3.81x	635.1	38.2
	-ctk tbqp4_0 -ctv tbq3_0	62976	~4.32x	624.3	37.8
	-ctk tbqp4_0 -ctv q8_0	36608	~2.51x	789.3	40.7
	-ctk tbq4_0 -ctv tbq4_0	56320	~3.86x	830.3	41
	-ctk tbq4_0 -ctv tbq3_0	64256	~4.4x	716.7	40.3
	-ctk tbq4_0 -ctv q8_0	37120	~2.54x	1165.6	42.9
	-ctk tbqp3_0 -ctv tbq3_0	72704	~4.98x	737.4	38.4
	-ctk tbqp3_0 -ctv tbq4_0	62976	~4.32x	777	39.6
	-ctk tbqp3_0 -ctv q8_0	39680	~2.72x	933.6	41.9
	-ctk tbq3_0 -ctv tbq3_0	74240	~5.09x	593.3	39.5
	-ctk tbq3_0 -ctv tbq4_0	64256	~4.4x	809.9	40.3
	-ctk tbq3_0 -ctv q8_0	40192	~2.75x	923.1	42.2
Falcon3-7B-Instruct-Heretic-v2.i1-Q6_K	-ctk f16 -ctv f16	32768	Baseline	3731.3	82.8
	-ctk q8_0 -ctv q8_0	32768	~1x	3635.4	89.2
	-ctk q4_0 -ctv q4_0	32768	~1x	3647.3	88.7
	-ctk tbqp4_0 -ctv tbq4_0	32768	~1x	1355.5	64.6
	-ctk tbqp4_0 -ctv tbq3_0	32768	~1x	1069.4	62.6
	-ctk tbqp4_0 -ctv q8_0	32768	~1x	1440	69.7
	-ctk tbq4_0 -ctv tbq4_0	32768	~1x	2145.9	74.7
	-ctk tbq4_0 -ctv tbq3_0	32768	~1x	1558.7	71
	-ctk tbq4_0 -ctv q8_0	32768	~1x	2662.8	81.7
	-ctk tbqp3_0 -ctv tbq3_0	32768	~1x	1261.8	65.4
	-ctk tbqp3_0 -ctv tbq4_0	32768	~1x	1621.5	68.1
	-ctk tbqp3_0 -ctv q8_0	32768	~1x	1724.6	73
	-ctk tbq3_0 -ctv tbq3_0	32768	~1x	1268.1	69.4
	-ctk tbq3_0 -ctv tbq4_0	32768	~1x	1640.4	71.6
	-ctk tbq3_0 -ctv q8_0	32768	~1x	1770.7	77.8
gemma-3-12b-it-heretic.i1-Q6_K	-ctk f16 -ctv f16	61696	Baseline	1933.2	51.1
	-ctk q8_0 -ctv q8_0	122880	~1.99x	2016.6	51.4
	-ctk q4_0 -ctv q4_0	131072	~2.12x	2119.6	51.3
	-ctk tbqp4_0 -ctv tbq4_0	131072	~2.12x	732	37.3
	-ctk tbqp4_0 -ctv tbq3_0	131072	~2.12x	568.7	36.2
	-ctk tbqp4_0 -ctv q8_0	131072	~2.12x	769.1	39.4
	-ctk tbq4_0 -ctv tbq4_0	131072	~2.12x	1207.2	43.4
	-ctk tbq4_0 -ctv tbq3_0	131072	~2.12x	813.8	41.1
	-ctk tbq4_0 -ctv q8_0	131072	~2.12x	1505.6	46.2
	-ctk tbqp3_0 -ctv tbq3_0	131072	~2.12x	680.1	38.3
	-ctk tbqp3_0 -ctv tbq4_0	131072	~2.12x	893.7	39.6
	-ctk tbqp3_0 -ctv q8_0	131072	~2.12x	937.9	41.8
	-ctk tbq3_0 -ctv tbq3_0	131072	~2.12x	686.4	40.5
	-ctk tbq3_0 -ctv tbq4_0	131072	~2.12x	900	41.6
	-ctk tbq3_0 -ctv q8_0	131072	~2.12x	962.6	44.4
gemma-3-27b-it-heretic-v1.2.i1-IQ3_M	-ctk f16 -ctv f16	12032	Baseline	1006.6	35.4
	-ctk q8_0 -ctv q8_0	29952	~2.49x	1039.9	35.8
	-ctk q4_0 -ctv q4_0	63744	~5.3x	1078.3	35.9
	-ctk tbqp4_0 -ctv tbq4_0	69120	~5.74x	519.7	32.4
	-ctk tbqp4_0 -ctv tbq3_0	79616	~6.62x	522	32.5
	-ctk tbqp4_0 -ctv q8_0	42752	~3.55x	616.9	33.5
	-ctk tbq4_0 -ctv tbq4_0	70400	~5.85x	657.2	34
	-ctk tbq4_0 -ctv tbq3_0	81152	~6.74x	585.1	33.9
	-ctk tbq4_0 -ctv q8_0	43264	~3.6x	813.6	34.2
	-ctk tbqp3_0 -ctv tbq3_0	93440	~7.77x	561.9	28.3
	-ctk tbqp3_0 -ctv tbq4_0	79616	~6.62x	615.4	33.5
	-ctk tbqp3_0 -ctv q8_0	47104	~3.91x	684	34.2
	-ctk tbq3_0 -ctv tbq3_0	95488	~7.94x	496.6	33.5
	-ctk tbq3_0 -ctv tbq4_0	81152	~6.74x	576.9	33.8
	-ctk tbq3_0 -ctv q8_0	47616	~3.96x	689.2	34.3
GLM-4.7-Flash-ultimate-uncensored-heretic-ayun.i1-IQ3_M	-ctk f16 -ctv f16	24320	Baseline	751.9	107.8
	-ctk q8_0 -ctv q8_0	44288	~1.82x	726.8	104.1
	-ctk q4_0 -ctv q4_0	79616	~3.27x	630.2	103.7
(head_dim 576 issue)	-ctk tbqp4_0 -ctv tbq4_0	24320	~1x	707.3	104
(head_dim 576 issue)	-ctk tbqp4_0 -ctv tbq3_0	24320	~1x	669.3	104.6
(head_dim 576 issue)	-ctk tbqp4_0 -ctv q8_0	24320	~1x	677	104.1
(head_dim 576 issue)	-ctk tbq4_0 -ctv tbq4_0	24320	~1x	629.5	103.9
(head_dim 576 issue)	-ctk tbq4_0 -ctv tbq3_0	24320	~1x	743.9	104.5
(head_dim 576 issue)	-ctk tbq4_0 -ctv q8_0	24320	~1x	625.8	103.9
(head_dim 576 issue)	-ctk tbqp3_0 -ctv tbq3_0	24320	~1x	630.3	104
(head_dim 576 issue)	-ctk tbqp3_0 -ctv tbq4_0	24320	~1x	627.1	103.9
(head_dim 576 issue)	-ctk tbqp3_0 -ctv q8_0	24320	~1x	691.5	104.2
(head_dim 576 issue)	-ctk tbq3_0 -ctv tbq3_0	24320	~1x	762.4	104.2
(head_dim 576 issue)	-ctk tbq3_0 -ctv tbq4_0	24320	~1x	630.7	103.9
(head_dim 576 issue)	-ctk tbq3_0 -ctv q8_0	24320	~1x	631.7	98.8
gpt-oss-20b-heretic-ara-v3.i1-MXFP4_MOE	-ctk f16 -ctv f16	114432	Baseline	2332.2	160.9
	-ctk q8_0 -ctv q8_0	131072	~1.15x	2269	156.6
	-ctk q4_0 -ctv q4_0	131072	~1.15x	2298	157.2
	-ctk tbqp4_0 -ctv tbq4_0	131072	~1.15x	1939.7	149.7
	-ctk tbqp4_0 -ctv tbq3_0	131072	~1.15x	1869.9	153.3
	-ctk tbqp4_0 -ctv q8_0	131072	~1.15x	2448.1	157.4
	-ctk tbq4_0 -ctv tbq4_0	131072	~1.15x	1940.4	149.7
	-ctk tbq4_0 -ctv tbq3_0	131072	~1.15x	1875	153.6
	-ctk tbq4_0 -ctv q8_0	131072	~1.15x	2282.5	156.1
	-ctk tbqp3_0 -ctv tbq3_0	131072	~1.15x	1912.6	152.4
	-ctk tbqp3_0 -ctv tbq4_0	131072	~1.15x	1954.2	151.4
	-ctk tbqp3_0 -ctv q8_0	131072	~1.15x	2532.6	160.7
	-ctk tbq3_0 -ctv tbq3_0	131072	~1.15x	1897.9	154.6
	-ctk tbq3_0 -ctv tbq4_0	131072	~1.15x	1965.3	151.4
	-ctk tbq3_0 -ctv q8_0	131072	~1.15x	2536.2	158.2
Llama-3.1-8B-Instruct-Heretic.i1-Q6_K	-ctk f16 -ctv f16	62720	Baseline	3368.6	83.4
	-ctk q8_0 -ctv q8_0	117248	~1.87x	3305.2	82.6
	-ctk q4_0 -ctv q4_0	131072	~2.09x	3341.1	83.2
	-ctk tbqp4_0 -ctv tbq4_0	131072	~2.09x	1195.5	69.9
	-ctk tbqp4_0 -ctv tbq3_0	131072	~2.09x	1174.9	69.3
	-ctk tbqp4_0 -ctv q8_0	131072	~2.09x	1484.8	74.6
	-ctk tbq4_0 -ctv tbq4_0	131072	~2.09x	1558.2	74.1
	-ctk tbq4_0 -ctv tbq3_0	131072	~2.09x	1357.8	74.1
	-ctk tbq4_0 -ctv q8_0	131072	~2.09x	2211.2	78.7
	-ctk tbqp3_0 -ctv tbq3_0	131072	~2.09x	1391.8	70.3
	-ctk tbqp3_0 -ctv tbq4_0	131072	~2.09x	1466.2	72.1
	-ctk tbqp3_0 -ctv q8_0	131072	~2.09x	1745.4	77.1
	-ctk tbq3_0 -ctv tbq3_0	131072	~2.09x	1111.5	72.5
	-ctk tbq3_0 -ctv tbq4_0	131072	~2.09x	1528.1	73.6
	-ctk tbq3_0 -ctv q8_0	131072	~2.09x	1726.6	77.4
Llama3.2-24B-A3B-II-Dark-Champion-INSTRUCT-Heretic-Abliterated-Uncensored.i1-Q4_K_M	-ctk f16 -ctv f16	29184	Baseline	2721.9	167.7
	-ctk q8_0 -ctv q8_0	54528	~1.87x	2699.3	181.7
	-ctk q4_0 -ctv q4_0	101120	~3.46x	2616.9	176.7
	-ctk tbqp4_0 -ctv tbq4_0	108544	~3.72x	1348.7	135.7
	-ctk tbqp4_0 -ctv tbq3_0	122624	~4.2x	1371.7	134.5
	-ctk tbqp4_0 -ctv q8_0	72192	~2.47x	1554.5	149.7
	-ctk tbq4_0 -ctv tbq4_0	110080	~3.77x	1620.3	150.2
	-ctk tbq4_0 -ctv tbq3_0	124672	~4.27x	1512.5	149.6
	-ctk tbq4_0 -ctv q8_0	72960	~2.5x	2123.5	162.8
	-ctk tbqp3_0 -ctv tbq3_0	131072	~4.49x	1540	135.7
	-ctk tbqp3_0 -ctv tbq4_0	122624	~4.2x	1593.1	141.1
	-ctk tbqp3_0 -ctv q8_0	78080	~2.68x	1784	158.7
	-ctk tbq3_0 -ctv tbq3_0	131072	~4.49x	1305	142.8
	-ctk tbq3_0 -ctv tbq4_0	124672	~4.27x	1655.8	147.3
	-ctk tbq3_0 -ctv q8_0	78848	~2.7x	1807.5	160.8
Magistral-Small-2509-Heretic-v1.2.i1-Q4_K_M	-ctk f16 -ctv f16	4096	Baseline	1255.8	30.8
	-ctk q8_0 -ctv q8_0	5888	~1.44x	1357.1	39.8
	-ctk q4_0 -ctv q4_0	11008	~2.69x	1362.3	39.1
	-ctk tbqp4_0 -ctv tbq4_0	11776	~2.88x	665.5	33.4
	-ctk tbqp4_0 -ctv tbq3_0	13312	~3.25x	682.4	35.3
	-ctk tbqp4_0 -ctv q8_0	7680	~1.88x	808.1	37.2
	-ctk tbq4_0 -ctv tbq4_0	12032	~2.94x	845.7	37.1
	-ctk tbq4_0 -ctv tbq3_0	13568	~3.31x	766.4	36.8
	-ctk tbq4_0 -ctv q8_0	7936	~1.94x	1064.8	38.4
	-ctk tbqp3_0 -ctv tbq3_0	15360	~3.75x	744.3	31.7
	-ctk tbqp3_0 -ctv tbq4_0	13312	~3.25x	701.5	33.8
	-ctk tbqp3_0 -ctv q8_0	8448	~2.06x	562.2	36
	-ctk tbq3_0 -ctv tbq3_0	15616	~3.81x	288	36.1
	-ctk tbq3_0 -ctv tbq4_0	13568	~3.31x	741.1	35.2
	-ctk tbq3_0 -ctv q8_0	8448	~2.06x	895.5	38.1
Magistry-24B-v1.1.i1-Q4_K_M	-ctk f16 -ctv f16	4096	Baseline	1161.9	29.5
	-ctk q8_0 -ctv q8_0	5888	~1.44x	1293.3	39.9
	-ctk q4_0 -ctv q4_0	11008	~2.69x	1275.9	39.7
	-ctk tbqp4_0 -ctv tbq4_0	11776	~2.88x	589	28.1
	-ctk tbqp4_0 -ctv tbq3_0	13312	~3.25x	657.2	31.4
	-ctk tbqp4_0 -ctv q8_0	7680	~1.88x	663.8	35
	-ctk tbq4_0 -ctv tbq4_0	12032	~2.94x	693.8	36.7
	-ctk tbq4_0 -ctv tbq3_0	13568	~3.31x	747.8	37.1
	-ctk tbq4_0 -ctv q8_0	7936	~1.94x	1035.8	38
	-ctk tbqp3_0 -ctv tbq3_0	15360	~3.75x	758.5	35.6
	-ctk tbqp3_0 -ctv tbq4_0	13312	~3.25x	773.2	35.2
	-ctk tbqp3_0 -ctv q8_0	8448	~2.06x	568.6	37.5
	-ctk tbq3_0 -ctv tbq3_0	15616	~3.81x	285.5	36.1
	-ctk tbq3_0 -ctv tbq4_0	13568	~3.31x	748.2	36.8
	-ctk tbq3_0 -ctv q8_0	8448	~2.06x	884.7	38
Phi-4-reasoning-heretic.i1-Q6_K	-ctk f16 -ctv f16	14080	Baseline	2003.2	46.3
	-ctk q8_0 -ctv q8_0	26624	~1.89x	1949.2	45.9
	-ctk q4_0 -ctv q4_0	32768	~2.33x	1979.7	45.8
	-ctk tbqp4_0 -ctv tbq4_0	32768	~2.33x	713	39.7
	-ctk tbqp4_0 -ctv tbq3_0	32768	~2.33x	702.2	39
	-ctk tbqp4_0 -ctv q8_0	32768	~2.33x	853.4	41.3
	-ctk tbq4_0 -ctv tbq4_0	32768	~2.33x	919.1	41.4
	-ctk tbq4_0 -ctv tbq3_0	32768	~2.33x	813.1	40.9
	-ctk tbq4_0 -ctv q8_0	32768	~2.33x	1236.2	43.4
	-ctk tbqp3_0 -ctv tbq3_0	32768	~2.33x	802.8	39.6
	-ctk tbqp3_0 -ctv tbq4_0	32768	~2.33x	863.6	40.2
	-ctk tbqp3_0 -ctv q8_0	32768	~2.33x	1004.2	42.6
	-ctk tbq3_0 -ctv tbq3_0	32768	~2.33x	652.5	40.2
	-ctk tbq3_0 -ctv tbq4_0	32768	~2.33x	884.8	41.2
	-ctk tbq3_0 -ctv q8_0	32768	~2.33x	987	42.9
Qwen2.5-14B-Instruct-Heretic.i1-Q6_K	-ctk f16 -ctv f16	14592	Baseline	1845	43
	-ctk q8_0 -ctv q8_0	27904	~1.91x	1803	43.8
	-ctk q4_0 -ctv q4_0	32768	~2.25x	1855.9	44.6
	-ctk tbqp4_0 -ctv tbq4_0	32768	~2.25x	631.2	37.8
	-ctk tbqp4_0 -ctv tbq3_0	32768	~2.25x	617.6	37.5
	-ctk tbqp4_0 -ctv q8_0	32768	~2.25x	789.7	40.5
	-ctk tbq4_0 -ctv tbq4_0	32768	~2.25x	833.2	40.7
	-ctk tbq4_0 -ctv tbq3_0	32768	~2.25x	721.9	40.1
	-ctk tbq4_0 -ctv q8_0	32768	~2.25x	1179.8	42.7
	-ctk tbqp3_0 -ctv tbq3_0	32768	~2.25x	742	38.4
	-ctk tbqp3_0 -ctv tbq4_0	32768	~2.25x	782.9	39.4
	-ctk tbqp3_0 -ctv q8_0	32768	~2.25x	929.5	41.6
	-ctk tbq3_0 -ctv tbq3_0	32768	~2.25x	590.4	39.4
	-ctk tbq3_0 -ctv tbq4_0	32768	~2.25x	814.2	40.2
	-ctk tbq3_0 -ctv q8_0	32768	~2.25x	915.4	42
Qwen3.5-27B-heretic-v3.i1-Q3_K_M	-ctk f16 -ctv f16	17408	Baseline	1032.4	32.4
	-ctk q8_0 -ctv q8_0	32512	~1.87x	988.1	32.1
	-ctk q4_0 -ctv q4_0	59392	~3.41x	1014.3	31.6
(Outputs Only Question Marks)	-ctk tbqp4_0 -ctv tbq4_0	65024	~3.74x	641.5	28.7
(Outputs Only Question Marks)	-ctk tbqp4_0 -ctv tbq3_0	73216	~4.21x	441.4	25.3
	-ctk tbqp4_0 -ctv q8_0	43264	~2.49x	670.3	27.9
(Outputs Only Question Marks)	-ctk tbq4_0 -ctv tbq4_0	65280	~3.75x	846.3	29.9
(Outputs Only Question Marks)	-ctk tbq4_0 -ctv tbq3_0	73728	~4.24x	701.4	29.7
	-ctk tbq4_0 -ctv q8_0	43264	~2.49x	922.8	30.8
(Outputs Only Question Marks)	-ctk tbqp3_0 -ctv tbq3_0	83968	~4.82x	619.3	27.7
(Outputs Only Question Marks)	-ctk tbqp3_0 -ctv tbq4_0	73216	~4.21x	743.5	30
	-ctk tbqp3_0 -ctv q8_0	46592	~2.68x	653.9	28.3
(Outputs Only Question Marks)	-ctk tbq3_0 -ctv tbq3_0	84736	~4.87x	478.8	25.6
(Outputs Only Question Marks)	-ctk tbq3_0 -ctv tbq4_0	73728	~4.24x	725.7	28.2
	-ctk tbq3_0 -ctv q8_0	46848	~2.69x	766.2	29.7
Yi-1.5-9B-Chat-16K-abliterated.i1-Q6_K	-ctk f16 -ctv f16	16384	Baseline	2900.5	70.9
	-ctk q8_0 -ctv q8_0	16384	~1x	2863	70
	-ctk q4_0 -ctv q4_0	16384	~1x	2859	69.8
	-ctk tbqp4_0 -ctv tbq4_0	16384	~1x	817	57.1
	-ctk tbqp4_0 -ctv tbq3_0	16384	~1x	799.7	56.5
	-ctk tbqp4_0 -ctv q8_0	16384	~1x	1022.9	62
	-ctk tbq4_0 -ctv tbq4_0	16384	~1x	1097.7	62.2
	-ctk tbq4_0 -ctv tbq3_0	16384	~1x	952.8	61.2
	-ctk tbq4_0 -ctv q8_0	16384	~1x	1627.7	66.2
	-ctk tbqp3_0 -ctv tbq3_0	16384	~1x	966.9	57.4
	-ctk tbqp3_0 -ctv tbq4_0	16384	~1x	1016.2	59.8
	-ctk tbqp3_0 -ctv q8_0	16384	~1x	1242.7	64.2
	-ctk tbq3_0 -ctv tbq3_0	16384	~1x	745.2	59
	-ctk tbq3_0 -ctv tbq4_0	16384	~1x	1063.8	61.7
	-ctk tbq3_0 -ctv q8_0	16384	~1x	1227.9	64.6

3 replies

AmesianX Apr 2, 2026

@fritolays Thank you for this incredible 17-model sweep — 17 models across 11 architectures is exactly the kind of systematic validation we need. Gemma 7.94x, DeepSeek 5.09x on 16GB are great numbers, and confirming P1→P5 fixes Phi-4/DeepSeek detection is very valuable. Your tbq4_0/q8_0 best-balance recommendation is a great finding — will add to README.

Quick question: Did you build from source or use the release binaries? We discovered the release binaries had incorrect CUDA architectures (SM 52 default instead of SM 75-120). We've pulled those and are rebuilding now.

On the Qwen3.5 turbo V "???" issue:

We downloaded the exact same model (Qwen3.5-27B-heretic-v3.i1-Q3_K_M.gguf) and ran with the exact same options (--cache-type-k tbq3_0 --cache-type-v tbq3_0 -ngl 99 --flash-attn on). Output is completely normal on our local build (SM 121 native, DGX Spark GB10):

Hello! I'm doing well, thanks for asking. How about you?

No question marks. This strongly suggests the issue is build-related (SM 52 release binary running on SM 89 via JIT emulation), not a code bug.

On the 50–75% prompt processing drop: Also likely build-related. On our SM 121 native build, tbqp3/tbq3 prompt processing drops only ~15% vs f16 (94→80 t/s). SM 52 JIT emulation causes massive degradation on compute-heavy WHT kernels.

The corrected builds (SM 75/80/86/89/90/120 native) are compiling now — 6 architectures take significantly longer than the previous single-arch build. Will update when ready. Apologies for the build issue.

fritolays Apr 2, 2026

@AmesianX I used turboquant-bin-win-x64-cuda-cu12.8.zip and had supplied my own DLL's.

I've grabbed the new release turboquant-bin-win-x64-cuda-cu12.8.zip + cudart-turboquant-bin-win-cuda-cu12.8-x64.zip.
I manually ran the test and Qwen3.5-27B-heretic-v3.i1-Q3_K_M works fine with both newer and older releases of v1.3.0.
Seems to be an issue with using --single-turn --file "kv-bench_prompt.txt".
Loading a prompt file with this model specifically produces question marks.
--prompt "Hello" Works just fine, I'll adjust my testing script to account for this.

fritolays Apr 2, 2026

@AmesianX Actually after some more testing it seems like something else is going on...
I tried running the same test over and over, and sometimes it does work, maybe 1 in 30 tries, while using any TurboQuant K&V on Qwen3.5-27B-heretic-v3.i1-Q3_K_M or Qwen3.5-9B-heretic.Q8_0.
Works every time if prompt is short, say just Hello once.
But more than 250-256 tokens, it starts to fail.
The larger the prompt the more likely that it will fail.

Let me know if you need me to provide more details.

Xcc313r4n7 · 2026-04-02T04:21:53Z

Xcc313r4n7
Apr 2, 2026

@TheTom
Trying to test your build but I'm running tensor parallelism via thunderbolt 4 with RPC server. When I try to build for RPC server using your fork, I'm getting an error.

7 replies

Xcc313r4n7 Apr 2, 2026

I'm running across two machines, an rtx 5090 mobile and an rtx 4090 mobile via TB4 bridging

Xcc313r4n7 Apr 2, 2026

error: static assertion failed: GGML_OP_COUNT has changed - update RPC_PROTO_PATCH_VERSION
14 | static_assert(GGML_OP_COUNT == 96, "GGML_OP_COUNT has changed - update RPC_PROTO_PATCH_VERSION");

Xcc313r4n7 Apr 2, 2026

Very niche set up lol doing more than I should be allowed to with what I have.

TheTom Apr 2, 2026

That's a known issue with our fork. We added GGML_OP_TURBO_WHT as a new operation which bumped GGML_OP_COUNT past the 96 that the RPC server expects.

Quick fix in ggml/include/ggml-rpc.h line 14:

// Change from:
static_assert(GGML_OP_COUNT == 96, ...);
// To:
static_assert(GGML_OP_COUNT == 97, ...);

Also bump RPC_PROTO_PATCH_VERSION from 1 to 2 on line 11 of the same file.

Both your 5090 and 4090 machines need to be built from the same fork with this fix, otherwise the RPC protocol version will mismatch.

Xcc313r4n7 Apr 2, 2026

Thanks, builds completed, now to run some tests

Xcc313r4n7 · 2026-04-02T05:13:59Z

Xcc313r4n7
Apr 2, 2026

Damn, getting the bug I was getting the other day that was patched upstream in
#21030 :

[create_node] invalid tensor: null buffer (id=94197996336480)
[graph_compute] failed to create graph node 5 (id=94197996336480)

0 replies

plam40 · 2026-04-02T08:45:25Z

plam40
Apr 2, 2026

Hi, I have been running tests across different forks on RTX 4070ti with those parameters : Qwen3.5-35B-A3B-UD-IQ2_XXS.gguf -ngl 99 -fa on --host 0.0.0.0 --port 8081 -np 1 -t 10 --top_p 0.95 --top_k 20, --min_p 0.0 --presence_penalty 0.0 --repeat_penalty 1.0 --temp .98 --jinja -ctk tbqp3 -ctv tbqp3 --ctx-size 130000 -fit off My testing has been real agentic workload with long running sessions, results so far : 1. WORKS FINE | spiritbuun/llama-cpp-turboquant (https://github.com/spiritbuun ) 2. WORKS FINE | Madrear/turbo3-cuda ( https://github.com/Madreag ) 3. https://github.com/AmesianX/TurboQuant/ - gets “/////////” after some 10k-20k tokens, tried with smaller context same thing. I did build for the specific CUDA platform ( RTX 4070ti - CMAKE_CUDA_ARCHITECTURES=89-real) BR From: AmesianX ***@***.*** Sent: Thursday, April 2, 2026 10:43 AM To: ggml-org/llama.cpp Cc: plam40; Manual Subject: Re: [ggml-org/llama.cpp] TurboQuant - Extreme KV Cache Quantization (Discussion #20969) @fritolays <https://github.com/fritolays> Follow-up: we tested tbq3_0/tbq3_0 on Qwen3.5-27B-UD-Q4_K_XL on our local build (SM 121 native, DGX Spark GB10) and cannot reproduce the "???" issue. Output is normal: Hello! I'm doing well, thanks for asking. How about you? This strongly suggests the issue is build-related (SM 52 vs native). Could you confirm whether you used the release binary or built from source? That will tell us if this is a CUDA arch problem or a real bug we need to fix. — Reply to this email directly, view it on GitHub <#20969?email_source=notifications&email_token=BZD7QBWBYBOMLN5H3R5XDH34TYKZNA5CNFSNUABIM5UWIORPF5TWS5BNNB2WEL2ENFZWG5LTONUW63SDN5WW2ZLOOQXTCNRUGE4TINRVUZZGKYLTN5XKM3LBNZ2WC3FFMV3GK3TUVRTG633UMVZF6Y3MNFRWW#discussioncomment-16419465> , or unsubscribe <https://github.com/notifications/unsubscribe-auth/BZD7QBUWVSLB6HDVYNEN45L4TYKZNAVCNFSM6AAAAACW6IA4XKVHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTMNBRHE2DMNI> . You are receiving this because you are subscribed to this thread. <https://github.com/notifications/beacon/BZD7QBXEOYJ7OPRGYHE3DRD4TYKZNBFCNFSM6AAAAACW6IA4XKWGG33NNVSW45C7OR4XAZNRIRUXGY3VONZWS33OINXW23LFNZ2KUY3PNVWWK3TUL5UWJTQA7KFITJTSMVQXG33OUZWWC3TVMFWA.gif> Message ID: ***@***.***>

1 reply

slashedstar Apr 2, 2026

I'm on Win 11 RTX 4090 and similar here, after some fixes to make it work on Windows gemini got TheTom/turboquant_plus and Madreag/turbo3-cuda to work but on AmesianX/TurboQuant/releases/tag/v1.3.0 I got the repeating character problem on Qwen3.5-0.8B-IQ4_XS.gguf and OpenReasoning-Nemotron-32B-IQ4_XS (#20969)

Xcc313r4n7 · 2026-04-02T10:57:53Z

Xcc313r4n7
Apr 2, 2026

@TheTom
Something I've found very interesting that I thought I'd reach out to you about. This YouTuber took an existing fork of llama.cpp that works with one bit models, brought it up to date with the main fork to test the bonsai one bit model. What he's missing is the full turboquant stack. Might be interesting to see what sort of VRAM savings we might see with the two combined:

https://github.com/Mintplex-Labs/prism-ml-llama.cpp

0 replies

andrei-ace · 2026-04-02T11:25:26Z

andrei-ace
Apr 2, 2026

Related but different direction: I posted a show-and-tell here on an adaptive TurboQuant + RotateKV K-cache path for Qwen2.5:

#21297

Calibration-driven per-layer / per-channel K compression, with asymmetric K/V support.

A few numbers:

1.5B: 6.21 bpv K-only winner, 11.639 PPL vs 11.657 q8_0 and 11.640 f16
7B: 6.41 bpv K-only winner, 7.831 ± 0.300 vs 7.783 ± 0.298 q8_0 and 7.780 ± 0.297 f16
1.5B context sweep stays very close to q8_0/f16 out to 32k

Still a proof of concept and slower on decode, but maybe relevant to people following this thread.

0 replies

seanrasch · 2026-04-02T11:58:23Z

seanrasch
Apr 2, 2026

Ampere (RTX 3080 Ti) Benchmark Data — Mixed KV Parity, PR #36 Validation, PR #43 turbo4 Fix

@TheTom — you mentioned wanting CUDA mixed q8_0 × turbo parity data. Here's a full matrix from an RTX 3080 Ti (SM 86, Ampere, 12GB).

Model: Qwen3.5 9B Q4_K_M (bartowski)
Build: seanrasch/llama-cpp-turboquant master @ e99452f (upstream HEAD + V unpad gate fix)
Config: FA on, ngl=99, single GPU, 10s cooling between runs

Mixed q8_0 × turbo — Speed (t/s)

Cache (K / V)	pp512	pp2048	pp8192	pp16384	tg128
f16 / f16	4108	4198	4081	3888	106.2
q8_0 / q8_0	4063	4101	3984	3803	104.8
q8_0 / turbo3	3988	4027	3915	3746	103.6
q8_0 / turbo2	3988	4025	3915	3748	103.9
turbo3 / turbo3	3984	4020	3909	3739	102.7
turbo2 / turbo2	3978	4013	3904	3739	103.3

Mixed q8_0 × turbo — PPL (wikitext-2, 16K context)

Cache (K / V)	PPL
f16 / f16	6.5706
turbo3 / turbo3	6.6245
q8_0 / turbo3	6.6478

Takeaway: On Ampere, mixed q8_0-K + turbo3-V is within noise of symmetric turbo3/turbo3 for both speed and PPL. The K type doesn't meaningfully affect decode speed on this hardware — turbo K costs ~1% decode vs q8_0 K. Prefill is similarly flat across all turbo configs (~97% of f16). The asymmetric recommendation holds on SM 86.

PR #36 (signalnine FA optimizations) — Ampere A/B

Tested test/pr36-signalnine-fa-opt @ a75045b against master. Same model, same hardware.

Cache (K / V)	tg128 master	tg128 PR #36	Delta
f16 / f16	106.2	105.5	-0.7%
q8_0 / q8_0	104.8	104.6	-0.2%
turbo3 / turbo3	102.7	103.7	+1.0%
turbo2 / turbo2	103.3	104.0	+0.7%

PPL: symmetric turbo3 shifted from 6.6245 to 6.6478 — consistent with auto-asymmetric overriding to q8_0 K (matches the q8_0/turbo3 baseline exactly). Auto-asymmetric is working correctly.

Takeaway: PR #36 gives +35% decode on RTX 5090 but only +1% on Ampere. No regression on any config. The shmem LUT and occupancy tuning are Blackwell-optimized — Ampere's smaller shared memory doesn't benefit as much. Safe to merge from an Ampere perspective.

PR #43 (Dubascudes turbo4 WHT dequant fix) — Validated

We previously reported turbo4 K+V producing PPL 6952 on Llama arch (NeuralDaredevil 8B). Tested PR #43 @ 033742c on Qwen3.5 9B:

Cache (K / V)	PPL (2K)	Status
turbo4 / turbo4	7.3859	Fixed
turbo4 K / turbo3 V	7.4075	Sane
turbo3 K / turbo4 V	7.3648	Sane

turbo4/turbo4 speed: pp512=3970, pp16384=3707, tg128=101.8

The WHT/dense rotation mismatch in the C reference was indeed the root cause. turbo4 K+V is now functional on CUDA.

Hardware Note

RTX 3080 Ti (12GB, SM 86, Ampere, PCIe Gen 3 x16). This is representative of the 30-series/40-series consumer cards that most people are actually running. Happy to run additional configs if useful.

0 replies

primoco · 2026-04-02T13:26:36Z

primoco
Apr 2, 2026

Hi, sharing task-accuracy benchmarks as a complement to PPL — since as @TheTom noted, PPL and generation quality can diverge.

We ran a KV cache recall benchmark on Qwen3-14b (no-think mode) using the spiritbuun CUDA fork, testing whether quantization corrupts recall of information stored earlier in context. The method: place a math problem at position 0, fill with N tokens of unrelated text, ask the model to recall and solve.

52 tests: 2×2 matrix mult, 3×3 matrix mult, scalar arithmetic — at 0/200/500/1000 token filler distances.

Cache config	Accuracy	Tok/s	KV VRAM
F16 / F16	52/52 — 100%	73.1	100%
q8_0-K / tq4_0-V	52/52 — 100%	73.5	~62%
tq4_0-K / tq4_0-V	49/52 — 94.2%	56.0	~53%
tq3_0-K / tq3_0-V	47/52 — 90.4%	56.4	~44%

Key findings:

Asymmetric q8_0-K / tq4_0-V matches F16 exactly — same accuracy, same throughput, 38% less KV VRAM. This directly confirms @scos-lab's K/V magnitude disparity finding and @TheTom's "V is free" observation.
The throughput paradox is caused by quantized K. Symmetric TQ drops from 73 to 56 tok/s (-23%). Keeping K at q8_0 eliminates the FWHT overhead penalty entirely. This explains why some people see TQ slower than baseline without Flash Attention fully compensating.
Degradation only appears under context pressure with quantized K. Direct recall (no filler) is 100% across all configs. 3×3 matrix recall at 1000t filler: TQ3_0 drops to 1/3, q8k/tq4v stays 3/3.
Scalar arithmetic is completely immune at all distances and all quantization levels.

The benchmark script is open source: https://github.com/eullm/eullm/blob/main/bench/turboquant_math_accuracy.py

Tested on EULLM Engine (Ollama-compatible Rust inference server) wrapping the spiritbuun CUDA fork. ctx=16384, temperature=0, num_predict=2048.

0 replies

TurboQuant - Extreme KV Cache Quantization #20969

Uh oh!

Replies: 64 comments · 184 replies

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

CISC Mar 25, 2026 Collaborator

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Couldn't wait, so I spun something up; hopefully, it helps the final implementation. Feel free to cherry-pick :)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Replies: 64 comments 184 replies

CISC Mar 25, 2026
Collaborator