ggml : add CPU TurboQuant KV cache types (TBQ3_0 / TBQ4_0) #21089

Closed
elusznik wants to merge 3 commits into ggml-org:master from elusznik:turboquant-cpu-tbq-pr

@elusznik

elusznik commented Mar 27, 2026

Summary

This PR adds CPU-only TurboQuant KV-cache support for two new cache types:

  • tbq3_0
  • tbq4_0

The scope is intentionally narrow for the first PR:

  • CPU-only
  • KV-cache types only
  • TBQ only (TBQP / Q-prod is left for follow-up work)

That keeps the initial landing aligned with the contributor guidance for new features and new ggml_type additions: start with CPU support first, keep the PR reviewable, and add backend support in follow-up PRs.

What changed

  • add GGML_TYPE_TBQ3_0 and GGML_TYPE_TBQ4_0
  • add block layouts and CPU quantize / dequantize support
  • add CPU vec_dot support so CPU flash attention can consume the new KV types
  • wire the new types into ggml type traits and quantization entry points
  • allow tbq3_0 / tbq4_0 in CLI KV-cache arguments
  • add llama-bench and quantize support for the new types
  • add CPU regression coverage in test-quantize-fns
  • add backend-op coverage for GET_ROWS, SET_ROWS, CPY, and FLASH_ATTN_EXT

Why this scope

I started from a broader TurboQuant implementation, but for the first upstream PR I cut the surface down to the part that is strongest on the current CPU-only evaluation:

  • tbq4_0 is the best-balanced TurboQuant point here
  • tbq3_0 is the memory-first option
  • the wider TBQP / split-outlier path is better handled as follow-up work after the plain TBQ CPU base lands

Block layout

  • tbq3_0: 98 bytes / 256 elements = 3.0625 bits / element
  • tbq4_0: 130 bytes / 256 elements = 4.0625 bits / element
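
The bits-per-element figures above follow directly from the block sizes. Here is a minimal sketch of that arithmetic in C. The helper names are hypothetical, and the split into packed codes plus a 2-byte per-block header is an assumption that happens to reproduce the 98- and 130-byte totals; it is not read from ggml-common.h.

```c
#include <assert.h>
#include <stddef.h>

/* Bits of storage per cached element for a block of `block_bytes` bytes
 * that encodes `n_elements` values. */
static double bits_per_element(size_t block_bytes, size_t n_elements) {
    assert(n_elements > 0);
    return (double)(block_bytes * 8) / (double)n_elements;
}

/* Hypothetical size model: `bits` per element, packed and rounded up to
 * whole bytes, plus `header_bytes` of per-block metadata (ASSUMED to be a
 * 2-byte scale in the usage note; the real layout may differ). */
static size_t packed_block_bytes(size_t n_elements, size_t bits, size_t header_bytes) {
    return (n_elements * bits + 7) / 8 + header_bytes;
}
```

Under that assumption, `packed_block_bytes(256, 3, 2)` gives 98 and `packed_block_bytes(256, 4, 2)` gives 130, and `bits_per_element` maps those back to 3.0625 and 4.0625.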

CPU results

Model: Qwen3.5-4B-Q4_K_M.gguf

Settings:

  • CPU only
  • 4 threads
  • flash_attn=on
  • llama-bench with pp32/tg8
  • llama-perplexity on wikitext-2-raw/wiki.test.raw
  • ctx=256, chunks=4

| Cache type | Prompt t/s | Gen t/s | KV MiB | Compression vs f16 | PPL | KLD | RMS Δp | Same top p |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| f16 | 50.67 | 15.72 | 64.00 | 1.00x | 13.8387 | 0.00000 | 0.000% | 100.000% |
| q8_0 | 50.63 | 15.67 | 34.00 | 1.88x | 13.8348 | 0.00320 | 1.510% | 97.835% |
| q4_0 | 50.46 | 15.64 | 18.00 | 3.56x | 13.8400 | 0.00912 | 2.179% | 93.898% |
| tbq3_0 | 46.19 | 8.29 | 12.25 | 5.22x | 14.3198 | 0.02647 | 4.471% | 91.732% |
| tbq4_0 | 45.84 | 8.31 | 16.25 | 3.94x | 13.8323 | 0.00960 | 2.892% | 94.094% |

Key takeaways:

  • tbq4_0 is the best-balanced TurboQuant point in this CPU-only sweep.
  • tbq4_0 reduces KV cache below stock q4_0 while keeping similar KLD and slightly better perplexity in this run.
  • tbq3_0 pushes KV memory lower again, with the expected quality tradeoff.
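
The KV MiB and compression columns in the results table are consistent with simple scaling from the f16 baseline by bits per element. A small sketch of that relationship follows; the function names are hypothetical and the code is only a cross-check of the table, not part of the PR.

```c
#include <assert.h>

/* Expected KV-cache size when an f16 cache of `f16_mib` MiB is re-encoded
 * at `bits_per_elem` bits per element (f16 itself uses 16 bits). */
static double kv_mib_at(double f16_mib, double bits_per_elem) {
    assert(bits_per_elem > 0.0);
    return f16_mib * bits_per_elem / 16.0;
}

/* Compression ratio relative to the f16 cache size. */
static double compression_vs_f16(double f16_mib, double kv_mib) {
    assert(kv_mib > 0.0);
    return f16_mib / kv_mib;
}
```

With the 64 MiB f16 baseline, 3.0625 and 4.0625 bits per element give 12.25 and 16.25 MiB, matching the tbq3_0 and tbq4_0 rows. The stock q8_0 and q4_0 blocks store 8.5 and 4.5 bits per element, which gives the 34 and 18 MiB shown for those rows.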

Plots

  • [Plot: KV cache memory usage]
  • [Plot: Throughput]
  • [Plot: Compression vs speed]
  • [Plot: Ablation: KV size vs KLD]

Validation

Built locally:

  • cmake -S . -B build-cpu-pr -DCMAKE_BUILD_TYPE=Release
  • cmake --build build-cpu-pr --target test-quantize-fns test-backend-ops llama-bench llama-cli llama-perplexity -j4

Checks run:

  • ./build-cpu-pr/bin/test-quantize-fns
  • ./build-cpu-pr/bin/test-backend-ops test -b CPU -o GET_ROWS,SET_ROWS,CPY,FLASH_ATTN_EXT -p 'tbq'
  • llama-bench CPU comparison vs f16, q8_0, q4_0
  • llama-perplexity + KL divergence comparison vs f16

Follow-up work

Planned follow-ups after this CPU base:

  • TBQP / Q-prod variants
  • split outlier path
  • ROCm backend support
  • CUDA backend support

Acknowledgements

This work was informed by:

AI usage disclosure

AI tools were used in an assistive capacity for exploration, mechanical refactoring, test/benchmark scripting, and draft review text. The code and measurements in this PR were manually reviewed locally, the relevant checks were run manually, and I can explain the submitted changes and benchmark setup in detail.

github-actions bot added the testing, examples, server, and ggml labels Mar 27, 2026
@ggml-gh-bot

ggml-gh-bot Bot commented Mar 28, 2026

Hi @elusznik, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@elusznik

Author

Issue #20977

@elusznik elusznik marked this pull request as ready for review March 28, 2026 00:06
@elusznik elusznik requested review from a team, CISC, ggerganov and ngxson as code owners March 28, 2026 00:06
Copilot AI review requested due to automatic review settings March 28, 2026 00:06

Copilot AI left a comment

Pull request overview

This PR introduces two new CPU-only TurboQuant KV-cache ggml types (tbq3_0, tbq4_0) and wires them through ggml’s type system, CPU quantize/dequantize + vec_dot, llama KV/graph handling, tooling, and tests so they can be selected as KV cache formats and consumed by CPU flash-attention.

Changes:

  • Add GGML_TYPE_TBQ3_0 / GGML_TYPE_TBQ4_0 (plus ftype plumbing) with block layouts, quantize/dequantize, and CPU vec_dot support.
  • Update llama KV-cache views + attention graph to handle TBQ tensors (cast + reshape for attention).
  • Expose types in CLI/tools docs and add regression/tests coverage (test-quantize-fns, test-backend-ops).

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
| --- | --- |
| tools/server/README.md | Document tbq3_0/tbq4_0 as allowed KV cache types. |
| tools/quantize/quantize.cpp | Add quantize tool options for TBQ ftypes. |
| tools/llama-bench/llama-bench.cpp | Allow parsing tbq3_0/tbq4_0 type names. |
| tools/completion/README.md | Document TBQ cache types for completion tool args. |
| tools/cli/README.md | Document TBQ cache types for CLI args. |
| tests/test-quantize-fns.cpp | Add TBQ dispatch + table/codebook checks and error thresholds. |
| tests/test-backend-ops.cpp | Add backend-op coverage for TBQ in GET_ROWS/SET_ROWS/CPY/FLASH_ATTN_EXT. |
| src/llama-quant.cpp | Add TBQ ftype/type mapping + fallback behavior. |
| src/llama-kv-cache.cpp | Add TBQ-specific KV views (3D) for K/V retrieval. |
| src/llama-graph.cpp | Cast+reshape TBQ KV tensors to feed flash/non-flash attention. |
| include/llama.h | Add llama ftype enum entries for TBQ. |
| ggml/src/ggml.c | Register TBQ type traits, ftype mapping, and quantize chunk dispatch. |
| ggml/src/ggml-turboq.h | New TurboQuant helper API header. |
| ggml/src/ggml-turboq.c | New TurboQuant helpers + TBQ3/TBQ4 quantize/dequantize implementations. |
| ggml/src/ggml-turboq-tables.h | New TurboQuant Lloyd-Max codebooks/boundaries. |
| ggml/src/ggml-quants.h | Declare TBQ quantize/dequantize entry points. |
| ggml/src/ggml-quants.c | Add row-data validation for TBQ blocks. |
| ggml/src/ggml-cpu/quants.h | Add CPU quantize + vec_dot declarations for TBQ. |
| ggml/src/ggml-cpu/quants.c | Add CPU quantize wrappers and TBQ vec_dot (dequantize-then-dot) fallback. |
| ggml/src/ggml-cpu/ops.cpp | Extend DUP handling for quantized->F16/BF16 and adjust quantized dup flow. |
| ggml/src/ggml-cpu/ggml-cpu.c | Register TBQ CPU type traits (from_float, vec_dot, vec_dot_type). |
| ggml/src/ggml-cpu/arch-fallback.h | Add tbq vec_dot fallback renames for some architectures. |
| ggml/src/ggml-common.h | Define block_tbq3_0 / block_tbq4_0 layouts. |
| ggml/src/CMakeLists.txt | Build and install TurboQuant sources/headers into ggml-base. |
| ggml/include/ggml.h | Add new ggml type + ftype enum values. |
| common/arg.cpp | Allow TBQ types in --cache-type-k/--cache-type-v parsing and help text. |
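
The "dequantize-then-dot" fallback listed for ggml/src/ggml-cpu/quants.c can be pictured with a toy example. This is a sketch under loud assumptions and not the PR's code: `toy_block4`, `toy_level`, the 8-element block, and the evenly spaced 16-level table are all hypothetical, and the real TBQ path also applies a rotation and dots against q8_K blocks rather than plain floats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BLK 8 /* toy block size; the PR's blocks hold 256 elements */

/* HYPOTHETICAL block: one float scale plus 4-bit indices, two per byte
 * (low nibble first). This is not the block_tbq4_0 layout. */
typedef struct {
    float   d;
    uint8_t qs[TOY_BLK / 2];
} toy_block4;

/* ILLUSTRATIVE 16-level table: evenly spaced values in [-1, 1). */
static float toy_level(uint8_t idx) {
    assert(idx < 16);
    return ((float)idx - 8.0f) / 8.0f;
}

/* Step 1: expand one block into TOY_BLK floats. */
static void toy_dequantize(const toy_block4 *b, float *out) {
    for (size_t i = 0; i < TOY_BLK / 2; i++) {
        out[2 * i]     = b->d * toy_level((uint8_t)(b->qs[i] & 0x0F));
        out[2 * i + 1] = b->d * toy_level((uint8_t)(b->qs[i] >> 4));
    }
}

/* Step 2: ordinary float dot product against the other operand. */
static float toy_vec_dot(const toy_block4 *blocks, size_t n_blocks, const float *y) {
    float sum = 0.0f;
    float tmp[TOY_BLK];
    for (size_t ib = 0; ib < n_blocks; ib++) {
        toy_dequantize(&blocks[ib], tmp);
        for (size_t j = 0; j < TOY_BLK; j++) {
            sum += tmp[j] * y[ib * TOY_BLK + j];
        }
    }
    return sum;
}
```

The design trade-off is simplicity over speed: the temporary buffer makes the kernel easy to validate against the dequantize routine, at the cost of an extra pass over each block.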

Comment thread ggml/src/ggml-cpu/arch-fallback.h
Comment thread ggml/src/ggml-cpu/ops.cpp Outdated
Comment thread tests/test-backend-ops.cpp
Comment thread common/arg.cpp
@animehacker

animehacker commented Mar 28, 2026

I've been working on extending unixsysdev's tq3_0 implementation with V cache support and flash attention. Repo here: https://github.com/animehacker/llama-turboquant

What this adds on top of unixsysdev's work:

  • Normalization fix (1/32 → 1/√32 for the asymmetric K-side WHT)
  • V cache compression (non-transposed storage + graph-side dequant to work around GGML's element-scatter path)
  • Flash attention with tq3_0 (dequant tq3_0 → F32 → F16 in the attention graph, then use existing FA kernel)
  • CPU backend F32 dequant path for pipeline parallelism

Tested on Llama-3.3-70B-Instruct-Q4_K_M, 2x RTX 3090:

  • 72K context with tq3_0 K+V (4.57x compression)
  • WikiText-2 PPL: 4.40 vs 4.09 baseline (+7.6%)

To be clear: this implements PolarQuant (Stage 1) only — WHT rotation + 3-bit Lloyd-Max. QJL residual correction is not included.
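
The 3-bit Lloyd-Max step mentioned above amounts to mapping each rotated, scaled value to the nearest entry of an 8-level codebook. Below is a hedged sketch of that lookup. The function names are hypothetical, and the table values only approximate the classic 8-level Lloyd-Max levels for a unit Gaussian; they are not copied from either implementation discussed in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* ILLUSTRATIVE 8-entry (3-bit) codebook, sorted ascending. Approximate
 * Lloyd-Max levels for a unit Gaussian; NOT taken from any TBQ table. */
static const float CODEBOOK3[8] = {
    -2.1520f, -1.3439f, -0.7560f, -0.2451f,
     0.2451f,  0.7560f,  1.3439f,  2.1520f,
};

/* Return the index of the centroid nearest to x (linear scan). */
static uint8_t quantize3(float x) {
    uint8_t best = 0;
    float best_err = x > CODEBOOK3[0] ? x - CODEBOOK3[0] : CODEBOOK3[0] - x;
    for (uint8_t i = 1; i < 8; i++) {
        const float c = CODEBOOK3[i];
        const float err = x > c ? x - c : c - x;
        if (err < best_err) {
            best_err = err;
            best = i;
        }
    }
    return best;
}

/* Map a 3-bit index back to its centroid value. */
static float dequantize3(uint8_t idx) {
    assert(idx < 8);
    return CODEBOOK3[idx];
}
```

For example, `quantize3(0.3f)` selects index 4 (centroid 0.2451), and values far outside the table saturate to the outermost centroid.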

Paper with implementation details: https://oliverchurch.com/turboquant-for-ggml-achieving-4.57x-kv-cache-compression-in-llama.cpp.html

@animehacker

I've been working on a TurboQuant implementation in llama.cpp's GGML framework (CUDA backend, tested on Llama-3.3-70B with 2x RTX 3090s). A few findings that might be useful for the vLLM implementation:

  1. The normalization factor for the WHT needs to be asymmetric: 1/√32 on the K-side during quantization, unnormalized on the Q-side. Using 1/32 (symmetric) produces plausible-looking but semantically broken output that's hard to catch without perplexity benchmarks.
  2. V cache compression is essentially free in terms of quality. In my WikiText-2 benchmarks, K-only compression cost +6.6% PPL while adding V compression on top only added another +1% for 4.57x total compression.
  3. For long context, dequanting to F16 and feeding into flash attention works well. Memory goes from O(n²) to O(n), which is what got us from 16K to 72K context.

Paper with implementation details: https://oliverchurch.com/turboquant-for-ggml-achieving-4.57x-kv-cache-compression-in-llama.cpp.html
Repo: https://github.com/animehacker/llama-turboquant

Happy to compare notes.

@elusznik

Author

Addressed the actionable points from the Copilot review in 0aae7d78c:

  • fixed TBQ fallback symbol remaps for ARM/RISC-V in ggml-cpu/arch-fallback.h
  • fixed TBQ flash-attn AUTO validation to use merged GQA row widths in llama-context.cpp
  • removed the extra temp-buffer path for quantized -> F32 dup in ggml-cpu/ops.cpp
  • trimmed the redundant CPY test expansion while keeping the new quantized coverage in test-backend-ops.cpp

Re-ran:

  • cmake --build build-cpu-pr --target test-quantize-fns test-backend-ops llama-cli -j4
  • ./build-cpu-pr/bin/test-quantize-fns
  • ./build-cpu-pr/bin/test-backend-ops test -b CPU -o GET_ROWS,SET_ROWS,CPY,FLASH_ATTN_EXT -p 'tbq'
  • llama-bench smoke run with tbq4_0/tbq4_0

@CuriosityQuantified

Hi @elusznik — great work on this PR. I've been running the turboquant-cpu-tbq-pr branch on an Apple Silicon M4 Mac mini (16GB unified memory) and implemented the NEON optimization listed in your follow-up roadmap. Sharing results and code here — I've also pushed a working branch at CuriosityQuantified/llama.cpp:neon-arm-optimization if you want to diff directly.


Build fix: arch-fallback.h missing ARM64 defines (blocks build on aarch64)

Before anything else — the branch does not build on ARM64 without this fix. The __aarch64__ section of ggml/src/ggml-cpu/arch-fallback.h is missing the two TBQ _generic symbol aliases, causing a linker failure:

// ggml/src/ggml-cpu/arch-fallback.h  (around line 81, inside #elif defined(__aarch64__))
#define ggml_vec_dot_tbq3_0_q8_K_generic ggml_vec_dot_tbq3_0_q8_K
#define ggml_vec_dot_tbq4_0_q8_K_generic ggml_vec_dot_tbq4_0_q8_K

NEON kernels

ggml/src/ggml-cpu/quants.c: ggml_vec_dot_tbq3_0_q8_K_generic and ggml_vec_dot_tbq4_0_q8_K_generic (identical pattern for both):

int j = 0;
#if defined(__ARM_NEON)
// Two independent accumulators, 8 elements per iteration.
float32x4_t acc0 = vdupq_n_f32(0.0f);
float32x4_t acc1 = vdupq_n_f32(0.0f);
for (; j + 7 < QK_K; j += 8) {
    const float32x4_t tv0 = vld1q_f32(tmp + idx + j);
    const float32x4_t tv1 = vld1q_f32(tmp + idx + j + 4);
    // Widen 8 int8 quants: s8 -> s16 -> s32 -> f32.
    const int8x8_t qi = vld1_s8(y[i].qs + j);
    const int16x8_t qi16 = vmovl_s8(qi);
    const float32x4_t qf0 = vcvtq_f32_s32(vmovl_s16(vget_low_s16(qi16)));
    const float32x4_t qf1 = vcvtq_f32_s32(vmovl_s16(vget_high_s16(qi16)));
    acc0 = vfmaq_f32(acc0, tv0, qf0);
    acc1 = vfmaq_f32(acc1, tv1, qf1);
}
// Horizontal add of both accumulators, scaled once by d.
sumf += d * vaddvq_f32(vaddq_f32(acc0, acc1));
#endif
// scalar tail (the full loop when NEON is unavailable)
for (; j < QK_K; j++) {
    sumf += tmp[idx + j] * (d * y[i].qs[j]);
}
idx += QK_K;

ggml/src/ggml-turboq.c: matvec_row and matvec_t (add after existing AVX2 block):

#elif defined(__ARM_NEON)
float32x4_t acc0 = vdupq_n_f32(0.0f);
float32x4_t acc1 = vdupq_n_f32(0.0f);
for (; j + 7 < d; j += 8) {
    acc0 = vfmaq_f32(acc0, vld1q_f32(row + j),     vld1q_f32(x + j));
    acc1 = vfmaq_f32(acc1, vld1q_f32(row + j + 4), vld1q_f32(x + j + 4));
}
for (; j + 3 < d; j += 4) {
    acc0 = vfmaq_f32(acc0, vld1q_f32(row + j), vld1q_f32(x + j));
}
sum += vaddvq_f32(vaddq_f32(acc0, acc1));

Benchmark results — 8K context, Qwen3.5-4B-Q4_K_M, M4 Mac mini, 4 threads

Build: -DCMAKE_BUILD_TYPE=Release, Metal enabled, flash attention on, -nkvo 1 (KV on CPU, model weights on Metal GPU)

| KV type | pp t/s (generic C) | pp t/s (NEON) | delta | compression |
| --- | --- | --- | --- | --- |
| f16 | 312 | 306 | | 1.0× |
| q4_0 | 307 | 291 | | 3.6× |
| tbq4_0 | 258 | 276 | +18 t/s (+7%) | 3.9× |
| tbq3_0 | 253 | 274 | +21 t/s (+8%) | 5.2× |

The gap to q4_0 narrows from ~50 t/s → ~16 t/s after NEON. The residual cost is the 128×128 Hadamard rotation matmuls (two dense matmuls per 256-element TBQ block). Closing that fully would require a structured butterfly/WHT transform at the quantization algorithm level, not a kernel change.
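
For readers unfamiliar with the "structured butterfly/WHT transform" mentioned above, here is a generic in-place fast Walsh-Hadamard transform in C. It is a textbook sketch with a hypothetical function name, not code from this branch, and it is not claimed to be a drop-in replacement for the PR's rotation matrices.

```c
#include <assert.h>
#include <stddef.h>

/* In-place fast Walsh-Hadamard transform with orthonormal scaling.
 * `n` must be a power of two. Each of the log2(n) stages is scaled by
 * 1/sqrt(2), so the overall factor is 1/sqrt(n). */
static void fwht_orthonormal(float *x, size_t n) {
    const float inv_sqrt2 = 0.70710678f;
    assert(n > 0 && (n & (n - 1)) == 0);
    for (size_t h = 1; h < n; h *= 2) {
        for (size_t i = 0; i < n; i += 2 * h) {
            for (size_t j = i; j < i + h; j++) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = (a + b) * inv_sqrt2; /* butterfly: sum        */
                x[j + h] = (a - b) * inv_sqrt2; /* butterfly: difference */
            }
        }
    }
}
```

For n = 128 this runs 7 stages of 64 butterflies, about 896 additions and subtractions, compared with 16,384 multiply-adds for a dense 128×128 matrix-vector product. Because the orthonormal transform is its own inverse, applying it twice recovers the input up to rounding.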


Apple Silicon note

Running with -ngl 99 -ctk tbq4_0 (Metal KV offload) crashes — Metal backend does not support SET_ROWS for TBQ types. -nkvo 1 is the workaround (KV stays on CPU, model layers on Metal). Not a blocker for this PR scope, just worth noting for anyone testing on Apple Silicon.


Happy to submit a follow-up PR with these changes once this lands — or fold them in here if you prefer. Let me know what works best.

@elusznik

Author

Hello @CuriosityQuantified, thanks for your input. Unfortunately I do not have ARM64 experience, so I couldn't really do it myself. When it comes to the PR, I think a separate request would be more in line with the contribution guidelines of this project.

PedroRossi added a commit to PedroRossi/llama.cpp that referenced this pull request Mar 30, 2026
Based on PR ggml-org#21089 (CPU TurboQuant by elusznik), this adds CUDA kernel
support for the TBQ3_0 and TBQ4_0 KV cache quantization types.

New files:
- turboq.cu: GPU rotation matrix init, CUDA dequantize/quantize kernels
  - 128 threads/block, shared memory for codebook decode
  - O(d²) rotation matvec per block via global memory
- turboq.cuh: Kernel declarations

Modified files:
- set-rows.cu: Custom TBQ quantize dispatch
- convert.cu: TBQ→F32/F16 row dequantize
- cpy.cu: TBQ→F32/F16 copy (enables GPU-side ggml_cast in attention)
- ggml-cuda.cu: TBQ in SET_ROWS + CPY capability checks
- arch-fallback.h: ARM build fix (missing TBQ vec_dot macros)
- CMakeLists.txt: turboq.cu added to build

Key fix: Adding TBQ types to GGML_OP_CPY capability check enables the
existing ggml_cast() dequantize path in llama-graph.cpp to run on GPU,
improving generation from 2 → 9.5 tok/s (Llama 3B, GTX 1660 Super).

Benchmark (Llama 3.2 3B Q4_K_M, GTX 1660 Super 6GB):
- Prefill: 308 tok/s (4x baseline 75 tok/s)
- Generation: 9.5 tok/s (22% of baseline 42 tok/s)
- Max context: ~98K tokens (2x baseline ~49K)

The O(d²) rotation in dequantize remains the generation bottleneck.
Fused flash attention kernels would eliminate this overhead.
@CuriosityQuantified

@elusznik — understood, separate PR it is. I've opened it against your fork: elusznik#1

It's scoped to just the two changes from my comment above — the arch-fallback.h ARM64 fix and the NEON kernels. No other modifications. Should be easy to review and fold in whenever this lands upstream.

@JohnAlcatraz

The numbers for tbq3_0 look great, so I hope this will be merged.

@ekryski

ekryski commented Apr 5, 2026

fwiw there are those of us (@TheTom, myself, and many others) who have been working, and continue to work, on GPU-optimized implementations. The CPU one is by far going to be the most underwhelming if you're comparing benchmarks.

Yes, now upstream does the rotation which is a lot of the work of TurboQuant (actually PolarQuant if you don't include QJL which kills speed but whatever). This was a good call out by @Mushoz! However, as @TheTom mentions above there are many other improvements that make getting this started worthwhile. Asymmetric KV compression is actually massive and works regardless of TurboQuant or not. In mine, Tom's and other people's extensive testing it is holding up very well - massive improvement in quantization, improvement in prefill and decode speed, with very little to even improved PPL and KLD. Truly a thing of beauty that @TheTom discovered there last week.

I very much respect @elusznik's intent to stick to guidelines and make this a minimal change in order to get the ball rolling. It's intimidating coming into such a popular repo with so much churn so respect and appreciation where it's due. With all due respect @ggerganov (and I have an immense amount), shitting on it and just calling it slop without any actual critique is just rude and makes you look like a jerk. I understand there's a lot of AI shit PRs around but this ain't one of them.

My two cents: I think it is in everyone's best interest (the OS AI community) to help ensure that we don't have a ton of forks with disparate but material performance improvements littered about.

Hoping this lands (or something similar) and some of us can layer on discrete upstream PRs for some of the further decode, prefill and memory improvements we've identified.

@ZhaoanTan

It looks good; eagerly waiting for the code to be merged.

@emircanerkul

No news so far? Maybe I expect too much, but seeing only a 10% improvement made me sad, although it looks like that is CPU-only. I want to use this on a 6800 XT GPU. I am currently using gemma-4-26B-A4B-it-UD-IQ4_XS.gguf and just want to get the best result from it. And indeed, local LLMs are getting better each day, which is great news.

@elusznik

elusznik commented Apr 6, 2026

Author

@ekryski thanks for the endorsement and the kind words, the slop comment made me feel kinda bad after putting in a couple of days' serious work into this

@Green-Sky

Collaborator

Instead of looking at new quants, you can take a look at existing quants for kv cache here: #21551

@Green-Sky

Green-Sky commented Apr 7, 2026

Collaborator

> Asymmetric K/V is critical for some models
> This is the biggest practical finding from my testing. Symmetric turbo (same type for K and V) is catastrophic on certain model families with Q4_K_M weights:

While looking at existing quants, I found an outlier kv quant pair that seems to perform better than quant neighbors. q3_K for K and q2_K for V.

w00jay added a commit to w00jay/llama.cpp-turoquant-research that referenced this pull request Apr 14, 2026
Research of TurboQuant paper, QJL reference code, and community implementations
reveals critical insights:
- QJL should be dropped entirely (MSE-only beats MSE+QJL in practice)
- Nobody uses TurboQuant for V (all use group quant or fp16)
- Without QJL, TBQ4 gets 16 centroids (matching q4_0 level count)
- PR ggml-org#21089 got PPL=9.046 (matching our 9.53) — gap vs q4_0 is expected
- The paper reports LongBench/NIAH, not perplexity

Also adds build time tracking log documenting CUDA template compilation
issues (ptxas uses 36GB+ RAM, 2+ hours for fattn.cu with VEC templates).
MrLordCat pushed a commit to MrLordCat/llama.cpp-with-GUI that referenced this pull request Apr 14, 2026
@toadlyBroodle

The K3V2 / softmax-error-floor finding in #21591 (NexusQuant, by @jagmarques) quantifies the K-side bitcount question that's open here, and complements @TheTom's q8_0-K asymmetric data with a different recovery path on the architecture-sensitive cases.

Diminishing returns on K bitcount (NexusQuant, Mistral-7B):

Going from 3-bit to 4-bit keys: +0.06pp PPL. Going from 4 to 5: rounding noise. The reported mechanism: K feeds softmax, so K-quantization noise propagates across all positions, but the softmax error floor is hit at ~3 bits; further K precision buys nothing. V is linearly combined so V noise stays proportional, and lower V bitcount remains cheap.

Architecture-dependent layer-boundary breakage:

NexusQuant reports the Qwen2.5-7B catastrophic break with symmetric KV quant recovers by protecting the first and last 2 transformer layers at fp16, leaving the rest of K at low bitcount. This is layer-position protection, distinct from @TheTom's token-position boundary-V in layer-aware-v-compression.md. Mistral and Phi-3 don't need the protection per their data.

Two implications for the asymmetric K/V configs this PR is being asked to support:

  1. K8 is likely over-allocated. TheTom's q8_0-K / turbo3-V recovers Qwen2.5-7B from PPL 3,556 to 6.71, but per NexusQuant's K-bitcount curve the K side could drop to 3 bits at near-zero PPL cost and ~2.7x further K-side compression. Worth measuring asymmetric tbq3_0-K / tbq3_0-V (or q3_K / tbq3_0) against the K8 asymmetric baseline.

  2. Layer-boundary fp16 is a complementary lever. On architectures where bulk-layer K3 still breaks, protecting layers 0-1 and N-1, N at fp16 may recover quality without raising bulk K bitcount across the rest of the stack. Cheaper at long context than q8_0 across all layers.

Refs:

@jagmarques

jagmarques commented May 17, 2026

On Qwen2.5-7B-Instruct, per-head fp16 masking of the lowest-2% KV-heads matches K3V2 boundary-protect on retrieval at 2.28 vs 3.93 K-bits/element. Not a quant type, so orthogonal to this PR, but a relevant knob for the K-bit/arch-sensitivity question.

@NovNovikov

So because the CPU-only implementation brings only a small improvement, the CUDA implementation, which brings a bigger improvement, is blocked by design, since a CUDA implementation is only allowed after the CPU one has been accepted?

@Green-Sky

Collaborator

No. It's not a question of speed.

@pwilkin

pwilkin commented Jun 1, 2026

Member

No, the reason TurboQuant quants are not being added is because:
-> nobody has yet proven that they're better than quants of similar bpw in terms of quality (KLD)
-> the main novelty with Hadamard rotations has already been merged into the core llama.cpp quants
-> there's otherwise no reason to add new quant types if there's no good justification for them

@TheTom

TheTom commented Jun 1, 2026

> No, the reason TurboQuant quants are not being added is because:
> -> nobody has yet proven that they're better than quants of similar bpw in terms of quality (KLD)
> -> the main novelty with Hadamard rotations has already been merged into the core llama.cpp quants
> -> there's otherwise no reason to add new quant types if there's no good justification for them

A couple of specifics on @pwilkin's three points, since most of the data is already in this PR and the surrounding thread.

On KLD at similar bpw: the table in this PR body answers it directly. On Qwen3.5-4B-Q4_K_M:

  • q4_0: PPL 13.8400, KLD 0.00912, 3.56x compression
  • tbq4_0: PPL 13.8323 (lower than q4_0), KLD 0.00960 (within noise), 3.94x compression

So tbq4_0 lands at slightly better PPL than q4_0 AND higher compression AND comparable KLD, in the PR's own CPU sweep. For an independent third-party reproduction at equivalent 4-bit V budget, @unamedkr / quantumaikr ported the algorithm to single-header C in https://github.com/quantumaikr/quant.cpp and measured Llama 3.2 3B turbo_kv_4b at PPL 14.28 (+5.3% vs FP32) vs llama.cpp q4_0 KV at PPL ~14.99 (+10.6%) on the same model. Same nominal 4-bit V cache budget, TurboQuant lower PPL.

Cross-model data from my own testing (50+ model/hardware combos, asymmetric K/V configs that survive aggressive weight quantization) is in my earlier comment and detailed at https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/asymmetric-kv-compression.md. KLD per-bit budget across MoE and dense variants: https://github.com/TheTom/turboquant_plus/blob/main/docs/kl-divergence-results.md.

On the rotation in upstream: PR #21038 implements rotation only. Quoting the PR body: "The implementation does not introduce new types and is compatible with all existing quantizations" and "I don't know what is the impact of the remaining techniques explained in TurboQuant (PolarQuant, QJL, etc.). They could be important and can potentially improve further on top of this."

That's the design space TBQ in this PR and TurboQuant+ in mine occupy. Per Zandieh et al. (ICLR 2026, arxiv 2504.19874), TurboQuant pairs WHT rotation with Lloyd-Max-optimal codebook quantization tuned for the post-rotation distribution. The codebook is not in upstream, and that codebook is what enables sub-4-bpw operation at the quality envelope in the PR table. Rotation alone wrapping q4_0 keeps q4_0's compression and quality profile, which is a different design point.

On intent: this is opt-in via the existing --cache-type-k/v knob, alongside q4_0, q8_0, etc. The target users are hardware-constrained deployments where the bpw reduction unlocks a model size or context length that wouldn't otherwise fit (consumer GPUs at long context, larger MoE on laptop-class VRAM, mobile/edge inference). Default behavior unchanged for anyone who doesn't opt in.

@rujialiu

rujialiu commented Jun 2, 2026

> On intent: this is opt-in via the existing --cache-type-k/v knob, alongside q4_0, q8_0, etc. The target users are hardware-constrained deployments where the bpw reduction unlocks a model size or context length that wouldn't otherwise fit (consumer GPUs at long context, larger MoE on laptop-class VRAM, mobile/edge inference). Default behavior unchanged for anyone who doesn't opt in.

Yes, I definitely need more such knobs, for exactly the reason mentioned above (unlocking more model/context combinations).

@Green-Sky

Collaborator

Someone did a little study here https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context

There is a lot of detail in there.

Tests on Qwen 3.6 27B show why TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM.

@alfredomariamilano

> Someone did a little study here https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context
>
> There is a lot of detail in there.
>
> Tests on Qwen 3.6 27B show why TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM.

I read that and tried q5_0 and q4_0 symmetric: token generation collapsed to 18 and 20 tok/s respectively (I made sure they both fit comfortably in VRAM), while turbo4 symmetric (using buun-llama.cpp) gets me 36 tok/s. With that speed difference, I can more than make up for perplexity by running a query again.

@JohannesGaessler

Contributor

As pointed out by other maintainers and also myself: so far no one has presented evidence that would justify the opportunity cost of adding TurboQuant support. Anyways, closing this particular PR since it conflicts with procedure.

@aminya

aminya commented Jun 2, 2026

> As pointed out by other maintainers and also myself: so far no one has presented evidence that would justify the opportunity cost of adding TurboQuant support. Anyways, closing this particular PR since it conflicts with procedure.

On consumer GPUs with Tom's fork, thousands of others and I are able to run Qwen models locally with full context without a loss of accuracy. I could run LLMF on a 10-year-old GTX 1650 with 4 GB VRAM at 50 t/s at full context, something previously unimaginable. If you need actual numbers, it's not hard to run an experiment, but I would be happy to provide benchmarks if it helps.

@ggml-org ggml-org locked as resolved and limited conversation to collaborators Jun 3, 2026
@pwilkin

pwilkin commented Jun 3, 2026

Member

If someone has a feature showing demonstrable gains with some quants compared to equivalent quantization types already in the main repo, then please open a new PR with those changes and with full benchmarks as indicated in the contributing guidelines. Sorry, but we cannot keep up with all the discussions that are based on vague unsubstantiated "feels" and claims without evidence.

For reference: this is a model analysis of the type I expect when adding a new quantization type.
