Per-layer, per-channel adaptive K-cache quantization for Qwen2.5 with asymmetric K/V quantization (TurboQuant + RotateKV) #21297
andrei-ace started this conversation in Show and tell.
Just to avoid mixing topics: the other branch/results being discussed are for the more standard TurboQuant direction. This branch is a different experiment: it combines ideas from TurboQuant and RotateKV, and explores an adaptive per-layer / per-group K-cache path instead of a single fixed low-bit recipe.
It also works with asymmetric K/V cache types. In practice I focused mainly on K, because that is where I found the most exploitable structure: strong per-layer differences, consistent outlier concentration after permutation, and a clear benefit from choosing different quantization layouts for different layers/groups. For the main quality measurements below I keep V in f16 on purpose, to isolate the effect of compressing K. I also did generation checks with asymmetric K/V setups, and those worked too.
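To make the "outlier concentration" criterion concrete, here is a minimal standalone sketch (my illustration, not the actual `llama-tq-calibrate` code) of the kind of per-layer statistic described above: the fraction of total per-channel K variance captured by the 32 highest-variance channels out of 128.

```python
import random

def outlier_concentration(k_rows, top=32):
    """Fraction of total per-channel variance captured by the `top`
    highest-variance channels. k_rows: one K vector per cached position."""
    d = len(k_rows[0])
    n = len(k_rows)
    means = [sum(row[c] for row in k_rows) / n for c in range(d)]
    var = [sum((row[c] - means[c]) ** 2 for row in k_rows) / n
           for c in range(d)]
    return sum(sorted(var, reverse=True)[:top]) / sum(var)

# Synthetic "layer": 32 outlier channels at 10x the scale of the rest,
# mimicking the concentrated structure seen after permutation.
random.seed(0)
rows = [[random.gauss(0, 10.0 if c < 32 else 1.0) for c in range(128)]
        for _ in range(256)]
conc = outlier_concentration(rows)
print(f"concentration = {conc:.2f}")  # well above the 0.70 threshold
```

A layer scoring above the 70% threshold on a statistic like this is a candidate for the split (outlier-aware) layout; below it, a uniform layout.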
The main question I wanted to answer first was not "is this faster than `q8_0`?", but whether a calibration-driven, per-layer K-cache layout can match `q8_0` / f16 quality while spending noticeably fewer bits per value on K. So far, the answer looks like yes.
Branch/fork here:
https://github.com/andrei-ace/llama.cpp/tree/feature/turboquant-rotatekv
Results doc here:
https://github.com/andrei-ace/llama.cpp/blob/feature/turboquant-rotatekv/docs/turboquant-flex-results.md
I also built a calibration + experimentation tool (`llama-tq-calibrate`) together with the CPU / Metal K-cache types needed to try different T-bit combinations, split vs non-split layouts, thresholds, and QJL choices on real models.

Compared with `q8_0`: the point of this approach is not that `q8_0` is bad. `q8_0` is actually a very strong baseline: simple, robust, and close to f16. The point is that `q8_0` treats all layers/channels in a much more uniform way, while this branch tries to exploit the fact that the Qwen2.5 K-cache structure is not uniform. So this is basically a calibration-driven attempt to beat the "one cache quant fits all layers" approach.
What the winning configs actually look like
Qwen2.5-1.5B winner (6.21 bpv for K, V=f16)
- 4 layers with strong outlier structure (>70% of K variance concentrated in 32 of 128 channels)
- 24 layers with more uniform variance (<70% concentration)
This mix gives the final 6.21 bpv K-only winner.
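The headline bpv is just the layer-weighted average of the per-layer layouts. The exact bit widths are not spelled out here (they are in the results doc), but as a purely hypothetical illustration, the following assignment reproduces the 6.21 figure: split layers spend 12 bits on the 32 outlier channels and 6 bits on the other 96, uniform layers spend 6 bits everywhere.

```python
# Hypothetical bit assignment that reproduces the 6.21 bpv average for
# Qwen2.5-1.5B (28 layers); the real configs live in the results doc.
CHANNELS = 128

def layer_bpv(assignments):
    """assignments: list of (num_channels, bits) pairs for one layer."""
    return sum(n * b for n, b in assignments) / CHANNELS

split_bpv = layer_bpv([(32, 12), (96, 6)])  # outlier-aware split layout
uniform_bpv = layer_bpv([(128, 6)])         # plain low-bit layout

avg = (4 * split_bpv + 24 * uniform_bpv) / 28  # 4 split + 24 uniform layers
print(f"split={split_bpv} bpv, uniform={uniform_bpv} bpv, K avg={avg:.2f} bpv")
```

The same arithmetic explains why the 7B lands higher (6.41 bpv): more layers cross the outlier threshold, so more of them pay for the expensive split layout.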
At ctx=512 / 20 chunks:
So for this model, the adaptive K path is not just "usable": it is essentially on par with `q8_0` / f16 in the reported run while using much less space for K.

Calibration also seems robust across source quantizations.
Qwen2.5-7B winner (6.41 bpv for K, V=f16)
- 7 layers with strong outliers (>70% concentration)
- 21 layers with more uniform variance
This mix gives the final 6.41 bpv K-only winner.
At 20 chunks:
So here the adaptive K path is still very close to `q8_0`, but not as tightly optimized yet as the 1.5B. All reported 7B results fall within each other's 95% confidence intervals.
In a multilingual sanity check, the Korean transliteration / Pauli-style test matched the f16 output exactly on the tested 6-name prompt.
Why the configs differ per model
The 7B has more extreme early layers (0–3 are all >70%) and needs higher precision on the outlier group to preserve multilingual quality.
The 1.5B has fewer extreme layers, and at short context it does not seem to need QJL on the non-split layers.
So even within the same model family, the good config is clearly not “one fixed recipe.”
QJL observations
At short context (<8k), QJL makes almost no measurable difference — the 1-bit sign correction is mostly within PPL noise.
At 16k+ context, QJL starts showing a small but consistent benefit as inner-product bias accumulates across many attention positions.
For example, at 32k context on the 1.5B:
So the effect is tiny, but consistently in favor of QJL.
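To see where that bias comes from and why a 1-bit sign correction helps, here is a self-contained sketch of the QJL (Quantized Johnson–Lindenstrauss) idea — my illustration of the estimator, not this branch's actual kernels: each key stores only the sign of its random projections (1 bit each) plus its norm, and the query's full projections recover the inner product up to the sign-estimator constant sqrt(pi/2).

```python
import math
import random

random.seed(1)
D, M = 64, 4096  # head dimension, number of random projections

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

# Build a unit query q and a unit key k with known overlap <q,k> = 0.8.
q = unit([random.gauss(0, 1) for _ in range(D)])
r = [random.gauss(0, 1) for _ in range(D)]
proj = dot(r, q)
r = unit([x - proj * y for x, y in zip(r, q)])  # orthogonal to q
k = [0.8 * a + 0.6 * b for a, b in zip(q, r)]

# QJL-style estimate: only sign(<s, k>) is kept per projection (1 bit),
# plus ||k||; the query side uses the full projection <s, q>.
acc = 0.0
for _ in range(M):
    s = [random.gauss(0, 1) for _ in range(D)]
    acc += math.copysign(1.0, dot(s, k)) * dot(s, q)
est = math.sqrt(math.pi / 2) * math.sqrt(dot(k, k)) * acc / M

print(f"true <q,k> = 0.800, estimate = {est:.3f}")
```

Each individual estimate is noisy but unbiased, which matches the observation above: a per-position bias correction buys almost nothing at short context, and only starts to pay off once errors accumulate over many attention positions.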
For the 7B winner, I include QJL everywhere as a long-context safety margin.
This costs about 0.2–0.4 extra bpv, but seems like a reasonable tradeoff for long-context stability.
There was one important exception: on the 7B, 9-bit hi + QJL corrupted multilingual output (mixed Korean/Latin characters). This looks like a precision issue, not a code bug: at 9-bit MSE the QJL perturbation is large enough to shift rare-token attention, while at 10-bit the residual is smaller and QJL works cleanly. That is why the final 7B config uses 10-bit hi specifically.
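The 9-bit vs 10-bit behavior is consistent with basic uniform-quantization arithmetic: each extra bit roughly halves the step size, so the residual that QJL perturbs is about half as large at 10 bits. A generic sketch with a hypothetical symmetric mid-tread quantizer (not the branch's kernels):

```python
import random

random.seed(2)

def rms_residual(values, bits, vmax=4.0):
    """RMS error of a symmetric uniform quantizer with 2**bits levels
    over [-vmax, vmax]. Purely illustrative, not the branch's layout."""
    step = 2 * vmax / (2 ** bits - 1)
    err2 = 0.0
    for v in values:
        v = max(-vmax, min(vmax, v))       # clip to the quantizer range
        err2 += (v - round(v / step) * step) ** 2
    return (err2 / len(values)) ** 0.5

values = [random.gauss(0, 1) for _ in range(100_000)]
r9 = rms_residual(values, 9)
r10 = rms_residual(values, 10)
print(f"9-bit RMS = {r9:.5f}, 10-bit RMS = {r10:.5f}, ratio = {r9 / r10:.2f}")
```

So the residual a 1-bit sign correction has to act on is about twice as large at 9 bits, which is plausibly enough to cross the threshold where it starts shifting rare-token attention.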
Context length sweep
For the 1.5B winner, the quality stays close to both f16 and q8_0 across a wide context range:
A nice detail here is that at 32k, both adaptive K configs still beat `q8_0`, and QJL gives a small additional gain.

Search/calibration observations
A few things I found interesting during calibration/search:
There is also an actual workflow now, not just a one-off experiment.