Per-layer, per-channel adaptive K-cache quantization for Qwen2.5 with asymmetric K/V quantization (TurboQuant + RotateKV) #21297
andrei-ace started this conversation in Show and tell.
Just to avoid mixing topics: the other branch/results being discussed are for the more standard TurboQuant direction. This branch is a different experiment: it combines ideas from TurboQuant and RotateKV, and explores an adaptive per-layer / per-group K-cache path instead of a single fixed low-bit recipe.
It also works with asymmetric K/V cache types. In practice I focused mainly on K, because that is where I found the most exploitable structure: strong per-layer differences, consistent outlier concentration after permutation, and a clear benefit from choosing different quantization layouts for different layers/groups. For the main quality measurements below I keep V in f16 on purpose, to isolate the effect of compressing K. I also did generation checks with asymmetric K/V setups, and those worked too.
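To make the "outlier concentration" criterion concrete, here is a minimal standalone sketch (my illustration, not the actual `llama-tq-calibrate` code) of the kind of per-layer statistic described above: the fraction of total per-channel K variance captured by the 32 highest-variance channels out of 128.

```python
import random

def outlier_concentration(k_rows, top=32):
    """Fraction of total per-channel variance captured by the `top`
    highest-variance channels. k_rows: one K vector per cached position."""
    d = len(k_rows[0])
    n = len(k_rows)
    means = [sum(row[c] for row in k_rows) / n for c in range(d)]
    var = [sum((row[c] - means[c]) ** 2 for row in k_rows) / n
           for c in range(d)]
    return sum(sorted(var, reverse=True)[:top]) / sum(var)

# Synthetic "layer": 32 outlier channels at 10x the scale of the rest,
# mimicking the concentrated structure seen after permutation.
random.seed(0)
rows = [[random.gauss(0, 10.0 if c < 32 else 1.0) for c in range(128)]
        for _ in range(256)]
conc = outlier_concentration(rows)
print(f"concentration = {conc:.2f}")  # well above the 0.70 threshold
```

A layer scoring above the 70% threshold on a statistic like this is a candidate for the split (outlier-aware) layout; below it, a uniform layout.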
The main question I wanted to answer first was not "is this faster than `q8_0`?", but whether a calibration-driven, per-layer K-cache layout can match `q8_0` / f16 quality while spending noticeably fewer bits per value on K. So far, the answer looks like yes.
Branch/fork here:
https://github.com/andrei-ace/llama.cpp/tree/feature/turboquant-rotatekv
Results doc here:
https://github.com/andrei-ace/llama.cpp/blob/feature/turboquant-rotatekv/docs/turboquant-flex-results.md
I also built a calibration + experimentation tool (`llama-tq-calibrate`) together with the CPU / Metal K-cache types needed to try different T-bit combinations, split vs non-split layouts, thresholds, and QJL choices on real models.

Compared with `q8_0`: the point of this approach is not that `q8_0` is bad. `q8_0` is actually a very strong baseline: simple, robust, and close to f16. The point is that `q8_0` treats all layers/channels in a much more uniform way, while this branch tries to exploit the fact that the Qwen2.5 K-cache structure is not uniform. So this is basically a calibration-driven attempt to beat the "one cache quant fits all layers" approach.
What the winning configs actually look like
Qwen2.5-1.5B winner (6.21 bpv for K, V=f16)
- 4 layers with strong outlier structure (>70% of K variance concentrated in 32 of 128 channels)
- 24 layers with more uniform variance (<70% concentration)
This mix gives the final 6.21 bpv K-only winner.
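The headline bpv is just the layer-weighted average of the per-layer layouts. The exact bit widths are not spelled out here (they are in the results doc), but as a purely hypothetical illustration, the following assignment reproduces the 6.21 figure: split layers spend 12 bits on the 32 outlier channels and 6 bits on the other 96, uniform layers spend 6 bits everywhere.

```python
# Hypothetical bit assignment that reproduces the 6.21 bpv average for
# Qwen2.5-1.5B (28 layers); the real configs live in the results doc.
CHANNELS = 128

def layer_bpv(assignments):
    """assignments: list of (num_channels, bits) pairs for one layer."""
    return sum(n * b for n, b in assignments) / CHANNELS

split_bpv = layer_bpv([(32, 12), (96, 6)])  # outlier-aware split layout
uniform_bpv = layer_bpv([(128, 6)])         # plain low-bit layout

avg = (4 * split_bpv + 24 * uniform_bpv) / 28  # 4 split + 24 uniform layers
print(f"split={split_bpv} bpv, uniform={uniform_bpv} bpv, K avg={avg:.2f} bpv")
```

The same arithmetic explains why the 7B lands higher (6.41 bpv): more layers cross the outlier threshold, so more of them pay for the expensive split layout.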
At ctx=512 / 20 chunks:
So for this model, the adaptive K path is not just "usable": it is essentially on par with `q8_0` / f16 in the reported run while using much less space for K.

Calibration also seems robust across source quantizations.
Qwen2.5-7B winner (6.41 bpv for K, V=f16)
- 7 layers with strong outliers (>70% concentration)
- 21 layers with more uniform variance
This mix gives the final 6.41 bpv K-only winner.
At 20 chunks:
So here the adaptive K path is still very close to `q8_0`, but not as tightly optimized yet as the 1.5B. All reported 7B results fall within each other's 95% confidence intervals.
In a multilingual sanity check, the Korean transliteration / Pauli-style test matched the f16 output exactly on the tested 6-name prompt.
Why the configs differ per model
The 7B has more extreme early layers (0–3 are all >70%) and needs higher precision on the outlier group to preserve multilingual quality.
The 1.5B has fewer extreme layers, and at short context it does not seem to need QJL on the non-split layers.
So even within the same model family, the good config is clearly not “one fixed recipe.”
QJL observations
At short context (<8k), QJL makes almost no measurable difference — the 1-bit sign correction is mostly within PPL noise.
At 16k+ context, QJL starts showing a small but consistent benefit as inner-product bias accumulates across many attention positions.
For example, at 32k context on the 1.5B:
So the effect is tiny, but consistently in favor of QJL.
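To see where that bias comes from and why a 1-bit sign correction helps, here is a self-contained sketch of the QJL (Quantized Johnson–Lindenstrauss) idea — my illustration of the estimator, not this branch's actual kernels: each key stores only the sign of its random projections (1 bit each) plus its norm, and the query's full projections recover the inner product up to the sign-estimator constant sqrt(pi/2).

```python
import math
import random

random.seed(1)
D, M = 64, 4096  # head dimension, number of random projections

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

# Build a unit query q and a unit key k with known overlap <q,k> = 0.8.
q = unit([random.gauss(0, 1) for _ in range(D)])
r = [random.gauss(0, 1) for _ in range(D)]
proj = dot(r, q)
r = unit([x - proj * y for x, y in zip(r, q)])  # orthogonal to q
k = [0.8 * a + 0.6 * b for a, b in zip(q, r)]

# QJL-style estimate: only sign(<s, k>) is kept per projection (1 bit),
# plus ||k||; the query side uses the full projection <s, q>.
acc = 0.0
for _ in range(M):
    s = [random.gauss(0, 1) for _ in range(D)]
    acc += math.copysign(1.0, dot(s, k)) * dot(s, q)
est = math.sqrt(math.pi / 2) * math.sqrt(dot(k, k)) * acc / M

print(f"true <q,k> = 0.800, estimate = {est:.3f}")
```

Each individual estimate is noisy but unbiased, which matches the observation above: a per-position bias correction buys almost nothing at short context, and only starts to pay off once errors accumulate over many attention positions.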
For the 7B winner, I include QJL everywhere as a long-context safety margin.
This costs about 0.2–0.4 extra bpv, but seems like a reasonable tradeoff for long-context stability.
There was one important exception: on the 7B, 9-bit hi + QJL corrupted multilingual output (mixed Korean/Latin characters). This looks like a precision issue, not a code bug: at 9-bit MSE the QJL perturbation is large enough to shift rare-token attention, while at 10-bit the residual is smaller and QJL works cleanly. That is why the final 7B config uses 10-bit hi specifically.
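The 9-bit vs 10-bit behavior is consistent with basic uniform-quantization arithmetic: each extra bit roughly halves the step size, so the residual that QJL perturbs is about half as large at 10 bits. A generic sketch with a hypothetical symmetric mid-tread quantizer (not the branch's kernels):

```python
import random

random.seed(2)

def rms_residual(values, bits, vmax=4.0):
    """RMS error of a symmetric uniform quantizer with 2**bits levels
    over [-vmax, vmax]. Purely illustrative, not the branch's layout."""
    step = 2 * vmax / (2 ** bits - 1)
    err2 = 0.0
    for v in values:
        v = max(-vmax, min(vmax, v))       # clip to the quantizer range
        err2 += (v - round(v / step) * step) ** 2
    return (err2 / len(values)) ** 0.5

values = [random.gauss(0, 1) for _ in range(100_000)]
r9 = rms_residual(values, 9)
r10 = rms_residual(values, 10)
print(f"9-bit RMS = {r9:.5f}, 10-bit RMS = {r10:.5f}, ratio = {r9 / r10:.2f}")
```

So the residual a 1-bit sign correction has to act on is about twice as large at 9 bits, which is plausibly enough to cross the threshold where it starts shifting rare-token attention.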
Context length sweep
For the 1.5B winner, the quality stays close to both f16 and q8_0 across a wide context range:
A nice detail here is that at 32k, both adaptive K configs still beat `q8_0`, and QJL gives a small additional gain.

Search/calibration observations
A few things I found interesting during calibration/search:
There is also an actual workflow now, not just a one-off experiment.