Finding: asymmetric K/V quantization (3b-K/2b-V) gives 6x better quality than symmetric — architecture dependent

Hey folks,

Sharing some KV cache compression results that might be relevant for llama.cpp's quantized KV cache work.

We've been developing NexusQuant, which uses E8 lattice vector quantization for KV cache compression. A few findings that might be useful:

**Asymmetric K/V matters a lot:** 3-bit keys + 2-bit values gives 6x better quality than symmetric 2-bit on Mistral-7B. This is consistent with what the TurboQuant+ folks found on llama.cpp. Keys are more sensitive because their quantization noise perturbs the attention logits, and softmax then spreads that error across the weights of all positions. Values are only combined linearly by those weights, so their noise stays proportional.
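To make the key-vs-value argument concrete, here is a toy sketch for anyone who wants to play with it. It is not NexusQuant and does not use E8 lattice VQ: it swaps in a plain per-vector uniform scalar quantizer, and the query, keys and values are synthetic Gaussian data rather than real model activations. All function names (`quantize`, `attention`, `output_error`, `make_toy_cache`) are made up for illustration. It only shows how to measure the attention-output error for different K/V bit splits; it makes no claim about reproducing the numbers above.

```python
import math
import random


def quantize(vec, bits):
    """Toy uniform quantizer: snap each entry to one of 2**bits evenly spaced levels."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((x - lo) / step) * step for x in vec]


def attention(q, keys, values):
    """Single-query softmax attention over a list of key/value vectors."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def output_error(q, keys, values, k_bits, v_bits):
    """L2 distance between full-precision and quantized-cache attention outputs."""
    ref = attention(q, keys, values)
    approx = attention(
        q,
        [quantize(k, k_bits) for k in keys],
        [quantize(v, v_bits) for v in values],
    )
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, approx)))


def make_toy_cache(seed=0, n_tokens=64, head_dim=16):
    """Random Gaussian query, keys and values (synthetic, not real activations)."""
    rng = random.Random(seed)
    vec = lambda: [rng.gauss(0.0, 1.0) for _ in range(head_dim)]
    q = vec()
    keys = [vec() for _ in range(n_tokens)]
    values = [vec() for _ in range(n_tokens)]
    return q, keys, values


q, keys, values = make_toy_cache()
for k_bits, v_bits in [(2, 2), (3, 2), (4, 2), (2, 3)]:
    err = output_error(q, keys, values, k_bits, v_bits)
    print(f"K{k_bits}V{v_bits}: output error = {err:.4f}")
```

On random data the gap between splits will not match real models, where key and value distributions are very different, so treat this as a measurement harness rather than evidence.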

**Architecture-dependent behavior:** Qwen2.5-7B breaks completely under any symmetric KV quantization (perplexity blows up). Keeping the first and last 2 layers in FP16 recovers it. Mistral and Phi-3 are fine without this. Worth checking if llama.cpp's KV quant has similar model-dependent behavior.
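The boundary-layer protection can be expressed as a per-layer precision plan. A minimal sketch, assuming a hypothetical helper `layer_precision_plan` (not part of NexusQuant or llama.cpp) and a 28-layer model as the example size; both are illustrative assumptions:

```python
def layer_precision_plan(n_layers, k_bits=3, v_bits=2, protect=2):
    """Return one (k_bits, v_bits) pair per layer.

    The first and last `protect` layers stay at 16 bits (FP16);
    every other layer gets the quantized bit widths.
    """
    plan = []
    for i in range(n_layers):
        is_boundary = i < protect or i >= n_layers - protect
        plan.append((16, 16) if is_boundary else (k_bits, v_bits))
    return plan


def average_bits(plan):
    """Mean bits per cached element, averaged over K and V across all layers."""
    return sum(k + v for k, v in plan) / (2 * len(plan))


# Example: 28 layers is an assumed model depth for illustration.
plan = layer_precision_plan(28)
print(plan[:3], plan[-3:])
print(f"average bits per element: {average_bits(plan):.2f}")
```

The cost of protecting 4 layers shows up in `average_bits`: the FP16 layers dominate the average once the rest of the cache is at 2 to 3 bits, so the protection is not free in memory terms.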

**K4V2 diminishing returns:** Going from 3-bit to 4-bit keys (K4V2 instead of K3V2) barely helps (+0.06pp). The softmax error floor appears to be reached at around 3 bits. This suggests the current Q4_0/Q5_0 KV quant in llama.cpp might be allocating more bits to keys than needed.
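For a sense of what the extra key bit costs, here is back-of-the-envelope arithmetic for the cache payload. It is a rough sketch: it ignores scales, codebooks and any other per-block overhead, and the model shape (32 layers, 8 KV heads, head dimension 128) and the 4096-token context are assumptions picked for illustration. The function names are hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, k_bits, v_bits):
    """Payload size of the KV cache in bytes, ignoring quantization metadata."""
    elements_per_tensor = n_layers * n_kv_heads * head_dim * n_tokens
    return elements_per_tensor * (k_bits + v_bits) / 8


def compression_ratio(k_bits, v_bits, baseline_bits=16):
    """Size of an FP16 K+V cache divided by the quantized K+V cache."""
    return (2 * baseline_bits) / (k_bits + v_bits)


# Assumed shape: 32 layers, 8 KV heads, head_dim 128, 4096 cached tokens.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=4096)
for name, (k_bits, v_bits) in {
    "FP16": (16, 16),
    "K4V2": (4, 2),
    "K3V2": (3, 2),
    "K2V2": (2, 2),
}.items():
    mib = kv_cache_bytes(k_bits=k_bits, v_bits=v_bits, **shape) / 2**20
    ratio = compression_ratio(k_bits, v_bits)
    print(f"{name}: {mib:7.1f} MiB  ({ratio:.2f}x vs FP16)")
```

Under these assumptions K3V2 averages 2.5 bits per element and K4V2 averages 3, so the fourth key bit costs 20% more payload for the small gain reported above.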

Our approach is Python-based, so it isn't directly usable in llama.cpp, but the findings about asymmetric bit allocation and boundary layer sensitivity might inform llama.cpp's KV quantization strategy.

Results + paper: https://github.com/jagmarques/nexusquant
