Finding: asymmetric K/V quantization (3b-K/2b-V) gives 6x better quality than symmetric — architecture dependent #21591

@jagmarques

Hey folks,

Sharing some KV cache compression results that might be relevant for llama.cpp's quantized KV cache work.

We've been developing NexusQuant, which uses E8 lattice vector quantization for KV cache compression. A few findings that might be useful:

Asymmetric K/V matters a lot: 3-bit keys + 2-bit values gives 6x better quality than symmetric 2-bit on Mistral-7B. This is consistent with what the TurboQuant+ folks found on llama.cpp. Keys are more sensitive because their quantization noise perturbs the attention logits, and the softmax then shifts weight across all positions. Values are only combined linearly by the attention weights, so their noise stays proportional in the output.
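To make the keys-vs-values argument easy to poke at, here is a minimal NumPy sketch. It assumes a plain uniform per-row quantizer as a stand-in (this is not the E8 lattice quantizer NexusQuant uses) and random Gaussian Q/K/V, so the numbers it prints will not match the Mistral-7B results. `fake_quant` and `kv_error_table` are hypothetical helper names for illustration only.

```python
import numpy as np

def fake_quant(x, bits):
    """Uniform per-row quantize/dequantize (stand-in, NOT E8 lattice VQ)."""
    levels = 2 ** bits
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / (levels - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    return np.round((x - lo) / scale) * scale + lo

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def rel_error(ref, approx):
    return float(np.linalg.norm(ref - approx) / np.linalg.norm(ref))

def kv_error_table(seq=128, dim=64, seed=0):
    """Relative attention-output error for a few (key_bits, value_bits) pairs."""
    rng = np.random.default_rng(seed)
    q, k, v = (rng.standard_normal((seq, dim)) for _ in range(3))
    ref = attention(q, k, v)
    table = {}
    for kb, vb in [(2, 2), (3, 2), (2, 3), (4, 2)]:
        out = attention(q, fake_quant(k, kb), fake_quant(v, vb))
        table[(kb, vb)] = rel_error(ref, out)
    return table

if __name__ == "__main__":
    for (kb, vb), err in kv_error_table().items():
        print(f"K{kb}V{vb}: relative output error = {err:.4f}")
```

Swapping in real Q/K/V tensors dumped from a model is the interesting experiment; synthetic Gaussian data has near-uniform attention and may understate how much key noise matters.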

Architecture-dependent behavior: Qwen2.5-7B completely breaks with any symmetric KV quantization (perplexity blows up). Keeping the first 2 and last 2 layers at FP16 recovers it. Mistral and Phi-3 are fine without this. Worth checking whether llama.cpp's KV quant shows similar model-dependent behavior.
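The boundary-layer protection can be written down as a per-layer bit plan. This is a rough sketch, not NexusQuant's actual configuration format: `layer_bit_plan` and `average_bits` are hypothetical names, and the layer count of 28 is just an example value. The average ignores any per-block scale or codebook overhead.

```python
def layer_bit_plan(n_layers, key_bits=3, value_bits=2, protect=2):
    """Return one (key_bits, value_bits) pair per layer.

    The first and last `protect` layers stay at 16 bits (FP16);
    every other layer uses the quantized widths.
    """
    plan = []
    for i in range(n_layers):
        is_boundary = i < protect or i >= n_layers - protect
        plan.append((16, 16) if is_boundary else (key_bits, value_bits))
    return plan

def average_bits(plan):
    """Mean bits per cached element across all layers, K and V together."""
    return sum(k + v for k, v in plan) / (2 * len(plan))

if __name__ == "__main__":
    plan = layer_bit_plan(28)  # example layer count, an assumption
    print(plan[:3], plan[-3:])
    print(f"average bits per element: {average_bits(plan):.2f}")
```

The point of computing the average is to see what the protection costs: with few layers, four FP16 layers pull the effective bit rate well above the nominal 2.5 bits of K3V2.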

K4V2 diminishing returns: Going from 3-bit to 4-bit keys barely helps (+0.06pp). The softmax error floor is reached around 3-bit. This suggests the current Q4_0/Q5_0 KV quant in llama.cpp might be allocating more bits to keys than needed.
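For a sense of what the bit budgets mean in memory, here is a back-of-the-envelope sketch. It assumes Mistral-7B's published shape (32 layers, 8 KV heads, head dimension 128) and uses llama.cpp's effective 4.5 bits per element for Q4_0 (18-byte blocks of 32 values). The K3V2 and K4V2 rows use ideal bit widths and ignore whatever metadata overhead NexusQuant carries, so treat them as lower bounds. `kv_bytes_per_token` is a hypothetical helper.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, key_bits, value_bits):
    """KV cache bytes per token for given per-element bit widths."""
    elems_per_side = n_layers * n_kv_heads * head_dim  # K and V have equal size
    return elems_per_side * (key_bits + value_bits) / 8

if __name__ == "__main__":
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)  # Mistral-7B
    configs = {
        "FP16": (16, 16),
        "Q4_0 K+V": (4.5, 4.5),  # effective bits incl. per-block FP16 scale
        "K4V2 (ideal)": (4, 2),
        "K3V2 (ideal)": (3, 2),
    }
    base = kv_bytes_per_token(**shape, key_bits=16, value_bits=16)
    for name, (kb, vb) in configs.items():
        size = kv_bytes_per_token(**shape, key_bits=kb, value_bits=vb)
        print(f"{name:>13}: {size / 1024:6.1f} KiB/token ({base / size:.1f}x vs FP16)")
```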

Our approach is Python-based (not useful for llama.cpp directly), but the findings about asymmetric bit allocation and boundary layer sensitivity might inform llama.cpp's KV quantization strategy.

Results + paper: https://github.com/jagmarques/nexusquant
