TBQ KV-cache (tbq3_0 / tbq4_0): 65-73x perplexity regression — recommend mark experimental #70

@marksverdhei

Finding

Measured perplexity on Qwen3.5-0.8B-BF16 / wikitext-2 / ctx=512 / CPU build:

| cache-type-k/v | PPL     | vs f16 baseline |
|----------------|---------|-----------------|
| f16            | 19.08   | baseline        |
| q8_0           | 19.08   | lossless        |
| tbq3_0         | 1252.30 | 65x worse       |
| tbq4_0         | 1393.00 | 73x worse       |

TBQ KV-cache output is effectively unusable: perplexity jumps from 19.08 (f16) to over 1250 with either TBQ type. The encoding scheme doesn't preserve enough information to function as a KV cache.
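The "65x" and "73x" figures are simply each perplexity divided by the f16 baseline. A minimal sketch of that arithmetic, using only the numbers from the table above (the function and variable names are illustrative, not from the codebase):

```python
# Perplexity values copied from the table in this issue.
BASELINE_PPL = 19.08  # f16 KV cache
measured = {"q8_0": 19.08, "tbq3_0": 1252.30, "tbq4_0": 1393.00}


def ppl_ratio(ppl: float, baseline: float = BASELINE_PPL) -> float:
    """Return how many times higher (worse) a perplexity is than the baseline."""
    return ppl / baseline


ratios = {name: ppl_ratio(ppl) for name, ppl in measured.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the f16 perplexity")
```

The headline figures in the title are these ratios truncated to whole numbers.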

Likely root cause

TBQ was designed as a weight quantization (rotated-domain 3-bit codebook calibrated for static weight distributions). Repurposing it as a KV cache encoding puts it in a fundamentally different statistical regime: K/V tensors produced during inference likely don't share the distributional properties TBQ's codebook was tuned for. If so, the codebook can't faithfully encode the values being stored.
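The general failure mode can be illustrated with a toy experiment. The sketch below is not TBQ: it fits a hypothetical 3-bit scalar codebook to Gaussian "weight-like" samples, then applies it to data with occasional large outliers, standing in (as an assumption) for activation-like K/V values. All names and distributions here are made up for illustration.

```python
import random
import statistics


def fit_codebook(samples, bits=3):
    """Place 2**bits levels at evenly spaced quantiles of the calibration data."""
    ordered = sorted(samples)
    levels = 2 ** bits
    return [ordered[int((i + 0.5) / levels * len(ordered))] for i in range(levels)]


def quantize(x, codebook):
    """Map x to the nearest codebook level."""
    return min(codebook, key=lambda level: abs(level - x))


def relative_error(samples, codebook):
    """Mean squared quantization error divided by mean squared signal."""
    err = statistics.fmean((x - quantize(x, codebook)) ** 2 for x in samples)
    power = statistics.fmean(x * x for x in samples)
    return err / power


rng = random.Random(0)

# Calibrate on unit-variance Gaussian data (the "static weights" stand-in).
calibration = [rng.gauss(0.0, 1.0) for _ in range(20000)]
codebook = fit_codebook(calibration)

# Matched data: same distribution the codebook was fitted to.
matched = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# Mismatched data (assumption): mostly unit scale, with 5% large outliers.
mismatched = [
    rng.gauss(0.0, 20.0) if rng.random() < 0.05 else rng.gauss(0.0, 1.0)
    for _ in range(5000)
]

err_matched = relative_error(matched, codebook)
err_mismatched = relative_error(mismatched, codebook)
print(f"matched:    {err_matched:.3f}")
print(f"mismatched: {err_mismatched:.3f}")
```

In this toy setup the outliers are clipped to the largest codebook level, so most of the signal energy is lost. Whether real K/V tensors behave this way under TBQ would need to be confirmed by measuring their actual distributions.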

Subordinate finding (ggml-org#125)

The earlier-reported TBQ KV-cache CPU segfault (ggml-org#125) is sporadic and allocation-pattern-dependent. The gdb backtrace points at llama_vocab::find_bpe_rank during common_tokenize, which suggests that TBQ KV-cache allocation heap-stomps adjacent vocab data. An ASan build runs to completion (its red-zoned allocator dodges the stomp), but the quality regression remains. The segfault is real but secondary to the quality issue.

Cluster audit (snoop-kube 2026-06-04)

Zero tbq* KV-cache deployments across titan / centurion / lithium / embedding / llama router ConfigMaps. Every host uses q8_0 or q4_0. titan's gemma-31b preset comment explicitly justifies q8_0: "acceptable, KL 0.108, ~3.5x better than the 26B-A4B MoE". There is no production demand for TBQ KV-cache.

DFlash deliberately also uses q8_0

scripts/bench-dflash.sh (PR #65) uses --cache-type-k q8_0 --cache-type-v q8_0 for the DFlash speculative-decoding harness. No DFlash path requires TBQ KV-cache.

Recommendation

Three options for Markus's decision:

  1. Mark experimental in CLI help (cheap, reversible) — add a * marker + footnote to the --cache-type-k/v allowed values list, referencing this issue. Code stays. Default behavior unchanged.
  2. Remove from CLI flag table (mid-cost) — code stays but the option isn't surfaced. Anyone setting it explicitly via API gets the broken behavior. Less invasive than full removal.
  3. Full rip (high-cost, irreversible per PR but the rewritten ht history makes this clean) — remove TBQ KV-cache code paths entirely. Frees the cache-type dispatch from dead branches.

Recommendation: Option 1, since it's the cheapest reversible step and Markus may have a roadmap intent I'm not aware of. PR coming separately for option 1; happy to do 2 or 3 instead if preferred.
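As a rough illustration of what option 1 would look like to a user, here is a minimal sketch of a help string with the * marker and footnote. It is written in Python for brevity; the real change would live in the project's CLI argument code, and the list of types, the function name, and the footnote wording are all assumptions.

```python
# Hypothetical subset of cache types: only the names mentioned in this issue.
CACHE_TYPES = ["f16", "q8_0", "q4_0", "tbq3_0", "tbq4_0"]
EXPERIMENTAL = {"tbq3_0", "tbq4_0"}


def allowed_values_help(types, experimental, issue="#70"):
    """Build the 'allowed values' help text, starring experimental types."""
    labels = [t + "*" if t in experimental else t for t in types]
    lines = ["allowed values: " + ", ".join(labels)]
    # Only emit the footnote when at least one listed type is starred.
    if any(t in experimental for t in types):
        lines.append(
            f"* experimental: severe perplexity regression, see issue {issue}"
        )
    return "\n".join(lines)


help_text = allowed_values_help(CACHE_TYPES, EXPERIMENTAL)
print(help_text)
```

Keeping the experimental set separate from the list of types means the marker can be dropped later by editing one place, which matches the "cheap, reversible" intent of option 1.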
