Finding
Measured perplexity on Qwen3.5-0.8B-BF16 / wikitext-2 / ctx=512 / CPU build:
| cache-type-k/v | PPL | vs f16 baseline |
|---|---|---|
| f16 | 19.08 | baseline |
| q8_0 | 19.08 | lossless |
| tbq3_0 | 1252.30 | 65x worse |
| tbq4_0 | 1393.00 | 73x worse |
TBQ KV-cache produces near-random output. The encoding scheme doesn't preserve enough information to function as a KV cache.
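For reference, the "vs f16 baseline" figures are just the ratio of each perplexity to the f16 one. A minimal sketch that recomputes them from the measured numbers (the helper name `ratio_vs_baseline` is made up for this example):

```python
# Perplexities copied from the measurements in this issue.
ppl = {"f16": 19.08, "q8_0": 19.08, "tbq3_0": 1252.30, "tbq4_0": 1393.00}


def ratio_vs_baseline(results, baseline="f16"):
    """Return each perplexity divided by the baseline perplexity."""
    base = results[baseline]
    return {name: value / base for name, value in results.items()}


ratios = ratio_vs_baseline(ppl)
for name, ratio in ratios.items():
    print(f"{name:8s} PPL {ppl[name]:9.2f}  {ratio:6.1f}x baseline")
```

A perplexity above 1000 means the model is, on average, about as uncertain as a uniform guess over more than a thousand tokens, which is why the output reads as near-random.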
Likely root cause
TBQ was designed as a weight quantization (rotated-domain 3-bit codebook calibrated for static weight distributions). Repurposing it as a KV cache encoding hits a fundamentally different statistical regime — K/V tensors during inference don't share the distributional properties TBQ's codebook was tuned for. The codebook can't faithfully encode the values being stored.
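The mechanism can be illustrated with a toy experiment. This is not TBQ and not its codebook: it is a generic 8-level (3-bit) scalar codebook fitted to Gaussian "weight-like" samples, then applied to a hypothetical "KV-like" distribution that has the same bulk plus 1% large outliers. Both distributions and the 30x outlier scale are assumptions chosen only to show the effect of a calibration mismatch.

```python
import random


def fit_codebook(samples, levels=8):
    """Place one level at the median of each equal-probability bin."""
    ordered = sorted(samples)
    n = len(ordered)
    return [ordered[(2 * i + 1) * n // (2 * levels)] for i in range(levels)]


def quantize(x, codebook):
    """Map x to the nearest codebook level."""
    return min(codebook, key=lambda level: abs(level - x))


def relative_error(values, codebook):
    """Squared quantization error divided by the signal energy."""
    err = sum((v - quantize(v, codebook)) ** 2 for v in values)
    energy = sum(v * v for v in values)
    return err / energy


rng = random.Random(0)

# Calibration data: unit Gaussian, standing in for static weights.
weights_like = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Hypothetical runtime data: same bulk, but 1% of values are 30x larger.
kv_like = [
    rng.gauss(0.0, 1.0) * (30.0 if rng.random() < 0.01 else 1.0)
    for _ in range(20000)
]

codebook = fit_codebook(weights_like)
err_in = relative_error(weights_like, codebook)
err_out = relative_error(kv_like, codebook)

print(f"relative error on calibration distribution: {err_in:.3f}")
print(f"relative error on mismatched distribution:  {err_out:.3f}")
```

In this toy setup the outliers are clipped to the outermost codebook level, so most of the signal energy is lost even though 99% of the values are encoded as well as before. Whether real K/V tensors deviate from TBQ's calibration in this particular way has not been verified here.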
Subordinate finding (ggml-org#125)
The earlier-reported TBQ KV-cache CPU segfault (ggml-org#125) is sporadic and depends on the allocation pattern. The gdb backtrace points at llama_vocab::find_bpe_rank during common_tokenize, i.e. the TBQ KV-cache allocation heap-stomps adjacent vocab data. An ASan build runs to completion (its red-zoned allocator dodges the stomp), but the quality regression remains. The segfault is real but secondary to the quality issue.
Cluster audit (snoop-kube 2026-06-04)
Zero tbq* KV-cache deployments across titan / centurion / lithium / embedding / llama router ConfigMaps. Every host uses q8_0 or q4_0. titan's gemma-31b preset comment explicitly justifies q8_0: "acceptable, KL 0.108, ~3.5x better than the 26B-A4B MoE". There is no production demand for TBQ KV-cache.
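An audit like this can be repeated with a small script over the exported ConfigMap text. The sketch below is an assumption about how one might do it, not the tool that produced the result above: the host names and argument strings in `sample` are hypothetical, and only the long `--cache-type-k` / `--cache-type-v` flags written on a single line are matched.

```python
import re

# Matches "--cache-type-k q8_0" and "--cache-type-v=q8_0" style arguments.
CACHE_FLAG = re.compile(r"--cache-type-([kv])[ =]+([A-Za-z0-9_]+)")


def cache_types(text):
    """Return the set of (k_or_v, type) pairs set in a config text."""
    return {(m.group(1), m.group(2)) for m in CACHE_FLAG.finditer(text)}


def tbq_hits(configs):
    """Map config name to the sorted tbq* cache types it sets, if any."""
    hits = {}
    for name, text in configs.items():
        found = sorted({t for _, t in cache_types(text) if t.startswith("tbq")})
        if found:
            hits[name] = found
    return hits


# Hypothetical ConfigMap excerpts, for illustration only.
sample = {
    "host-a": "args: --cache-type-k q8_0 --cache-type-v q8_0",
    "host-b": "args: --cache-type-k=q4_0 --cache-type-v=q4_0",
    "host-c": "args: --cache-type-k tbq3_0 --cache-type-v q8_0",
}

print(tbq_hits(sample))
```

An empty result from `tbq_hits` over the real ConfigMaps would correspond to the "zero deployments" finding. Flags split across YAML list items would need a looser pattern.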
DFlash deliberately also uses q8_0
scripts/bench-dflash.sh (PR #65) uses --cache-type-k q8_0 --cache-type-v q8_0 for the DFlash speculative-decoding harness. No DFlash path requires TBQ KV-cache.
Recommendation
Three options for Markus's decision:
- Mark experimental in CLI help (cheap, reversible): add a `*` marker + footnote to the --cache-type-k/v allowed values list, referencing this issue. Code stays. Default behavior unchanged.
- Remove from CLI flag table (mid-cost): code stays but the option isn't surfaced. Anyone setting it explicitly via API gets the broken behavior. Less invasive than full removal.
- Full rip (high-cost, irreversible per PR but the rewritten ht history makes this clean): remove TBQ KV-cache code paths entirely. Frees the cache-type dispatch from dead branches.
Recommendation: Option 1, since it's the cheapest reversible step and Markus may have a roadmap intent I'm not aware of. PR coming separately for option 1; happy to do 2 or 3 instead if preferred.
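To make option 1 concrete, here is a rough sketch of what the marked help text could look like. It only illustrates the output shape. The real change would live in the CLI's own help code, and the function name, the value list, and the footnote wording below are placeholders rather than the project's actual strings.

```python
def format_cache_types(allowed, experimental, note):
    """Render the allowed values, starring experimental ones, plus a footnote."""
    marked = [f"{name}*" if name in experimental else name for name in allowed]
    lines = ["allowed values: " + ", ".join(marked)]
    if any(name in experimental for name in allowed):
        lines.append("* " + note)
    return "\n".join(lines)


help_text = format_cache_types(
    allowed=["f16", "q8_0", "q4_0", "tbq3_0", "tbq4_0"],  # illustrative subset
    experimental={"tbq3_0", "tbq4_0"},
    note="experimental as a KV-cache type; see this issue for measured perplexity",
)
print(help_text)
```

Because only the help string changes, the flag parsing and the default cache type stay untouched, which is what keeps this option reversible.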
Related
project_tbq_kv_cache_status.md