TBQ KV-cache (tbq3_0 / tbq4_0): 65-73x perplexity regression — recommend mark experimental

## Finding

Measured perplexity on Qwen3.5-0.8B-BF16 / wikitext-2 / ctx=512 / CPU build:

| cache-type-k/v | PPL    | vs f16 baseline |
|----------------|--------|-----------------|
| `f16`          | 19.08  | baseline        |
| `q8_0`         | 19.08  | lossless        |
| `tbq3_0`       | 1252.30 | **65x worse**  |
| `tbq4_0`       | 1393.00 | **73x worse**  |

TBQ KV-cache produces near-random output. The encoding scheme doesn't preserve enough information to function as a KV cache.

## Likely root cause

TBQ was designed as a *weight* quantization (rotated-domain 3-bit codebook calibrated for static weight distributions). Repurposing it as a *KV cache* encoding hits a fundamentally different statistical regime — K/V tensors during inference don't share the distributional properties TBQ's codebook was tuned for. The codebook can't faithfully encode the values being stored.

## Subordinate finding (#125)

The earlier-reported TBQ KV-cache CPU SEGFAULT (#125) is sporadic and allocation-pattern-dependent. gdb backtrace points at `llama_vocab::find_bpe_rank` during `common_tokenize` — i.e. TBQ KV-cache allocation heap-stomps adjacent vocab data. ASan-built run completes (red-zoned allocator dodges the stomp) but the quality regression remains. The segfault is real but secondary to the quality issue.

## Cluster audit (snoop-kube 2026-06-04)

Zero `tbq*` KV-cache deployments across titan / centurion / lithium / embedding / llama router ConfigMaps. Every host uses `q8_0` or `q4_0`. titan's gemma-31b preset comment explicitly justifies q8_0: \"acceptable, KL 0.108, ~3.5x better than the 26B-A4B MoE\". There is no production demand for TBQ KV-cache.

## DFlash deliberately also uses q8_0

`scripts/bench-dflash.sh` (PR #65) uses `--cache-type-k q8_0 --cache-type-v q8_0` for the DFlash speculative-decoding harness. No DFlash path requires TBQ KV-cache.

## Recommendation

Three options for Markus's decision:

1. **Mark experimental in CLI help** (cheap, reversible) — add a `*` marker + footnote to the `--cache-type-k/v` allowed values list, referencing this issue. Code stays. Default behavior unchanged.
2. **Remove from CLI flag table** (mid-cost) — code stays but the option isn't surfaced. Anyone setting it explicitly via API gets the broken behavior. Less invasive than full removal.
3. **Full rip** (high-cost, irreversible per PR but the rewritten ht history makes this clean) — remove TBQ KV-cache code paths entirely. Frees the cache-type dispatch from dead branches.

Recommendation: **Option 1**, since it's the cheapest reversible step and Markus may have a roadmap intent I'm not aware of. PR coming separately for option 1; happy to do 2 or 3 instead if preferred.

## Related

- #121 (closed by PR #63) — test-quantize-fns arm64 stack-smash, tolerance bumps
- #124 (open) — test-quantize-perf windows-x64 SEGFAULT
- #125 (open) — TBQ KV-cache CPU first-decode SEGFAULT (now diagnosed: heap-stomp of vocab)
- [Cluster audit context in memory note `project_tbq_kv_cache_status.md`]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TBQ KV-cache (tbq3_0 / tbq4_0): 65-73x perplexity regression — recommend mark experimental #70

Finding

Likely root cause

Subordinate finding (ggml-org#125)

Cluster audit (snoop-kube 2026-06-04)

DFlash deliberately also uses q8_0

Recommendation

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

cache-type-k/v	PPL	vs f16 baseline
`f16`	19.08	baseline
`q8_0`	19.08	lossless
`tbq3_0`	1252.30	65x worse
`tbq4_0`	1393.00	73x worse

TBQ KV-cache (tbq3_0 / tbq4_0): 65-73x perplexity regression — recommend mark experimental #70

Description

Finding

Likely root cause

Subordinate finding (ggml-org#125)

Cluster audit (snoop-kube 2026-06-04)

DFlash deliberately also uses q8_0

Recommendation

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions