scripts(dflash): switch default bench target to Q8_0 + --target flag #65

Merged
marksverdhei merged 1 commit into ht from chore/bench-dflash-q8-target on Jun 4, 2026

@marksverdhei

Why

Per Markus 2026-06-04: DFlash quality measurement should use a Q8_0 target rather than Q4_K_M. The Q4_K_M target introduces enough quantization noise that it confounds DFlash's own accept-rate signal — we want a higher-quality reference for the speculative-decoding evaluation.

Changes

  • Default TARGET changed from gemma-4-31B-it-Q4_K_M.gguf to gemma-4-31B-it-Q8_0.gguf.
  • Added --target PATH flag for explicit per-run override.
  • Added DFLASH_BENCH_TARGET and DFLASH_BENCH_DRAFTER_DIR env vars (env-first, then CLI flag, then default).
  • Updated VRAM math in the comment block:
    • Q4_K_M ~22 GB total (single 24 GB card)
    • Q8_0 ~38 GB total (titan A100 80 GB only)
    • BF16 ~67 GB total (titan A100 80 GB only)
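
The VRAM figures above can be turned into a quick pre-flight check. The following is a minimal sketch, not code from `scripts/bench-dflash.sh`: the function names `target_vram_gb` and `fits_card` are hypothetical, and the totals are the approximate numbers listed in this description.

```shell
# Illustrative helpers; names are hypothetical, not part of bench-dflash.sh.
# Approximate total VRAM (GB) per target quantization, taken from the list above.
target_vram_gb() {
    case "$1" in
        Q4_K_M) echo 22 ;;
        Q8_0)   echo 38 ;;
        BF16)   echo 67 ;;
        *)      echo "unknown quant: $1" >&2; return 1 ;;
    esac
}

# fits_card QUANT CARD_GB
# Exit status 0 if the approximate total fits on a card with CARD_GB of VRAM.
fits_card() {
    local need
    need="$(target_vram_gb "$1")" || return 2
    [ "$need" -le "$2" ]
}
```

For example, `fits_card Q8_0 24 || echo "Q8_0 needs the 80 GB card"` prints the message, while `fits_card Q4_K_M 24` succeeds silently.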

Verified

  • bash -n scripts/bench-dflash.sh — syntax OK
  • --help renders the updated docblock correctly
  • No other scripts depend on the old default (grepped Q4_K_M.gguf across the tree)

Follow-up

Task #110 already updated to reflect this. Next concrete step is the titan re-bake against b0daec55b (Task #109), then this bench script can run with its new default.

Per Markus 2026-06-04: DFlash quality measurement should use a Q8_0
target rather than Q4_K_M, since Q4_K_M introduces enough target-side
quantization noise to confound DFlash's own accept-rate signal. Q8_0
fits in 38 GB total, well within titan A100 80 GB.

* Default `TARGET` is now `gemma-4-31B-it-Q8_0.gguf`. Override via
  `--target PATH` or `DFLASH_BENCH_TARGET` env var.
* Also added `DFLASH_BENCH_DRAFTER_DIR` env var for consistency.
* Comment block documents VRAM math for Q4_K_M / Q8_0 / BF16 targets
  so future runs can pick the right card.
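
One way the override chain could look is sketched below. This is an assumption-laden illustration, not the script's actual code: `resolve_target` is a hypothetical helper, the default is shown as a bare file name because the real model directory is not stated here, and the precedence follows a literal reading of the PR description ("env-first, then CLI flag, then default"). If the script instead lets `--target` win over the env var, the two inner expansions would swap.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not the contents of scripts/bench-dflash.sh.
DEFAULT_TARGET="gemma-4-31B-it-Q8_0.gguf"

# resolve_target [--target PATH] [other args...]
# Prints the target path, assuming: env var first, then CLI flag, then default.
resolve_target() {
    local cli_target=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --target)
                [ "$#" -ge 2 ] || { echo "--target needs a PATH" >&2; return 2; }
                cli_target="$2"
                shift 2
                ;;
            *)
                shift   # ignore arguments this sketch does not model
                ;;
        esac
    done
    printf '%s\n' "${DFLASH_BENCH_TARGET:-${cli_target:-$DEFAULT_TARGET}}"
}
```

With no env var and no flag, `resolve_target` prints the Q8_0 default; `resolve_target --target /tmp/a.gguf` prints `/tmp/a.gguf` unless `DFLASH_BENCH_TARGET` is set.
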
@marksverdhei marksverdhei merged commit 09b2124 into ht Jun 4, 2026
@marksverdhei marksverdhei deleted the chore/bench-dflash-q8-target branch June 4, 2026 17:52
marksverdhei added a commit that referenced this pull request Jun 12, 2026
… (#71)

Measured perplexity on Qwen3.5-0.8B-BF16 / wikitext-2 / ctx=512:

| cache-type | PPL     | vs f16    |
|------------|---------|-----------|
| f16        | 19.08   | baseline  |
| q8_0       | 19.08   | lossless  |
| tbq3_0     | 1252.30 | 65x worse |
| tbq4_0     | 1393.00 | 73x worse |
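
The "vs f16" column can be rechecked from the PPL values alone. A small sketch, assuming only the four numbers in the table (the helper name `ppl_ratios` is hypothetical):

```shell
# Recompute each cache type's perplexity relative to the f16 baseline (19.08).
ppl_ratios() {
    awk -v base=19.08 '{ printf "%s %.1fx\n", $1, $2 / base }' <<'EOF'
f16 19.08
q8_0 19.08
tbq3_0 1252.30
tbq4_0 1393.00
EOF
}

ppl_ratios
```

This gives roughly 65.6x for tbq3_0 and 73.0x for tbq4_0, in line with the rounded figures in the table.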

TBQ KV-cache produces near-random output. Likely root cause is statistical:
TBQ's rotated-domain codebook was calibrated for weight distributions, not
the K/V tensor distributions seen during inference. The encoding scheme
itself cannot faithfully represent KV values.

Snoop-kube's cluster audit confirms zero deployments use tbq* KV-cache
(every host uses q8_0 or q4_0). DFlash also defaults to q8_0 (PR #65).
No production consumer exists.

This PR adds a one-line experimental note to the --cache-type-k/v and
--cache-type-k-draft/v-draft help text, referencing issue #70 for the
full data + recommendation. Code path stays in place — Markus may have
roadmap intent I'm not aware of; this just stops anyone reading --help
from assuming tbq* is a usable choice without checking.

Follow-ups if Markus prefers full removal:
* drop tbq3_0/tbq4_0 from common/arg.cpp's kv_cache_types list
* keep the ftypes (TBQ weight quantization is separate from KV use)
* close issues ggml-org#124 + ggml-org#125 as wont-fix