scripts(dflash): deployment-parity prompt suite + bench harness by marksverdhei · Pull Request #98 · heiervang-technologies/ht-llama.cpp

marksverdhei · 2026-06-08T10:11:43Z

Summary

Adds a deployment-parity prompt suite + bench harness so we can run the SAME prompt set on (a) our llama.cpp DFlash and (b) z-lab's reference (vLLM/SGLang) on titan and diff τ cell-by-cell.

Phase 0 verified γ master-sync fixed the n_outputs_max crash; DFlash now runs without crashing on current ht. Phase 1 (PR #97) restored the Round-12 scripts. Phase 2 (this PR) sets up the deployment-parity infrastructure — Markus chose this path over local logit-parity because it validates the reported metrics directly and snoop-kube already has titan + bake patterns from MTP.

Local baseline on gemma-4-31B-it-IQ4_XS target + gemma4-31b-it-dflash-Q6_K drafter currently reports τ ≈ 1.03 on MT-Bench (essentially just the bonus token — ~0.2% draft accept, n_accept=2 / n_drafted=945). z-lab's published Gemma τ at conc=1 / BF16 / block=16: MT 4.23, HE 8.00, GSM8K 7.53. That's the 20× gap Round-12 surfaced — this suite is how we put a number on it on identical inputs.

Files

Path	Purpose
`scripts/dflash-parity-prompts.json`	15 prompts × 3 classes (MT-Bench / HumanEval / GSM8K, 5 each). Tracked artifact so both sides run exactly the same set.
`scripts/bench-dflash-parity.sh`	Runs the suite against `llama-speculative-simple` at greedy/temp=0 with `--spec-draft-n-max=15`. Emits per-prompt `{tau, n_accept, n_drafted, decode_tps}` as JSON.

τ = n_predict / (n_predict - n_accept) — same convention z-lab uses for acceptance_lengths in dflash_generate.
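A minimal Python sketch of the τ convention above, as a sanity check on the smoke-test number. The function name `tau` is illustrative, and `n_predict=65` is an assumption: the PR does not state it, but it is the value that reproduces τ=1.0317 with n_accept=2.

```python
def tau(n_predict: int, n_accept: int) -> float:
    """tau = n_predict / (n_predict - n_accept), per the convention in this PR.

    n_predict - n_accept is the number of tokens that were not accepted
    drafts, so tau == 1.0 means no draft token was ever accepted.
    """
    if not 0 <= n_accept < n_predict:
        raise ValueError("need 0 <= n_accept < n_predict")
    return n_predict / (n_predict - n_accept)


# n_predict=65 is assumed, not taken from the PR (see note above).
print(round(tau(65, 2), 4))  # 1.0317
```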

Test plan

Single-prompt smoke parses correctly: τ=1.0317, n_accept=2, n_drafted=945, JSON well-formed
JSON schema validates (15 prompts, 3 classes balanced)
Full local baseline (in flight)
snoop-kube runs same JSON against vLLM/SGLang reference on titan, emits same shape
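For the "15 prompts, 3 classes balanced" check, a sketch of what such a validation could look like. The layout of `scripts/dflash-parity-prompts.json` is not shown in this PR, so the record shape (a list of objects with `class` and `prompt` keys) and the class labels below are hypothetical.

```python
from collections import Counter

# Hypothetical class labels; the real file may name them differently.
EXPECTED_CLASSES = ("mt_bench", "humaneval", "gsm8k")


def check_suite(prompts, per_class=5):
    """Return a list of problems; an empty list means the suite is balanced."""
    problems = []
    counts = Counter(p.get("class") for p in prompts)
    for cls in EXPECTED_CLASSES:
        if counts.get(cls, 0) != per_class:
            problems.append(f"{cls}: expected {per_class}, got {counts.get(cls, 0)}")
    for cls in set(counts) - set(EXPECTED_CLASSES):
        problems.append(f"unexpected class: {cls!r}")
    for i, p in enumerate(prompts):
        if not str(p.get("prompt", "")).strip():
            problems.append(f"prompt {i} is empty")
    return problems


# Synthetic stand-in for the tracked JSON file (assumed shape).
suite = [
    {"class": c, "prompt": f"{c} question {i}"}
    for c in EXPECTED_CLASSES
    for i in range(5)
]
print(check_suite(suite))  # []
```

Returning a list of problems rather than raising on the first one makes it easier to report every imbalance in a single run.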

Next step (Phase 3)

Coordinate with snoop-kube to run vLLM (or SGLang per z-lab's docker README) with target=google/gemma-4-31B-it BF16 + drafter=z-lab/gemma-4-31B-it-DFlash against this prompt JSON. Two possible outcomes of the comparison:

Reference τ ≈ ours → z-lab's published numbers weren't replicable, mark DFlash production-ready as-is.
Reference τ >> ours → confirmed graph bug, unpark Path B (local logit-parity via chore/dflash-parity-dump, dump infrastructure already committed).

15 prompts × 3 classes (MT-Bench / HumanEval / GSM8K, 5 each) targeting z-lab's published Gemma τ table (MT 4.23 / HE 8.00 / GSM8K 7.53 at conc=1, BF16, block=16). Greedy temp=0, --spec-draft-n-max=15, fixed seed for reproducibility. bench-dflash-parity.sh runs the suite against llama-speculative-simple and emits per-prompt {tau, n_accept, n_drafted, decode_tps} as JSON. snoop-kube runs the SAME prompts against vLLM/SGLang with z-lab/gemma-4-31B-it-DFlash on titan and emits the same shape — we diff cell-by-cell to localize the gap. tau computed as n_predict / (n_predict - n_accept), the same convention as z-lab's dflash_generate's acceptance_lengths.
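The cell-by-cell diff could be sketched as below. This assumes both sides emit a list of per-prompt records that share a prompt identifier; the `id` key and the function name `diff_tau` are hypothetical, and the sample values are synthetic placeholders, not measurements.

```python
def diff_tau(ours, reference):
    """Join two result lists on prompt id and report the tau ratio per prompt."""
    ref_by_id = {r["id"]: r for r in reference}
    rows = []
    for o in ours:
        r = ref_by_id.get(o["id"])
        if r is None:
            continue  # prompt missing on the reference side; skipped here
        rows.append({
            "id": o["id"],
            "ours": o["tau"],
            "ref": r["tau"],
            "ratio": r["tau"] / o["tau"],  # > 1 means the reference accepts more
        })
    return rows


# Synthetic values for illustration only.
ours = [{"id": "p0", "tau": 1.0}, {"id": "p1", "tau": 2.0}]
reference = [{"id": "p0", "tau": 2.0}, {"id": "p1", "tau": 2.0}]
for row in diff_tau(ours, reference):
    print(row["id"], row["ratio"])
```

Joining on an identifier rather than list position keeps the diff valid if one side reorders or drops a prompt.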

…otection

- DFLASH_PARITY_NGL / NGLD: override target/drafter -ngl (default 99). Needed to fit larger targets that don't fit a single 24G card; with -ngl 35 the Q8_0 31B target loads alongside the 1.2G Q6_K drafter on one 3090.
- DFLASH_PARITY_TIMEOUT: per-prompt timeout (default 240s). CPU-offload runs for BF16 targets take minutes per prompt at low GPU layer counts.
- DFLASH_PARITY_THREADS: --threads cap. On centurion (etcd HA control-plane member) leave >=2 cores free so long CPU-offload runs don't add fsync latency that wobbles cluster heartbeat/leader-election.
- DFLASH_PARITY_NICE: nice -n prefix (0-19). Sets the bench at minimum priority on shared boxes.

Defaults preserve the prior full-GPU behavior; opt-in only.
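A rough sketch of how these overrides might map onto the bench invocation. The actual harness is a shell script and may be wired differently; the full variable name `DFLASH_PARITY_NGLD` is inferred from the "NGL / NGLD" shorthand, the helper `build_command` is hypothetical, and model paths are omitted.

```python
import os


def build_command(env=None):
    """Return (argv, timeout_seconds) built from the opt-in overrides."""
    env = os.environ if env is None else env
    cmd = [
        "llama-speculative-simple",
        "-ngl", env.get("DFLASH_PARITY_NGL", "99"),    # target GPU layers
        "-ngld", env.get("DFLASH_PARITY_NGLD", "99"),  # drafter GPU layers (assumed name)
        "--temp", "0",
        "--spec-draft-n-max=15",
    ]
    threads = env.get("DFLASH_PARITY_THREADS")
    if threads:
        cmd += ["--threads", threads]
    nice = env.get("DFLASH_PARITY_NICE")
    if nice:
        cmd = ["nice", "-n", nice] + cmd  # lower priority on shared boxes
    timeout = int(env.get("DFLASH_PARITY_TIMEOUT", "240"))
    return cmd, timeout


cmd, timeout = build_command({"DFLASH_PARITY_NGL": "35", "DFLASH_PARITY_NICE": "19"})
print(" ".join(cmd), f"(timeout {timeout}s)")
```

With no variables set, the sketch falls back to the full-GPU defaults, matching the opt-in behavior described above. The returned timeout could be passed to `subprocess.run(cmd, timeout=timeout)`.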

marksverdhei added 2 commits June 8, 2026 12:11

marksverdhei merged commit 1219284 into ht Jun 12, 2026

marksverdhei deleted the chore/dflash-parity-suite branch June 12, 2026 18:36

marksverdhei mentioned this pull request Jun 12, 2026

docs(readme): complete HT Fork Changes inventory with per-change justifications #106

Merged

marksverdhei merged 2 commits into ht from chore/dflash-parity-suite
