scripts(dflash): deployment-parity prompt suite + bench harness #98
Merged
added 2 commits
June 8, 2026 12:11
15 prompts across 3 classes (MT-Bench / HumanEval / GSM8K, 5 each) targeting z-lab's
published Gemma τ table (MT 4.23 / HE 8.00 / GSM8K 7.53 at conc=1, BF16, block=16).
Greedy temp=0, --spec-draft-n-max=15, fixed seed for reproducibility.
bench-dflash-parity.sh runs the suite against llama-speculative-simple and
emits per-prompt {tau, n_accept, n_drafted, decode_tps} as JSON. snoop-kube
runs the SAME prompts against vLLM/SGLang with z-lab/gemma-4-31B-it-DFlash on
titan and emits the same shape — we diff cell-by-cell to localize the gap.
tau computed as n_predict / (n_predict - n_accept), the same convention as
z-lab's dflash_generate's acceptance_lengths.
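As a quick sanity check on the formula above, here is a minimal Python sketch of the τ computation. The function names and the `n_predict=64` value in the example are illustrative assumptions, not taken from the harness; only the formula itself comes from this commit message.

```python
def tau(n_predict: int, n_accept: int) -> float:
    """tau = n_predict / (n_predict - n_accept), as stated in the commit message."""
    remaining = n_predict - n_accept  # predicted tokens not covered by accepted drafts
    if remaining <= 0:
        raise ValueError("n_accept must be smaller than n_predict")
    return n_predict / remaining


def draft_accept_rate(n_accept: int, n_drafted: int) -> float:
    """Fraction of drafted tokens that were accepted (0.0 if nothing was drafted)."""
    return n_accept / n_drafted if n_drafted else 0.0


# Hypothetical example: n_predict=64 is an assumed value for illustration only.
print(round(tau(64, 2), 2))
```

With no accepted draft tokens the formula gives τ = 1.0, so values just above 1 mean the drafter contributes almost nothing.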
…otection
- DFLASH_PARITY_NGL / NGLD: override target/drafter -ngl (default 99). Needed to fit larger targets that don't fit a single 24G card; with -ngl 35 the Q8_0 31B target loads alongside the 1.2G Q6_K drafter on one 3090.
- DFLASH_PARITY_TIMEOUT: per-prompt timeout (default 240s). CPU-offload runs for BF16 targets take minutes per prompt at low GPU layer counts.
- DFLASH_PARITY_THREADS: --threads cap. On centurion (etcd HA control-plane member) leave >=2 cores free so long CPU-offload runs don't add fsync latency that wobbles cluster heartbeat/leader-election.
- DFLASH_PARITY_NICE: nice -n prefix (0-19). Sets the bench at minimum priority on shared boxes.
Defaults preserve the prior full-GPU behavior; opt-in only.
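One way to picture how these overrides could compose into a command line is the Python sketch below. It is not the script's actual logic: the helper name `build_command`, the use of coreutils `timeout`, the `-ngld` spelling for the drafter layer count, and the choice to omit `--threads` and `nice` when unset are all assumptions made for illustration. Only the variable names and the documented defaults (99 and 240s) come from the commit message.

```python
import os


def build_command(env=None):
    """Assemble a hypothetical command prefix from the DFLASH_PARITY_* overrides."""
    env = os.environ if env is None else env
    ngl = env.get("DFLASH_PARITY_NGL", "99")            # target GPU layers (documented default)
    ngld = env.get("DFLASH_PARITY_NGLD", "99")          # drafter GPU layers (documented default)
    timeout_s = env.get("DFLASH_PARITY_TIMEOUT", "240")  # per-prompt timeout in seconds
    threads = env.get("DFLASH_PARITY_THREADS")          # opt-in: no cap when unset (assumption)
    nice = env.get("DFLASH_PARITY_NICE")                # opt-in: no nice prefix when unset (assumption)

    cmd = []
    if nice is not None:
        level = int(nice)
        if not 0 <= level <= 19:
            raise ValueError("DFLASH_PARITY_NICE must be in 0-19")
        cmd += ["nice", "-n", str(level)]
    cmd += ["timeout", timeout_s]
    cmd += ["llama-speculative-simple", "-ngl", ngl, "-ngld", ngld]
    if threads is not None:
        cmd += ["--threads", threads]
    return cmd


# With nothing set, the prefix keeps the full-GPU defaults.
print(" ".join(build_command({})))
```

Keeping every override optional is what lets the defaults reproduce the prior full-GPU behavior.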
Summary
Adds a deployment-parity prompt suite + bench harness so we can run the SAME prompt set on (a) our llama.cpp DFlash and (b) z-lab's reference (vLLM/SGLang) on titan and diff τ cell-by-cell.
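The cell-by-cell diff could look roughly like the Python sketch below. It assumes each side's results are a mapping from a prompt id to a record with the `{tau, n_accept, n_drafted, decode_tps}` fields. The keyed-by-prompt-id layout, the `diff_cells` name, and the sample values are hypothetical placeholders, not the harness's real output format or measured numbers.

```python
METRICS = ("tau", "n_accept", "n_drafted", "decode_tps")


def diff_cells(ours: dict, reference: dict) -> list:
    """Return (prompt_id, metric, ours, reference, delta) for every shared cell."""
    rows = []
    # Only prompts present on both sides can be compared.
    for prompt_id in sorted(ours.keys() & reference.keys()):
        for metric in METRICS:
            a = ours[prompt_id].get(metric)
            b = reference[prompt_id].get(metric)
            if a is None or b is None:
                continue  # skip cells missing on either side
            rows.append((prompt_id, metric, a, b, b - a))
    return rows


# Placeholder values for illustration only; these are not measured results.
ours = {"mt-1": {"tau": 1.0, "n_accept": 2}}
reference = {"mt-1": {"tau": 2.0, "n_accept": 5}}
for row in diff_cells(ours, reference):
    print(row)
```

Sorting by prompt id keeps the output stable across runs, which makes it easier to see which prompt class the gap concentrates in.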
Phase 0 verified γ master-sync fixed the n_outputs_max crash; DFlash now runs without crashing on current ht. Phase 1 (PR #97) restored the Round-12 scripts. Phase 2 (this PR) sets up the deployment-parity infrastructure — Markus chose this path over local logit-parity because it validates the reported metrics directly and snoop-kube already has titan + bake patterns from MTP.
Local baseline on gemma-4-31B-it-IQ4_XS target + gemma4-31b-it-dflash-Q6_K drafter currently reports τ ≈ 1.03 on MT-Bench (essentially just the bonus token: ~0.2% draft accept, n_accept=2 / n_drafted=945). z-lab's published Gemma τ at conc=1 / BF16 / block=16: MT 4.23, HE 8.00, GSM8K 7.53. That's the 20× gap Round-12 surfaced; this suite is how we put a number on it on identical inputs.
Files
scripts/dflash-parity-prompts.json: the prompt suite (15 prompts: MT-Bench / HumanEval / GSM8K, 5 each).
scripts/bench-dflash-parity.sh: runs the suite against llama-speculative-simple at greedy/temp=0 with --spec-draft-n-max=15. Emits per-prompt {tau, n_accept, n_drafted, decode_tps} as JSON.
τ = n_predict / (n_predict - n_accept), the same convention z-lab uses for acceptance_lengths in dflash_generate.
Test plan
Next step (Phase 3)
Coordinate with snoop-kube to run vLLM (or SGLang per z-lab's docker README) with target=google/gemma-4-31B-it BF16 + drafter=z-lab/gemma-4-31B-it-DFlash against this prompt JSON. Two scenarios after compare:
… chore/dflash-parity-dump, dump infrastructure already committed).