
scripts(dflash): deployment-parity prompt suite + bench harness #98

Merged: marksverdhei merged 2 commits into ht from chore/dflash-parity-suite on Jun 12, 2026


@marksverdhei


Summary

Adds a deployment-parity prompt suite + bench harness so we can run the SAME prompt set on (a) our llama.cpp DFlash and (b) z-lab's reference (vLLM/SGLang) on titan and diff τ cell-by-cell.

Phase 0 verified γ master-sync fixed the n_outputs_max crash; DFlash now runs without crashing on current ht. Phase 1 (PR #97) restored the Round-12 scripts. Phase 2 (this PR) sets up the deployment-parity infrastructure — Markus chose this path over local logit-parity because it validates the reported metrics directly and snoop-kube already has titan + bake patterns from MTP.

Local baseline on gemma-4-31B-it-IQ4_XS target + gemma4-31b-it-dflash-Q6_K drafter currently reports τ ≈ 1.03 on MT-Bench (essentially just the bonus token — ~0.2% draft accept, n_accept=2 / n_drafted=945). z-lab's published Gemma τ at conc=1 / BF16 / block=16: MT 4.23, HE 8.00, GSM8K 7.53. That's the 20× gap Round-12 surfaced — this suite is how we put a number on it on identical inputs.

Files

| Path | Purpose |
| --- | --- |
| scripts/dflash-parity-prompts.json | 15 prompts × 3 classes (MT-Bench / HumanEval / GSM8K, 5 each). Tracked artifact so both sides run exactly the same set. |
| scripts/bench-dflash-parity.sh | Runs the suite against llama-speculative-simple at greedy/temp=0 with --spec-draft-n-max=15. Emits per-prompt {tau, n_accept, n_drafted, decode_tps} as JSON. |

τ = n_predict / (n_predict - n_accept) — same convention z-lab uses for acceptance_lengths in dflash_generate.
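One way to read the formula: if every verification step emits its accepted draft tokens plus one token from the target, then n_predict - n_accept counts verification steps and τ is the mean number of tokens emitted per step. A minimal Python sketch under that reading follows. The helper names are illustrative, and n_predict = 65 is an assumption (the PR does not state it), chosen because it reproduces the smoke-test value.

```python
def tau(n_predict: int, n_accept: int) -> float:
    """Mean tokens emitted per verification step: n_predict / (n_predict - n_accept)."""
    n_steps = n_predict - n_accept
    if n_steps <= 0:
        raise ValueError("n_predict must be greater than n_accept")
    return n_predict / n_steps


def draft_accept_rate(n_accept: int, n_drafted: int) -> float:
    """Fraction of drafted tokens that the target accepted."""
    return n_accept / n_drafted if n_drafted else 0.0


# ASSUMPTION: n_predict = 65 is not given in the PR; n_accept and n_drafted are.
smoke_tau = tau(n_predict=65, n_accept=2)
smoke_rate = draft_accept_rate(n_accept=2, n_drafted=945)
print(f"tau={smoke_tau:.4f} accept={smoke_rate:.2%}")  # tau=1.0317 accept=0.21%
```

Under the same assumption, 65 - 2 = 63 verification steps at --spec-draft-n-max=15 gives 63 × 15 = 945 drafted tokens, which matches the reported n_drafted.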

Test plan

  • Single-prompt smoke parses correctly: τ=1.0317, n_accept=2, n_drafted=945, JSON well-formed
  • JSON schema validates (15 prompts, 3 classes balanced)
  • Full local baseline (in flight)
  • snoop-kube runs same JSON against vLLM/SGLang reference on titan, emits same shape
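The "15 prompts, 3 classes balanced" check could look roughly like the sketch below. The PR does not show the JSON layout, so the field names (`id`, `class`, `prompt`) and the class labels are hypothetical, and the suite is built in memory rather than read from scripts/dflash-parity-prompts.json.

```python
from collections import Counter

# ASSUMPTION: hypothetical class labels; the real file may spell them differently.
CLASSES = ("mt_bench", "humaneval", "gsm8k")


def validate_suite(prompts, per_class=5):
    """Return a list of problems; an empty list means the suite is balanced."""
    problems = []
    counts = Counter(p.get("class") for p in prompts)
    for cls in CLASSES:
        if counts.get(cls, 0) != per_class:
            problems.append(f"{cls}: expected {per_class}, got {counts.get(cls, 0)}")
    unknown = set(counts) - set(CLASSES)
    if unknown:
        problems.append(f"unknown classes: {sorted(map(str, unknown))}")
    ids = [p.get("id") for p in prompts]
    if len(set(ids)) != len(ids):
        problems.append("duplicate prompt ids")
    if any(not p.get("prompt") for p in prompts):
        problems.append("empty prompt text")
    return problems


# Placeholder suite with the expected 3 x 5 shape.
sample = [
    {"id": f"{cls}-{i}", "class": cls, "prompt": "placeholder"}
    for cls in CLASSES
    for i in range(5)
]
print(validate_suite(sample))       # no problems for a balanced suite
print(validate_suite(sample[:-1]))  # reports the short class
```

Running the same check on both sides before benchmarking would catch a drifted or truncated copy of the prompt file early.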

Next step (Phase 3)

Coordinate with snoop-kube to run vLLM (or SGLang, per z-lab's docker README) with target=google/gemma-4-31B-it BF16 + drafter=z-lab/gemma-4-31B-it-DFlash against this prompt JSON. Two possible outcomes once the results are compared:

  1. Reference τ ≈ ours → z-lab's published numbers weren't replicable, mark DFlash production-ready as-is.
  2. Reference τ >> ours → confirmed graph bug, unpark Path B (local logit-parity via chore/dflash-parity-dump, dump infrastructure already committed).
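The comparison itself can be sketched as a join on prompt id followed by a per-class ratio. This is a hypothetical helper, not part of the PR: it assumes each result record carries `id` and `class` fields next to the documented `tau`, and the values below are synthetic placeholders rather than measurements.

```python
from statistics import mean


def compare(ours, reference):
    """Join the two result lists on prompt id and compute reference/ours tau ratios."""
    ref_by_id = {r["id"]: r for r in reference}
    rows = []
    for o in ours:
        r = ref_by_id.get(o["id"])
        if r is None:
            continue  # prompt missing on the reference side; skip it
        rows.append({
            "id": o["id"],
            "class": o["class"],
            "ours": o["tau"],
            "ref": r["tau"],
            "ratio": r["tau"] / o["tau"],
        })
    return rows


def per_class_ratio(rows):
    """Average the per-prompt ratios within each prompt class."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["class"], []).append(row["ratio"])
    return {cls: mean(vals) for cls, vals in grouped.items()}


# Synthetic values for illustration only.
ours = [{"id": "a", "class": "x", "tau": 1.0}, {"id": "b", "class": "x", "tau": 1.0}]
reference = [{"id": "a", "tau": 2.0}, {"id": "b", "tau": 4.0}]
print(per_class_ratio(compare(ours, reference)))
```

A ratio near 1 across all classes corresponds to scenario 1; a ratio well above 1 corresponds to scenario 2. Where to draw that line is a judgment call that this sketch leaves open.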

marksverdhei added 2 commits June 8, 2026 12:11
15 prompts × 3 classes (MT-Bench / HumanEval / GSM8K, 5 each) targeting z-lab's
published Gemma τ table (MT 4.23 / HE 8.00 / GSM8K 7.53 at conc=1, BF16, block=16).
Greedy temp=0, --spec-draft-n-max=15, fixed seed for reproducibility.

bench-dflash-parity.sh runs the suite against llama-speculative-simple and
emits per-prompt {tau, n_accept, n_drafted, decode_tps} as JSON. snoop-kube
runs the SAME prompts against vLLM/SGLang with z-lab/gemma-4-31B-it-DFlash on
titan and emits the same shape — we diff cell-by-cell to localize the gap.

tau computed as n_predict / (n_predict - n_accept), the same convention as
z-lab's dflash_generate's acceptance_lengths.
…otection

- DFLASH_PARITY_NGL / NGLD: override target/drafter -ngl (default 99).
  Needed for larger targets that don't fit on a single 24G card; with -ngl 35
  the Q8_0 31B target loads alongside the 1.2G Q6_K drafter on one 3090.
- DFLASH_PARITY_TIMEOUT: per-prompt timeout (default 240s). CPU-offload runs
  for BF16 targets take minutes per prompt at low GPU layer counts.
- DFLASH_PARITY_THREADS: --threads cap. On centurion (etcd HA control-plane
  member) leave >=2 cores free so long CPU-offload runs don't add fsync
  latency that wobbles cluster heartbeat/leader-election.
- DFLASH_PARITY_NICE: nice -n prefix (0-19). Lowers the bench's scheduling
  priority on shared boxes; 19 is the lowest.

Defaults preserve the prior full-GPU behavior; opt-in only.
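How these overrides might translate into a command line is sketched below in Python. It is not the script's actual code (bench-dflash-parity.sh is a shell script and its body is not shown here). The assumptions are that the drafter layer count is passed as `-ngld`, that the timeout is applied through coreutils `timeout`, and that unset THREADS/NICE mean "do not pass the option".

```python
def build_command(env, binary="llama-speculative-simple"):
    """Build an argv list from DFLASH_PARITY_* overrides (hypothetical mapping)."""
    ngl = env.get("DFLASH_PARITY_NGL", "99")           # target GPU layers
    ngld = env.get("DFLASH_PARITY_NGLD", "99")         # drafter GPU layers (assumed flag: -ngld)
    timeout = env.get("DFLASH_PARITY_TIMEOUT", "240")  # seconds per prompt

    cmd = []
    nice = env.get("DFLASH_PARITY_NICE")
    if nice is not None:
        if not 0 <= int(nice) <= 19:
            raise ValueError("DFLASH_PARITY_NICE must be in 0-19")
        cmd += ["nice", "-n", nice]  # opt-in: lower scheduling priority
    cmd += ["timeout", timeout, binary, "-ngl", ngl, "-ngld", ngld]

    threads = env.get("DFLASH_PARITY_THREADS")
    if threads:
        cmd += ["--threads", threads]  # opt-in: cap CPU threads
    return cmd


# With no overrides, only the defaults appear.
print(build_command({}))
# A partial-offload run on a shared box.
print(build_command({
    "DFLASH_PARITY_NGL": "35",
    "DFLASH_PARITY_THREADS": "6",
    "DFLASH_PARITY_NICE": "19",
}))
```

Placing `nice` outermost means `timeout` and the benchmark both inherit the lowered priority. Model paths and sampling flags are omitted here because the PR does not list them.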