feat(dflash): complete DFlash speculative decoding integration #53
Closed
marksverdhei wants to merge 42 commits into
- Add LLM_ARCH_DFLASH arch enum, KV keys, tensor enums
- Add DFlash hparams (target_layer_ids, block_size, mask_token_id)
- Add DFlash cparams (dflash_extract_enabled)
- Add llama_dflash struct + graph context fields
- Add DFlash C API (llama_set_dflash, llama_get_dflash_target_features, etc.)
- Add DFlash extraction pipeline in llama-context (set_dflash, extract_dflash_features, graph_get_cb hook)
- Add DFlash graph type handling, position tensor fill, decoder context init
- Add llama_model_dflash class (load_arch_hparams/tensors, build_arch_graph)
- Add llm_build_dflash_encode/decode graph builders
- Add DFlash model file (src/models/dflash.cpp)
- Add --dflash CLI arg, DFLASH speculative type
- Wire DFlash in server-context, speculative-simple, convert_hf_to_gguf
Source: upstream PR ggml-org#22105 by ruixiang63 (ggml-org/llama.cpp)
Adapted for post-Spring-Cleaning refactor master.
Still WIP: llama-model-loader arch registration and some model wiring pending.
Resetting llama-context.cpp/h, llama-model.cpp/h, llama-graph.cpp, models.h, model-saver.cpp to master. The squash-merge generated too many stale patch residues. Will reapply DFlash additions cleanly.
…odel-only)
Adds DFlash speculative decoding library infrastructure:
- LLM_ARCH_DFLASH arch enum + KV keys + tensor enums
- DFlash hparams (target_layer_ids, block_size, mask_token_id)
- DFlash cparams (dflash_extract_enabled)
- llama_dflash struct (extraction layer indices, target features)
- llama_model_dflash class (load_arch_hparams/tensors, build_arch_graph)
- llm_build_dflash_encode/decode graph builders
- DFlash C API (llama_set_dflash, etc.)
- --dflash CLI arg, COMMON_SPECULATIVE_TYPE_DFLASH
Extraction pipeline (llama-context.cpp) TBD: needs fresh hooks written for post-Spring-Cleaning master.
- Architecture: LLM_ARCH_DFLASH enum + KV params (target_layer_ids,
block_size, mask_token_id) with GGUF model loading
- Model: DFlash encoder (fc fusion + rms norm) and single-layer
cross-attention decoder with noise-token input
- Graph: llama_dflash struct routed through llm_graph_params/context,
with target_model pointer in llama_context_params
- Extraction pipeline: graph_get_cb intercepts dflash_extract_N names,
extract_dflash_features reads tagged tensors post-compute
- API: llama_set_dflash, llama_get_dflash_target_features,
llama_set_dflash_accumulated_target_ctx, plus model helpers
- Speculative decoder: common_speculative_impl_dflash with full
encode->accumulate->decode->sample draft loop in common/speculative.cpp
- Model injection: llama model graph builder tags hidden states at
DFlash target layers via cb('dflash_extract_N')
Compiles clean: llama-server, llama-cli, all libraries.
- Removed 9 .orig/.rej patch backup files accidentally committed
- Added tests/test-dflash.cpp with 8 tests covering:
  * Arch registration and name lookup
  * Tensor info maps (layer/op assignment)
  * hparams defaults for DFlash fields
  * llama_dflash struct lifecycle and clear() semantics
  * llm_graph_params dflash pointer wiring
  * COMMON_SPECULATIVE_TYPE_DFLASH enum placement
  * llama_context_params target_model field
  * DFlash API symbol link-time resolution
- Registered test-dflash in tests/CMakeLists.txt
- All 8 tests pass
…hrough
POST /v1/chat/completions with {"model": "any"} now resolves to whichever
model the router currently has resident in memory — preferring LOADED over
SLEEPING, and within each tier the most-recently-used. Returns
"no model is currently resident in memory" (HTTP 400) if nothing is loaded.
Lets clients reach the active model without having to track which one the
router decided to keep resident. The sentinel "any" is reserved at the
router lookup layer; a user model literally named "any" would be unreachable
via that string (still reachable via aliases).
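The resolution rule above (LOADED before SLEEPING, most recently used within a tier) can be sketched as follows. This is a minimal illustration, not the router's code: `model_entry`, `model_state`, and `resolve_any` are hypothetical names, and only the `last_used` timestamp is taken from the commit text.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical mirror of the router's per-model bookkeeping.
enum class model_state { UNLOADED, SLEEPING, LOADED };

struct model_entry {
    std::string id;
    model_state state;
    int64_t     last_used_ms; // 0 if never used since the router started
};

// Tier rank: LOADED beats SLEEPING; anything else is not resident.
static int tier(model_state s) {
    switch (s) {
        case model_state::LOADED:   return 2;
        case model_state::SLEEPING: return 1;
        default:                    return 0;
    }
}

// Resolve the "any" sentinel: best tier first, most recently used within a tier.
// Returns std::nullopt when nothing is resident (the case answered with HTTP 400).
std::optional<std::string> resolve_any(const std::vector<model_entry> & models) {
    const model_entry * best = nullptr;
    for (const auto & m : models) {
        if (tier(m.state) == 0) {
            continue; // not resident, never a candidate
        }
        const bool better_tier = best && tier(m.state) > tier(best->state);
        const bool same_tier_newer = best && tier(m.state) == tier(best->state) &&
                                     m.last_used_ms > best->last_used_ms;
        if (!best || better_tier || same_tier_newer) {
            best = &m;
        }
    }
    if (!best) {
        return std::nullopt;
    }
    return best->id;
}
```

Note that a SLEEPING model never wins over a LOADED one here, even if it was used more recently, which is how the tier preference is read from the description.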
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
End-to-end DFlash speculative decoding compiles, loads, and runs against gemma-4-31B-it-Q4_K_M + Anbeeld/gemma-4-31B-it-DFlash-GGUF (Q4_K_M). But the acceptance rate is 4.9% (9 of 183 drafted tokens), and DFlash is ~2x slower than baseline (14.9 vs 29.1 t/s on the centurion 3090, FA on, temp 0). The integration is mechanically correct; the perf claim does not land yet.
Ruled out:
- Arch loading: llm_arch_from_string strips -draft, model-loader passes arch_name_override so dflash-draft.* KV lookups resolve.
- Tokenizer mismatch: vocab + merges sha256 byte-identical between target and drafter, only EOS designation differs (target=106 end_of_turn, drafter=1 eos; end-of-stream only, doesn't affect mid-stream verification).
- Drafter graph structure: cross-attention over target_hidden with pos_ctx + kq_mask filled in llm_graph_input_dflash::set_input.
- Feature extraction hooks: they fire on the right layers (target_layer_ids = [1,12,23,35,46,57] for the 60-layer target); gemma4.cpp + llama.cpp both tag post-l_out hidden state.
Prime suspect (logged in HANDOFF.md): in common_speculative_impl_dflash::draft() the call to llama_get_dflash_target_features(ctx_tgt) returns features for the last ubatch (K+1 tokens during verification) but we slice the first n_new (typically 1-2 committed). If ubatch position order doesn't line up with commit order, the drafter gets fed features for discarded tokens, a cascading misalignment that would produce ~5% acceptance even with a perfectly aligned drafter (per snoop-kube's analysis).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rate hunt
Adds two env knobs to common_speculative_impl_dflash::draft() so the next
bench iteration can isolate the 4.9% accept root cause without rebuilding:
LLAMA_DFLASH_DEBUG=1
Print per-iteration (n_new, n, dflash_n_past_old, features[0..4]).
Confirms slice/alignment hypothesis by exposing actual values.
LLAMA_DFLASH_CTX_WINDOW=<n>
Cap accumulated context to N tokens (default 512). Set to 0 to
disable truncation and feed full accumulated context to drafter.
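A minimal sketch of how two such knobs could be read, assuming plain `std::getenv` parsing. The helper names `env_flag`, `env_ctx_window`, and `tokens_to_keep` are hypothetical, and the handling of unparsable values (fall back to the default) is an assumption rather than something the commit states.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: true when the variable is set to a non-empty value other than "0".
bool env_flag(const char * name) {
    const char * v = std::getenv(name);
    return v != nullptr && *v != '\0' && std::string(v) != "0";
}

// Hypothetical helper: context window cap. Falls back to def (512, the documented
// default) when the variable is unset, empty, negative, or not a number.
int env_ctx_window(const char * name, int def = 512) {
    const char * v = std::getenv(name);
    if (v == nullptr || *v == '\0') {
        return def;
    }
    char * end = nullptr;
    const long n = std::strtol(v, &end, 10);
    if (end == v || *end != '\0' || n < 0) {
        return def;
    }
    return (int) n;
}

// Number of accumulated context tokens handed to the drafter under a given cap.
// A cap of 0 disables truncation, as described for LLAMA_DFLASH_CTX_WINDOW.
int tokens_to_keep(int n_accumulated, int ctx_window) {
    if (ctx_window == 0) {
        return n_accumulated;
    }
    return n_accumulated < ctx_window ? n_accumulated : ctx_window;
}
```

Reading the environment at call time rather than at build time is what lets a bench iteration flip these without a rebuild.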
HANDOFF expanded with structural-consistency findings vs the dflash-pr
POC (POC expects ffn_norm tensors; our GGUF has post_attention_norm, so
the POC graph isn't directly portable) and a prioritized list of next
experiments (graph-reuse disable via LLAMA_GRAPH_REUSE_DISABLE=1, ctx
truncation disable, BF16 drafter, extraction-point swap).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Added LLAMA_DFLASH_EXTRACT=early env toggle to gemma4.cpp that captures
target hidden states BEFORE the per-layer-embedding processing and
out_scale (vs default: after l_out). Late extraction empirically wins
(Q6_K: late 10.7% vs early 6.2%; Q4_K_M: late 4.9% vs early 5.6%) so the
default stays — but the knob is now wired for future ablations.
Round-2/3 bench findings in HANDOFF.md:
- Graph reuse INNOCENT: LLAMA_GRAPH_REUSE_DISABLE=1 gives identical
4.92% accept, ruling out cross-iteration tensor corruption.
- ctx_window truncation has minor effect (+25% accept when disabled).
- Drafter quant has bounded effect (~4-10% accept range across
Q4_K_M, Q5_K_M, Q6_K, Q8_0, BF16).
- Extraction point not the major lever.
Remaining hypotheses: per-layer renorm of fused_target inside the
dflash decoder graph, or RoPE position scheme not matching training.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…worse
Added env-gated experiment LLAMA_DFLASH_PER_LAYER_RENORM in src/models/dflash.cpp that re-applies layer.attn_norm to fused_target before each layer's wk/wv projection.
3x3 A/B on Q6_K with centurion-llm scaled to 0:
renorm OFF: 5.56% / 6.22% / 6.22% (mean 6.00%)
renorm ON: 2.02% / 4.92% / 4.30% (mean 3.75%)
Per-layer renorm degrades accept by ~2.25pp. Drafter was NOT trained with per-layer ctx renorm; current single-norm-at-entry implementation (matching the POC design) is correct. Env-gate stays in (defaulted off) for future ablation symmetry.
Also surfaces a separate finding: Q6_K baseline accept varies ±2pp run-to-run on same seed/prompt/code. The HANDOFF Round-3 table value of 10.69% for Q6_K appears to be an outlier or stale-code state; reproducible range under current HEAD is 4.3-6.2%. Possible causes: sampler RNG, KV cache state leakage, CUDA reduction non-determinism.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rence
Full implementation audit against authoritative sources (upstream ggml-org/llama.cpp PR ggml-org#22105, z-lab/dflash PyTorch reference, vLLM qwen3_dflash module, Anbeeld drafter GGUF metadata dump).
Findings (full table in HANDOFF.md):
Matches reference cleanly: fc + hidden_norm once outside loop; K/V concat order [ctx, noise]; attn_norm on noise only; q_norm/k_norm placement; V not normed/RoPE'd; block content [id_last, MASK×15]; sample positions 1..15; attn_post_norm as FFN-input norm (Gemma-specific); SwiGLU FFN; lm_head and tok_embd bound to target at llama-context.cpp:376-377; non-causal attention.
Divergences worth noting (none explain the 4-8% accept):
- SWA pattern [T,T,T,T,F] in drafter GGUF, not implemented in our decoder graph. Irrelevant at ctx_window=512.
- Local position scheme vs reference's monotonic absolute positions. Equivalent under RoPE-relative attention.
Round-5 bench: extraction-point ablation (LLAMA_DFLASH_EXTRACT=upstream added in gemma4.cpp, tags inpL at layer start to match upstream PR's +1-shift convention):
mode=late (current default): 8.51% / 7.64% / 4.49% mean 6.88%
mode=upstream (PR convention): 4.49% / 7.64% / 5.23% mean 5.79%
Means overlap within one sigma; exact accepted/drafted counts repeat across modes (11/144 and 7/156), so the bench has ~3 RNG-driven states and extraction-point is not the decision boundary. Late wins by a hair, weakly suggesting Anbeeld's Gemma converter did NOT apply the +1 shift.
The gemma4.cpp refactor consolidates the three extraction modes (late default / early / upstream) behind a single dflash_mode enum to avoid double-tagging when multiple modes' static lambdas would otherwise both fire.
Conclusion: no single structural bug at the llama.cpp level explains the 4-8% ceiling vs published 30-50%. Top remaining suspect is GGUF conversion fidelity vs Anbeeld safetensors (requires HF download + reference inference setup, parked).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per snoop-kube's audit review: SWA absence is latent at ctx_window=512 but would matter the moment LLAMA_DFLASH_CTX_WINDOW exceeds 2048 — clarify the regime where the divergence becomes real, so future ctx-expansion work flags it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ftcap
Root-cause find from vLLM PR #41703 description: "DFlash shares target embeddings. For Gemma4 targets, the draft path now applies the target embedding normalization (sqrt(hidden_size)) and passes final_logit_softcapping into LogitsProcessor."
DFlash drafters share tok_embd with the target model. Gemma4's standard input pipeline scales token embeddings by sqrt(n_embd) (~73x at hidden 5376) before they hit the first decoder layer. The drafter was trained against those scaled embeddings, so feeding raw embeddings is ~73x off. Same story for the final logit softcap (30.0): the drafter was trained against the softcapped distribution, so its lm_head output (which shares the target's lm_head) needs the same transform applied.
llama-context.cpp cross-binding: when target arch == LLM_ARCH_GEMMA4, inherit f_embedding_scale = sqrt(target.n_embd) (BF16-rounded to match Gemma4 training precision) and f_final_logit_softcapping = target's value. For non-Gemma4 targets (e.g. Qwen3) explicitly zero both, so non-Gemma drafters do not pick up stale defaults.
dflash.cpp:
- f_embedding_scale: applied automatically by build_inp_embd via its existing Granite-arch code path (llama-graph.cpp:1827-1829). No manual ggml_scale needed in dflash.cpp; a manual scale double-applies because build_inp_embd already does it. (First fix attempt did this manual scale, tanked Q6_K from 6.88% → 2.65%. Lesson noted.)
- f_final_logit_softcapping: applied manually after lm_head matmul, matching gemma4.cpp:443-447 exactly. Monotonic so does not affect greedy argmax, but matches drafter's training distribution.
Bench result (Q6_K drafter, 3 runs, q8_0 KV):
baseline: 8.51% / 7.64% / 4.49% mean 6.88%
with fix: 6.80% / 11.36% / 8.51% mean 8.89% (+2pp)
11.36% Q6_K is the highest accept rate of the entire dflash project (prior best was HANDOFF Round-3 lucky 10.69% single sample).
The fix is partially vindicated: it moved the needle and crossed double digits cleanly for the first time. But +2pp mean is not the published 30-50% range, so something else is still missing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
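The two transforms described in this commit can be sketched in isolation. This is a standalone illustration under stated assumptions, not the ggml graph code: `bf16_round`, `embedding_scale`, `softcap`, and `argmax` are hypothetical helpers, the softcap is taken to be the usual `cap * tanh(x / cap)` form, and NaN/Inf handling in the rounding is omitted.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Round an fp32 value to the nearest bf16 (round-to-nearest-even) and widen back.
// NaN/Inf inputs are not handled in this sketch.
float bf16_round(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u); // rounding bias, ties to even
    bits &= 0xFFFF0000u;                   // keep sign, exponent, top 7 mantissa bits
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Embedding scale inherited from the target: sqrt(n_embd), rounded to bf16.
float embedding_scale(int n_embd) {
    return bf16_round(std::sqrt((float) n_embd));
}

// Final logit softcap, assumed to be cap * tanh(logit / cap).
float softcap(float logit, float cap) {
    return cap * std::tanh(logit / cap);
}

// Index of the largest element (first one wins on ties).
size_t argmax(const std::vector<float> & v) {
    size_t best = 0;
    for (size_t i = 1; i < v.size(); ++i) {
        if (v[i] > v[best]) {
            best = i;
        }
    }
    return best;
}
```

For n_embd = 5376 the unrounded scale is about 73.32, and bf16 has a step of 0.5 in that range, so the rounded value is 73.5. Because tanh is monotonic, applying `softcap` to every logit leaves the greedy argmax unchanged as long as the logits are not so large that they saturate to the same float.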
Update HANDOFF with the b0a828e fix result and the build_inp_embd double-scale footgun for future arch ports. Best Q6_K accept crosses double digits (11.36%) for the first time; mean lift +2pp. Reframes the goal: vLLM PR #41703 published 21.68% on MT-Bench (conversational) and 44.88% on HumanEval (code). Our prompt is MT-Bench-class so the realistic target is ~21%, not 44%. We're at 8.89% mean — ~12pp gap remains. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Compared safetensors against Anbeeld GGUF tensor list. All 58 tensors present in both with correct shapes. Hypothesis 1 (GGUF conversion fidelity at the inventory level) ruled out. What remains untested: per-tensor numerical comparison (bf16 reference values vs Q6_K dequantized). Would need torch + ggml-py + a few hours to script properly. Next logical diagnostic but not started. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…an Q6_K
Wrote /tmp/compare_dflash_weights.py to compare bf16 safetensors against dequantized GGUF Q6_K, per tensor.
Results:
F32 norm tensors (22): exact match (0% error)
Q6_K weights (36): 1.78% mean relative RMS error, max 2.155%
No outlier tensor. Q6_K is a clean quantization of the z-lab safetensors. Hypothesis 1 (GGUF conversion fidelity) is RULED OUT at both inventory and numerical levels.
The remaining accept-rate gap (~12pp to vLLM's 21% MT-Bench reference) is most likely Q6_K compounding through 5 drafter layers. The only way to confirm is a BF16 drafter bench (needs ctx <= 2048 + VRAM coordination, currently OOMs at ctx=4096).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
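The per-tensor metric can be sketched as below. The commit does not spell out the formula used by the script, so this assumes the common definition: RMS of the difference divided by RMS of the reference. `relative_rms_error` is a hypothetical name.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Relative RMS error of `test` against `ref`:
//   sqrt(sum((ref - test)^2) / sum(ref^2))
// Assumed definition; the diagnostic script may normalize differently.
double relative_rms_error(const std::vector<float> & ref, const std::vector<float> & test) {
    if (ref.size() != test.size()) {
        throw std::invalid_argument("tensor sizes differ");
    }
    double err2 = 0.0;
    double ref2 = 0.0;
    for (size_t i = 0; i < ref.size(); ++i) {
        const double d = (double) ref[i] - (double) test[i];
        err2 += d * d;
        ref2 += (double) ref[i] * (double) ref[i];
    }
    if (ref2 == 0.0) {
        // All-zero reference: error is 0 if the tensors match, unbounded otherwise.
        return err2 == 0.0 ? 0.0 : INFINITY;
    }
    return std::sqrt(err2 / ref2);
}
```

Under this definition an exact match gives 0, and a quantized tensor is compared after dequantizing it back to floats.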
Move /tmp/compare_dflash_weights.py into scripts/compare-dflash-weights.py so the diagnostic survives reboot and is reproducible. Parameterized via argv for arbitrary safetensors + GGUF paths; defaults to the Gemma4 31B DFlash drafter pair under $MODELS. Used by Round-7b to confirm the Anbeeld GGUF Q6_K is a clean quantization of the z-lab safetensors bf16 (mean 1.78% relative RMS error, max 2.155%; normal Q6_K quantization noise, no outlier tensor indicating a converter bug). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drafter GGUF carries attention.sliding_window=2048 and
attention.sliding_window_pattern (e.g. [T,T,T,T,F] for the Anbeeld
Gemma4 drafter). Our decoder previously used uniform full attention
on all layers. Latent at ctx_window<=2048 (no token gets windowed
out) but breaks correctness the moment ctx grows past the window —
SWA layers were trained against masked attention but run at inference
without it.
Implementation:
src/models/dflash.cpp:load_arch_hparams
Reads LLM_KV_ATTENTION_SLIDING_WINDOW into hparams.n_swa, sets
swa_type = LLAMA_SWA_TYPE_STANDARD, populates hparams.swa_layers
from LLM_KV_ATTENTION_SLIDING_WINDOW_PATTERN. Re-uses the
std::array<int,16> template instantiation already provided for
dflash_target_layer_ids (vector<bool>/vector<int> overloads are
not template-instantiated for get_arr). Falls back to all-SWA if
the pattern KV is absent but n_swa > 0.
src/llama-graph.{h,cpp}:llm_graph_input_dflash
Added optional second mask tensor kq_mask_swa with the bucket-padding
mask PLUS per-(q_pos,k_pos) sliding-window masking. Only allocated
when the drafter is_swa_any().
src/models/dflash.cpp:llm_build_dflash_decode
Per-layer mask selection: SWA layers route through kq_mask_swa,
dense layers keep kq_mask. At ctx_window <= n_swa the SWA mask is
numerically identical to the full mask, so this is bench-neutral
for current configs (LLAMA_DFLASH_CTX_WINDOW=512 default; SWA
window 2048).
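The bench-neutrality claim (the SWA mask equals the full mask while the context fits inside the window) can be illustrated with a small mask builder. This is a sketch under assumptions, not the `llm_graph_input_dflash` code: `build_kq_mask` is a hypothetical helper, and the window predicate `q_pos - k_pos >= n_swa` is assumed to follow the standard sliding-window rule.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major [n_q x n_kv_padded] additive attention mask:
// 0.0f means visible, -INFINITY means masked.
// Columns past k_pos.size() are bucket padding and always masked.
// n_swa == 0 means no sliding window (full, non-causal attention).
std::vector<float> build_kq_mask(const std::vector<int> & q_pos,
                                 const std::vector<int> & k_pos,
                                 size_t n_kv_padded,
                                 int n_swa) {
    std::vector<float> mask(q_pos.size() * n_kv_padded, -INFINITY);
    for (size_t q = 0; q < q_pos.size(); ++q) {
        for (size_t k = 0; k < k_pos.size() && k < n_kv_padded; ++k) {
            // Assumed window rule: a key is dropped once it is n_swa or more
            // positions behind the query.
            const bool outside_window = n_swa > 0 && (q_pos[q] - k_pos[k] >= n_swa);
            if (!outside_window) {
                mask[q * n_kv_padded + k] = 0.0f;
            }
        }
    }
    return mask;
}
```

With 512 key positions and a window of 2048, no key is ever 2048 positions behind a query, so the windowed mask and the full mask come out identical. Shrinking the window below the context length is what makes the two diverge.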
Verified:
- cmake build clean for llama-speculative-simple + test-dflash
- 8/8 dflash unit tests pass
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per-layer SWA mask now implemented; bench-neutral at ctx<=2048 but correct for future ctx expansion. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds scripts/bench-dflash.sh — systematic bench across drafter quants (Q4/Q6/Q8/BF16) × prompt classes (MT-Bench-style conversational, HumanEval-style code) × N runs per condition. Outputs a timestamped markdown table to /tmp/dflash-bench-<ts>.md. Each (drafter, prompt) pair runs 3x by default to show the known variance floor (~2-3pp run-to-run on same seed). Per-run stderr is preserved per-condition under /tmp/dflash-bench-<ts>-runs/ for debugging individual outliers. VRAM guard: warns if <20 GB free (centurion-llm holds ~21 GB when active; coordinate scale-down via snoop-kube first). vLLM PR #41703 published acceptance for Gemma4-26B with this drafter shape — HumanEval 44.69%, MT-Bench 21.68% — included in script header as the comparison baseline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cover the new SWA code paths added in 4b10869:
test_dflash_swa_defaults: confirm hparams default to no-SWA
test_dflash_swa_anbeeld_pattern: [T,T,T,T,F] routes through is_swa correctly
test_dflash_input_swa_ctor: llm_graph_input_dflash carries n_swa via ctor
11/11 unit tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Surfaces the LRU timestamp the router already tracks (server_model_meta.last_used). Lets clients sort by MRU when picking among resident models without falling back to per-peer priority orderings. Driven by heierchat mission m-20260524-165127-3bb03b — replaces the hard-coded titan>centurion>lithium ranking in heierchat's pickLoadedModel() with a "max(last_used_ms)" policy across the user's pinned peers. Field is 0 when the model has not been used since the router started. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…_ms + any routing
Validates the surface added in mission m-20260524-165127-3bb03b:
1. GET /v1/models — confirm `last_used_ms` is present per-model.
2. POST /v1/chat/completions {"model":"any",...} — confirm response.model
is the resolved instance id (not literal "any").
Defaults to the three known cluster peers (titan/centurion/lithium). Override
with positional URL args. --test-empty enables the destructive 4xx-on-no-resident
check.
Exit 0 if every reachable peer passes; non-zero if any reachable peer fails.
Unreachable peers (lithium when asleep) are skipped without failing.
Pre-deploy probe: all peers correctly flag last_used_ms missing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pre-flight for the unified-llm:dflash-d5c45b5b4 image on titan. Verifies four things in sequence against a deployed router:
1. /v1/models exposes the dflash preset with --dflash + drafter args
2. POST /v1/chat/completions loads the drafter and returns a non-empty completion
3. last_used_ms advances after the POST
4. Second POST confirms the drafter stays warm (not auto-unloaded)
Catches the silent-fallback case where the preset registers but dflash gets stripped at child-process spawn. Without step 1's explicit --dflash check, a degraded non-spec path would still answer requests.
Usage: scripts/smoke-dflash-deployed.sh <peer-url> <model-id>
Notes that the standing AI maintainer role for ht-llama.cpp's ht branch (cutting-edge features, multi-day debug arcs, multi-agent deploy coord) requires Claude Opus capability tier or equivalent. Empirically, models including DeepSeek v4 1.6T have struggled to hold this role coherently across long horizons. Less capable models remain appropriate for scoped tasks handed off by an Opus-class maintainer with concrete instructions. Applies to ht branch only; upstream contribution rules from the preceding sections still govern any work targeting ggml-org/llama.cpp.
The titan-llm entrypoint passes --remap-developer-role to the spawned llama-server child process. This branch (feat/dflash-integration) predates ht's introduction of that flag, so the dflash unified-llm image refused to start with 'invalid argument: --remap-developer-role'.
Cherry-picking the canonical origin/ht commit 070ab65 pulls in TurboQuant kv_cache_type additions (TBQ3_0/TBQ4_0) that are not on this branch and don't compile. Same for a follow-on LoRA-discovery refactor. Doing the surgical port instead:
common/common.h + bool remap_developer_role = false
common/arg.cpp + add_opt for --remap-developer-role / LLAMA_ARG_REMAP_DEVELOPER_ROLE
server-common.h + server_chat_params.remap_developer_role
server-common.cpp + per-message developer→system rewrite in oaicompat_chat_params_parse
server-context.cpp + populate chat_params.remap_developer_role from params_base
Smoke: `llama-server --remap-developer-role --help` no longer errors; flag shows in --help output. llama-server build clean.
Unblocks unified-llm:dflash-d5c45b5b4 image rebuild from this branch tip.
…agement
Two bugs in the initial smoke surfaced against titan deploy of dflash-794ddb2df:
- Gemma4 in default thinking mode puts output in .reasoning_content not
.content. Smoke checked only .content and reported empty when generation
actually succeeded. Fix: also accept reasoning_content as a sign of life,
and pass chat_template_kwargs.enable_thinking=false to short-circuit
the reasoning preamble for this minimal probe.
- max_tokens=8 was getting fully consumed by the reasoning preamble before
any visible content was emitted. Bumped to 64 — still small for a smoke
but enough headroom across verbose templates.
Also added a positive check that timings.draft_n > 0 — confirms the drafter
is actually being invoked, not silently falling through to non-spec decoding.
This is the bit a casual smoke would miss; if dflash wasn't engaged we'd
still get a normal completion but draft_n would be 0.
Re-ran against titan: PASS all four steps.
Records the production deploy outcome:
- Live preset gemma-4-31b-dflash-Q6_K on titan via snoop's image bake
- End-to-end smoke green (scripts/smoke-dflash-deployed.sh)
- Live accept 4.48% vs centurion bench 8.89% Q6_K mean
- Snoop's three hypotheses for the delta deferred (not user-blocking)
DFlash is functional in production; net throughput is below break-even at this accept rate but the picker UX works and the route resolves.
…ill the server
Root cause for mission m-20260527-103737 (Markus stuck-generation on titan dflash):
cpp-httplib calls send() with CPPHTTPLIB_SEND_FLAGS = 0 - no MSG_NOSIGNAL.
When a client disconnects mid-stream and the server tries to write the next
chunk, send() raises SIGPIPE. Default handler exits the process silently,
no segfault, no log line. The router's bookkeeping never gets updated, so
the slot stays marked 'loaded' while pointing to a zombie/defunct port -
every subsequent request returns instant 'proxy error: Could not establish
connection'.
The dflash path made this visible because the longer per-iteration compute
opens a wider window for the client-cancel-during-send race. Other model
paths are vulnerable too - this affects ANY streaming endpoint.
Fix: install signal(SIGPIPE, SIG_IGN) alongside the existing SIGINT/SIGTERM
sigactions in llama_server() main on Unix. EPIPE now surfaces as a return
from send() and cpp-httplib's normal cancel path handles it.
Verified locally on centurion with the dflash preset:
- kill -SIGPIPE on child PID 10x in a row: child survives all 10
- 10x stream:true POST with curl --max-time 1 (forced disconnect): child
stays in R/Rl state across all aborts; immediately serves a followup
non-stream request returning OK
Pre-fix (titan): single client-cancel mid-checkpoint -> silent child death
-> router wedged loaded but proxies to dead port.
Post-fix: same scenario -> child handles EPIPE, slot releases cleanly,
router stays consistent.
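The effect of the fix can be reproduced outside the server with a pipe instead of a socket. This is a minimal POSIX-only sketch; `ignore_sigpipe` and `errno_after_write_to_closed_pipe` are hypothetical names, and the pipe stands in for the disconnected client.

```cpp
#include <cerrno>
#include <csignal>
#include <unistd.h>

// Ignore SIGPIPE so a write to a closed peer fails with EPIPE
// instead of terminating the process via the default handler.
void ignore_sigpipe() {
    std::signal(SIGPIPE, SIG_IGN);
}

// Writes one byte into a pipe whose read end is already closed.
// Returns the errno seen by write() (EPIPE once SIGPIPE is ignored),
// 0 if the write unexpectedly succeeded, or -1 if pipe() itself failed.
// Without ignore_sigpipe() the process would be killed inside write().
int errno_after_write_to_closed_pipe() {
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }
    close(fds[0]); // simulate the reader (client) going away
    const char byte = 'x';
    const ssize_t n = write(fds[1], &byte, 1);
    const int err = (n < 0) ? errno : 0;
    close(fds[1]);
    return err;
}
```

Calling `ignore_sigpipe()` once at startup is enough, since the disposition is process-wide. The caller then has to treat EPIPE as an ordinary error return, which is the path the commit says cpp-httplib already handles.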
Mission m-20260527-103737. Per Markus directive: test that guards against the dflash NaN regression.
Bug profile being guarded: server-side spec_decode integration emits NaN drafter logits on /v1/chat/completions (jinja chat template path). Drafter argmaxes to <pad> for every position, target rejects every draft, accept rate is 0%. dflash silently adds zero value while consuming GPU.
The /v1/completions path is healthy on the same model, so the bug is chat-template-specific. Smoke targets /v1/chat/completions deliberately.
Verified against titan (currently carries the bug): 3/3 runs, 291 drafts each, 0 accepts -> FAIL (signature detected)
Verified against a hypothetical healthy peer would show: 3/3 runs, ~6-15% accept rate -> PASS
Run: scripts/smoke-dflash-no-nan.sh <peer-url> <dflash-model-id> [N_RUNS]
CI integration: post-deploy verify gate against a known dflash peer. Independent of the SIGPIPE fix (a0d9552) which guards a different bug class (client-cancel-mid-stream wedge).
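The failure signature the smoke looks for (drafts issued, none accepted) reduces to a small predicate. This is a sketch only: the real check lives in a shell script, `run_timings` and `has_zero_accept_signature` are hypothetical, `draft_n` is the field named in the earlier smoke commit, and `draft_n_accepted` plus the "every run" aggregation are assumptions.

```cpp
#include <vector>

// Per-run counters, as a hypothetical mirror of the response's timings block.
struct run_timings {
    int draft_n;          // tokens proposed by the drafter
    int draft_n_accepted; // tokens the target accepted (assumed field name)
};

// True when the NaN signature is present: there is at least one run, every run
// drafted tokens, and no run accepted any of them.
bool has_zero_accept_signature(const std::vector<run_timings> & runs) {
    if (runs.empty()) {
        return false;
    }
    for (const auto & r : runs) {
        if (r.draft_n <= 0 || r.draft_n_accepted > 0) {
            return false;
        }
    }
    return true;
}
```

Requiring `draft_n > 0` keeps this distinct from the silent-fallback case, where the drafter is never invoked and there is nothing to accept or reject.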
Root cause for mission m-20260527-103737 (Markus all-NaN drafter logits on titan dflash chat completions):
Prompt cache restores target KV state for cached prefix tokens via llama_state_seq_set_data_ext but does NOT re-extract dflash target features for those positions. After restore, only NEW tokens decode into ctx_tgt's dflash.target_features. The dflash impl's draft() then reads n_new = n - dflash_n_past features, where n_new counts ALL prompt tokens (cached + new). That read overflows the buffer past its actual size (only NEW tokens) -> OOB read -> garbage features fed to drafter -> drafter forward pass produces NaN logits at every position -> argmax = <pad> -> target rejects every proposal -> 0% accept.
Affected every chat completion after the first for any given slot. /v1/completions same vulnerability (just hit by first probe by chance).
Verified locally: 3 sequential chat completions, default cache_prompt, 3.51% / 5.88% / 8.33% accept, zero NaN lines. Pre-fix: same sequence 0% NaN across all three.
Field workaround (still works): cache_prompt: false in request body.
Proper fix (deferred): re-extract dflash features for restored prefix, OR teach dflash impl to skip cached positions on draft start. Either lets us re-enable prompt cache and recover the prefill-skip perf win.
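The overflow described above comes from requesting more feature rows than the buffer holds. A defensive bounds check, shown here purely as an illustration and not as the fix this commit ships, could look like the following. `feature_slice` and `slice_new_features` are hypothetical names; the variable names follow the commit text.

```cpp
#include <cstddef>
#include <vector>

// View over the first n_tokens feature rows of the target feature buffer.
struct feature_slice {
    const float * data;
    size_t        n_tokens;
};

// Returns the first n_new = n_total - dflash_n_past rows, or an empty slice when
// the buffer holds fewer rows than that (for example after a prompt-cache
// restore, where only the non-cached tokens were decoded and extracted).
feature_slice slice_new_features(const std::vector<float> & target_features,
                                 size_t n_feat,
                                 size_t n_total,
                                 size_t dflash_n_past) {
    if (n_feat == 0 || n_total <= dflash_n_past) {
        return {nullptr, 0};
    }
    const size_t n_new  = n_total - dflash_n_past;
    const size_t n_rows = target_features.size() / n_feat;
    if (n_new > n_rows) {
        return {nullptr, 0}; // would read past the end: skip drafting this step
    }
    return {target_features.data(), n_new};
}
```

A guard like this turns the out-of-bounds read into a skipped draft, which hides the NaN symptom but does not restore the missing prefix features. That is why the commit lists re-extraction or skipping cached positions as the proper follow-up.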
Captures the root cause, fix, and deferred proper-fix paths so future sessions can pick up the cleaner cache-features integration without re-deriving the diagnosis.
…is active
Follow-up to d7a88fd which only gated the GLOBAL server_prompt_cache load path. There is a SECOND cache mechanism at server-context.cpp slot prefill (the per-slot prompt-tokens common-prefix reuse, around the slot.task->params.cache_prompt branch) that ALSO causes the dflash NaN cascade: any path that lets the target skip decoding cached prefix tokens means the dflash feature buffer holds features for only the NEW tokens, but the drafter still reads n_new = n_total - dflash_n_past entries -> OOB read -> NaN logits at every drafter position.
Symptom on titan post-d7a88fdbc rollout: snoop smoke FAIL 0/1455 accept across 5 runs with identical prompts (which max out the per-slot prefix reuse). Default cache_prompt:true still triggers NaN because the request ends up at the second cache path even though the global cache is gated.
Local verify with titan-matched config (--parallel 1, --jinja, --cont-batching), 5 sequential IDENTICAL prompts (matches snoop's smoke pattern):
Pre-this-fix (only d7a88fd): 0/0/0/0/0% NaN
Post (this fix): 7.36/9.35/9.24/8.84/6.90% accept, ZERO NaN
The d7a88fd gate is still correct (covers the global cache), this just adds the missing second gate at the per-slot path. Both ship together; either alone is insufficient.
Verified field workaround cache_prompt:false still works as before (disables both paths via the per-request param).
Mission m-20260527-103737.
d7a88fd gated only the global server_prompt_cache load path. The per-slot prompt prefix reuse at server-context.cpp:2582 (driven by the per-request cache_prompt flag) was a second cache mechanism that remained active and re-triggered the OOB / NaN bug on identical-prompt smoke runs. Both gates needed; reflected in HANDOFF Round-9 writeup.
… DFlash active
Follow-up to 65f46f0 which gated the per-slot prompt-tokens reuse path. A THIRD cache mechanism, the context-checkpoint restore at the SWA-guarded pos_min >= pos_min_thold branch of the slot prefill flow, has the same bug class: load_tgt restores the target's KV state for cached prefix positions but does NOT re-extract dflash target features for them. The subsequent decode only fills features for [n_past..n_total), while the drafter reads n_new = n_total - dflash_n_past entries (dflash_n_past=0 right after common_speculative_begin), overflowing the buffer end → NaN logits → 0% accept.
Why local probes missed it: with --parallel >= 4 (centurion default during dev) requests spread across slots so checkpoints per slot stay small; the path was rarely reachable. Titan ran --parallel 1 so a single slot accumulated checkpoints across consecutive identical-prompt smoke runs. Every iteration past the first hit the checkpoint-restore path and triggered the OOB.
Verified on titan with default cache_prompt:true (no client workaround):
baseline pre-fix: 0/873 accept (NaN signature)
this fix: 96/1179 accept (8.14%, NaN absent)
Heierchat's client-side cache_prompt:false workaround (in chat.service.ts) can now be lifted as a follow-up; the gates A+B+C cover the bug at source.
Latent follow-up: the partial-accept rollback at server-context.cpp:3352 (spec-decode use_ckpt_tgt path) does not reset dflash_n_past either, but in practice produces "drafter skips" (n_new < 1) rather than OOB. Filed for a separate commit.
Mission m-20260527-103737.
…titan Documents 44ea356: why Gates A+B looked sufficient locally but the third cache mechanism (context-checkpoint restore at server-context.cpp:2756) kept the NaN cascade alive on titan under --parallel 1 + repeated-prompt traffic. Captures: bug-class similarity to A+B, why --parallel count masks it, hot-patch loop via ubuntu22.04 build container for glibc parity, verified titan smoke (0/873 -> 96/1179 = 8.14% accept), and the deferred spec-rollback follow-up at server-context.cpp:3352. Mission m-20260527-103737 closed.
Heierchat reported child died (zombie) on the first streaming chat
request after Gate C smoke verified. Streaming was a red herring — the
actual differentiator was prompt length × n_ubatch.
extract_dflash_features (src/llama-context.cpp) was resizing the
target_features buffer per ubatch, so for any prompt larger than
n_ubatch (default 512) the buffer ended up holding ONLY the last
ubatch's features. The drafter then read n_new = n_total - dflash_n_past
features from offset 0 of that buffer, overflowing past
target_features.size() into adjacent heap memory → SIGSEGV → child
process exited unreaped → router saw zombie backend → 500s to clients.
The Round-9/10 smoke prompts ("Write five short haikus about the ocean")
were ~35 tokens, fit in a single ubatch, so the buffer happened to be
correctly sized and the bug never tripped on the smoke harness.
Chat history with system prompt + a couple of turns trivially crosses
the 512-token threshold.
This patch makes target_features APPEND across ubatches and across
decode calls within a single request:
- extract_dflash_features: replace per-ubatch resize() with append
via resize(prev_size + new). Each ubatch's features land at offset
prev_floats, preserving everything that came before.
- New API llama_clear_dflash_target_features(ctx) for the drafter
to call at request boundaries.
- common_speculative_impl_dflash::begin() calls the clear so a fresh
request starts the buffer at zero.
- common_speculative_impl_dflash::draft() now reads from offset
(dflash_n_past * n_target_features) instead of offset 0, since the
buffer holds all tokens since begin() rather than just "what's new".
The legacy read-from-offset-0 was equivalent to the new behavior only
when the buffer happened to hold exactly n_new tokens (single-ubatch
decode steady state) — fragile coincidence the smoke happened to satisfy.
Mission m-20260527-103737 post-Gate-C streaming crash.
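
The overwrite-versus-append difference described in this commit can be illustrated with a minimal sketch. `feature_buffer`, `store_overwrite`, `store_append` and `read` are hypothetical names chosen for illustration, not the real llama.cpp types; the only assumption taken from the commit is that features live in a flat float vector holding `n_target_features` floats per token. The bounds-checked `read` makes the original out-of-bounds access visible as an exception instead of a heap overrun.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the per-context feature buffer (not the real llama.cpp type).
struct feature_buffer {
    std::size_t        n_feat;  // floats per token (n_target_features in the commit)
    std::vector<float> data;    // flat layout: n_tokens * n_feat floats

    // Buggy variant: sizing the buffer to the current ubatch discards earlier ubatches.
    void store_overwrite(const std::vector<float> & ubatch) {
        data.assign(ubatch.begin(), ubatch.end());
    }

    // Fixed variant: grow by the ubatch size and copy at the previous end (prev_floats).
    void store_append(const std::vector<float> & ubatch) {
        const std::size_t prev_floats = data.size();
        data.resize(prev_floats + ubatch.size());
        std::copy(ubatch.begin(), ubatch.end(), data.begin() + prev_floats);
    }

    std::size_t n_tokens() const { return data.size() / n_feat; }

    // Bounds-checked read of n_new tokens starting at token offset tok0.
    std::vector<float> read(std::size_t tok0, std::size_t n_new) const {
        if ((tok0 + n_new) * n_feat > data.size()) {
            throw std::out_of_range("feature read past end of buffer");
        }
        const auto first = data.begin() + tok0 * n_feat;
        return std::vector<float>(first, first + n_new * n_feat);
    }
};
```

With a 600-token prompt split into ubatches of 512 and 88 tokens, the overwrite variant ends up holding 88 tokens, so a 600-token read from offset 0 runs past the end; the append variant holds all 600.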
Follow-up to 770fed4 that resolves three correctness issues caught in review:
1) begin() clear was the wrong scope: it fires AFTER the prompt-prefill decode that just populated target_features, so the very next draft() read an empty buffer (OOB or all-zero noise into the drafter).
2) begin() also clears target_features for parallel sibling slots that share ctx_tgt (review finding 1b).
3) APPEND-with-offset-read still suffered post-rollback drift because dflash_n_past stays stale while re-decoded features extend the buffer past the offset the drafter reads from (review finding 1a).
The fix replaces "APPEND-always + offset read + begin-clear" with "APPEND-within-decode + clear at decode start + offset-0 read":
- llama_context::decode() now clears target_features at its start (gated on cparams.dflash_extract_enabled), before the ubatch loop. Inter-decode reset, intra-decode accumulate.
- extract_dflash_features still APPENDS so multi-ubatch decodes (any prompt > n_ubatch=512) accumulate correctly. This is the actual fix for the heierchat streaming SIGSEGV.
- common_speculative_impl_dflash::begin() no longer touches target_features; only resets drafter-side dflash_n_past + accumulated_ctx.
- common_speculative_impl_dflash::draft() reverts to offset-0 read. After a decode, the buffer holds exactly that decode's features, so the first n_new positions are precisely the slice the drafter needs (prompt features for the first draft, sampled+accepted-draft features for subsequent drafts).
Net effect: equivalent to the ORIGINAL design's semantics, just with multi-ubatch prefill made correct. All three review concerns fall away because target_features lifetime is now scoped to a single decode call: bounded memory, no cross-slot stomping at begin(), no rollback drift (next decode's clear pre-empts it).
Mission m-20260527-103737. Will hot-patch via build-ubuntu .so for verification on titan before pushing.
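
The "clear at decode start, append within decode" lifetime can be sketched as a toy model. `toy_context` and its members are hypothetical; they borrow names from the commit message but are not the real `llama_context` API, and the feature values are zero-filled placeholders where a real extractor would copy layer outputs.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical toy context; names mirror the commit message, not the real llama.cpp API.
struct toy_context {
    bool               dflash_extract_enabled = true;
    std::size_t        n_ubatch               = 512;
    std::size_t        n_feat                 = 4;   // illustrative floats per token
    std::vector<float> target_features;

    // One decode call: clear once up front, then append once per ubatch.
    void decode(std::size_t n_tokens) {
        if (dflash_extract_enabled) {
            target_features.clear();  // inter-decode reset
        }
        for (std::size_t i = 0; i < n_tokens; i += n_ubatch) {
            const std::size_t n = std::min(n_ubatch, n_tokens - i);
            if (dflash_extract_enabled) {
                // intra-decode accumulate (placeholder zeros instead of real features)
                target_features.resize(target_features.size() + n * n_feat, 0.0f);
            }
        }
    }

    std::size_t n_tokens_held() const { return target_features.size() / n_feat; }
};
```

After any decode the buffer holds exactly that call's tokens, which is why an offset-0 read of `n_new` positions is sufficient in this design.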
Under continuous batching (--parallel > 1) multiple slots co-decode in a single llama_decode() call. The dflash target_features buffer is a flat vector scoped to ctx_tgt, so a multi-slot batch interleaves features from different slots and per-slot draft() reads garbage. v2 of the feature-buffer fix (b6b96bb) closes the single-slot lifecycle holes but does not address the cross-slot stomping that has been latent since day one. Until target_features is keyed per seq_id, refuse to start the server when DFlash is enabled with --parallel > 1. Fail fast in load_model() before we pay the cost of loading the draft model. Follow-up to review on PR #53.
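
A fail-fast check of this kind might look like the sketch below. `server_params`, its fields, and `validate_dflash_params` are hypothetical names for illustration only; the actual check in load_model() and the server's real parameter struct are not shown in this thread.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical parameter struct; field names are illustrative, not the server's real ones.
struct server_params {
    bool dflash_enabled = false;
    int  n_parallel     = 1;
};

// Rejects the unsupported combination before any model is loaded.
inline void validate_dflash_params(const server_params & p) {
    if (p.dflash_enabled && p.n_parallel > 1) {
        throw std::invalid_argument(
            "DFlash requires --parallel 1 (got " + std::to_string(p.n_parallel) + ")");
    }
}
```

Running the validation ahead of model loading keeps the failure cheap: the process exits with a clear message instead of serving corrupted drafts.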
Author
DFlash Speculative Decoding v2 Hot-Patch Deployed & Verified
I have pushed the v2 hot-patch fixes to the branch. Fixes Included:
Verification Results on Titan Pod:
ggml_graph_get_tensor(gf, "inp_pos_full") had zero matching set-name sites across src/, tools/, examples/, common/. The if(pos_full) block always no-op'd. Verified no behavior change: libllama.so + llama-server rebuild clean against build-cuda.
Author
Superseded by the new PR: feat/dflash-integration squashed and rebased onto the post-rewrite ht. See #62.
marksverdhei added a commit that referenced this pull request Jun 12, 2026
* scripts(dflash): Round-12 target-precision bench + parity scaffold + gguf guard
Three additive scripts for the DFlash accept-rate investigation (Round-12), none touching tracked source so they sit cleanly alongside the PR #53 squash:
- gguf-meta.py: numpy-free GGUF header reader with --check-instruct, which refuses base-fine-tune and truncated/stub GGUFs. Prevents the base-vs-instruct confound (an -it-trained DFlash drafter benched against a base target).
- bench-dflash-target-sweep.sh: sweeps the TARGET quant (drafter fixed) to test whether target-side quant noise off the drafter's bf16 training distribution drives the 8% vs ~21% accept gap. Accept recomputed from raw n_accept/n_drafted counts; mean +/- sample stddev over N runs; REAL(>1sigma)/within-noise deltas.
- dflash-logit-parity.py: scaffold for FORWARD logit parity vs the z-lab PyTorch drafter (Round-7b only did weight parity). Constants read data-driven from the drafter config.json; reference forward marked TODO(zlab) pending the z-lab modeling code (HF repo ships weights only).
* scripts(dflash): gguf-meta --check-instruct rejects truncated tensor data
The guard validated the GGUF header but not that the tensor DATA was present, so a file truncated mid-write (valid header, missing weights) passed --check-instruct and would have been benched, loading garbage or crashing mid-run. Caught empirically: the corrupt gemma-4-31B-it-Q5_K_M.gguf (1.5GB, header intact) slipped through.
read_meta() now walks the tensor-info section, computes the minimum file size implied by the tensor offsets + alignment, and sets _data_complete. --check-instruct rejects when actual size < implied minimum. Same failure class as the HF-xet silent shard drop the download step hit.
Verified: corrupt Q5 (1.5GB < 21.7GB) REFUSED; Q8_0/BF16/Q4_K_M/IQ4_XS all complete and ACCEPT.
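
The truncation check can be sketched as follows. This is not the gguf-meta.py code (which is Python and is not reproduced in this thread); it is a C++ illustration that assumes each tensor's byte size has already been computed from its shape and type, that offsets are relative to the start of the data section, and that the data section begins at the header end rounded up to the alignment. `tensor_info`, `implied_min_size` and `data_complete` are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-tensor record: offset is relative to the start of the data section.
struct tensor_info {
    std::uint64_t offset;
    std::uint64_t n_bytes;  // assumed precomputed from shape and type
};

// Round x up to the next multiple of alignment (alignment > 0).
inline std::uint64_t align_up(std::uint64_t x, std::uint64_t alignment) {
    return (x + alignment - 1) / alignment * alignment;
}

// Smallest file size that can contain every tensor's bytes.
inline std::uint64_t implied_min_size(std::uint64_t header_end, std::uint64_t alignment,
                                      const std::vector<tensor_info> & tensors) {
    const std::uint64_t data_start = align_up(header_end, alignment);
    std::uint64_t       data_end   = 0;
    for (const auto & t : tensors) {
        data_end = std::max(data_end, t.offset + t.n_bytes);
    }
    return data_start + data_end;
}

// True when the file on disk is at least as large as its header implies.
inline bool data_complete(std::uint64_t actual_size, std::uint64_t header_end,
                          std::uint64_t alignment, const std::vector<tensor_info> & tensors) {
    return actual_size >= implied_min_size(header_end, alignment, tensors);
}
```

A file with an intact header but missing weights fails this test because its actual size falls below the implied minimum, which matches the 1.5GB versus 21.7GB case reported above.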
Summary
DFlash speculative decoding (upstream PR ggml-org#22105 by ruixiang63) integrated for Gemma4-31B target + Anbeeld/gemma-4-31B-it-DFlash-GGUF drafter.
Status: functional but underperforming. Best Q6_K acceptance 11.36% (mean 8.89% over 3 runs) on conversational prompts. Reference acceptance per vLLM PR #41703 on Gemma4-26B with same drafter shape: MT-Bench 21.68%, HumanEval 44.69%. ~12pp gap remains, bench-bound on BF16 testing.
Hold for merge until acceptance reaches the net-speedup break-even point (~25% accept at block_size=16). The PR is preserved as a long-running working branch.
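
For reference, acceptance figures like those above are ratios of accepted to drafted tokens, summarized across runs as a mean and sample standard deviation. The sketch below shows that arithmetic; `accept_rate`, `run_stats` and `summarize` are hypothetical helper names and are not taken from the bench scripts.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Acceptance rate from raw counters; returns 0 when nothing was drafted.
inline double accept_rate(long n_accept, long n_drafted) {
    return n_drafted > 0 ? static_cast<double>(n_accept) / static_cast<double>(n_drafted) : 0.0;
}

struct run_stats {
    double mean;
    double stddev;  // sample standard deviation (divides by N - 1)
};

// Mean and sample stddev over per-run acceptance rates.
inline run_stats summarize(const std::vector<double> & xs) {
    run_stats s{0.0, 0.0};
    if (xs.empty()) {
        return s;
    }
    for (double x : xs) {
        s.mean += x;
    }
    s.mean /= static_cast<double>(xs.size());
    if (xs.size() < 2) {
        return s;  // stddev is undefined for a single run; report 0
    }
    double ss = 0.0;
    for (double x : xs) {
        ss += (x - s.mean) * (x - s.mean);
    }
    s.stddev = std::sqrt(ss / static_cast<double>(xs.size() - 1));
    return s;
}
```

As a check against a figure quoted earlier in this thread, `accept_rate(96, 1179)` evaluates to roughly 0.0814, matching the reported 8.14%.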
What's in this branch
Core integration (e354dd747..449638691):
- LLM_ARCH_DFLASH + KV namespace (target_layer_ids, block_size, mask_token_id, n_target_features)
- src/models/dflash.cpp
- dflash_extract_N cb hooks
- common_speculative_impl_dflash with full encode→accumulate→decode→sample loop
- llama_set_dflash, llama_get_dflash_target_features, llama_set_dflash_accumulated_target_ctx, model helpers
- --dflash for /v1/chat/completions
- model: "any" resolves to most-recently-used resident model on the router (c468706cd)
Root-cause fix (b0a828e8e):
- tok_embd + lm_head with the target. For Gemma4 targets, the drafter must inherit sqrt(n_embd) noise embedding normalization and final_logit_softcapping = 30.0 (the transforms Gemma4's pipeline applies around those shared weights). Per vLLM PR #41703.
- f_embedding_scale and f_final_logit_softcapping when target arch is LLM_ARCH_GEMMA4.
- build_inp_embd auto-applies f_embedding_scale (Granite-arch path): do NOT also add a manual ggml_scale in the drafter graph, or you double-scale.
Correctness fix (4b10869a7):
- attention.sliding_window=2048 and sliding_window_pattern=[T,T,T,T,F]. Bench-neutral at ctx≤2048, correctness-correct for future expansion.
Audit conclusions (HANDOFF.md on branch)
Tested and ruled out:
Tooling added in this branch
- scripts/bench-dflash.sh: systematic Q4/Q6/Q8/BF16 × MT-Bench/HumanEval × 3 runs with VRAM guard
- scripts/compare-dflash-weights.py: per-tensor safetensors↔GGUF numerical comparison
- tests/test-dflash.cpp: 8 unit tests (arch registration, hparams, graph params, API symbols, etc.)
Tests
- ./build-dflash/bin/test-dflash)
- llama-cli, llama-server, llama-speculative-simple
Known gaps (not blockers for this PR, captured in HANDOFF)
- common_speculative_impl_dflash::draft() processes only seq_id=0.
Reference
🤖 Generated with Claude Code