feat(dflash): complete DFlash speculative decoding integration #53

Closed
marksverdhei wants to merge 42 commits into ht from feat/dflash-integration

marksverdhei commented May 22, 2026

Summary

DFlash speculative decoding (upstream PR ggml-org#22105 by ruixiang63) integrated for Gemma4-31B target + Anbeeld/gemma-4-31B-it-DFlash-GGUF drafter.

Status: functional but underperforming. Best Q6_K acceptance 11.36% (mean 8.89% over 3 runs) on conversational prompts. Reference acceptance per vLLM PR #41703 on Gemma4-26B with same drafter shape: MT-Bench 21.68%, HumanEval 44.69%. ~12pp gap remains, bench-bound on BF16 testing.

Hold for merge until acceptance reaches the net-speedup break-even point (~25% accept at block_size=16). The PR is preserved as a long-running working branch.

What's in this branch

Core integration (e354dd747..449638691):

  • LLM_ARCH_DFLASH + KV namespace (target_layer_ids, block_size, mask_token_id, n_target_features)
  • DFlash encoder + decoder graph in src/models/dflash.cpp
  • Three-stage feature extraction pipeline (tag → capture → copy) via dflash_extract_N cb hooks
  • common_speculative_impl_dflash with full encode→accumulate→decode→sample loop
  • Public API: llama_set_dflash, llama_get_dflash_target_features, llama_set_dflash_accumulated_target_ctx, model helpers
  • Server flag: --dflash for /v1/chat/completions
  • model: "any" resolves to most-recently-used resident model on the router (c468706cd)

Root-cause fix (b0a828e8e):

  • Drafters share tok_embd + lm_head with the target. For Gemma4 targets, the drafter must inherit sqrt(n_embd) noise embedding normalization and final_logit_softcapping = 30.0 (the transforms Gemma4's pipeline applies around those shared weights). Per vLLM PR #41703.
  • llama-context.cpp cross-binding sets f_embedding_scale and f_final_logit_softcapping when target arch is LLM_ARCH_GEMMA4.
  • Watch out: build_inp_embd auto-applies f_embedding_scale (Granite-arch path) — do NOT also add a manual ggml_scale in the drafter graph, or you double-scale.
  • Lift: Q6_K mean 6.88% → 8.89% (+2pp), best run 11.36%.

Correctness fix (4b10869a7):

  • Per-layer SWA mask in drafter decoder. Drafter GGUF carries attention.sliding_window=2048 and sliding_window_pattern=[T,T,T,T,F]. Bench-neutral at ctx≤2048, but required for correctness once the context grows past the window.

Audit conclusions (HANDOFF.md on branch)

Tested and ruled out:

  • Feature-slice off-by-one (Round-2)
  • Extraction-point off-by-one (Round-3, A/B within noise)
  • Per-layer renorm of fused_target (Round-4, made accept WORSE)
  • Structural divergence vs upstream PR ggml-org/llama.cpp#22105 / z-lab PyTorch / vLLM qwen3_dflash (Round-5)
  • GGUF tensor inventory miss (Round-7, 58/58 tensors map cleanly)
  • Q6_K conversion fidelity (Round-7b, max 2.155% relative RMS error vs bf16 safetensors)

Tooling added in this branch

  • scripts/bench-dflash.sh — systematic Q4/Q6/Q8/BF16 × MT-Bench/HumanEval × 3 runs with VRAM guard
  • scripts/compare-dflash-weights.py — per-tensor safetensors↔GGUF numerical comparison
  • tests/test-dflash.cpp — 8 unit tests (arch registration, hparams, graph params, API symbols, etc.)

Tests

  • All 8 DFlash unit tests pass (./build-dflash/bin/test-dflash)
  • Build clean for llama-cli, llama-server, llama-speculative-simple

Known gaps (not blockers for this PR, captured in HANDOFF)

  • Acceptance gap to vLLM reference (~12pp) — likely Q6_K quant compounding through 5 drafter layers. BF16 drafter test pending VRAM coordination.
  • Multi-seq support: common_speculative_impl_dflash::draft() processes only seq_id=0.
  • gemma4.cpp DFlash extraction has three env-gated modes (LLAMA_DFLASH_EXTRACT={late,early,upstream}) — late wins by a hair, kept as default. Other env knobs documented in HANDOFF.

Reference

🤖 Generated with Claude Code

marksverdhei and others added 30 commits May 22, 2026 03:47
- Add LLM_ARCH_DFLASH arch enum, KV keys, tensor enums
- Add DFlash hparams (target_layer_ids, block_size, mask_token_id)
- Add DFlash cparams (dflash_extract_enabled)
- Add llama_dflash struct + graph context fields
- Add DFlash C API (llama_set_dflash, llama_get_dflash_target_features, etc.)
- Add DFlash extraction pipeline in llama-context (set_dflash, extract_dflash_features, graph_get_cb hook)
- Add DFlash graph type handling, position tensor fill, decoder context init
- Add llama_model_dflash class (load_arch_hparams/tensors, build_arch_graph)
- Add llm_build_dflash_encode/decode graph builders
- Add DFlash model file (src/models/dflash.cpp)
- Add --dflash CLI arg, DFLASH speculative type
- Wire DFlash in server-context, speculative-simple, convert_hf_to_gguf

Source: upstream PR ggml-org#22105 by ruixiang63 (ggml-org/llama.cpp)
Adapted for post-Spring-Cleaning refactor master.
Still WIP — llama-model-loader arch registration and some model wiring pending.
Resetting llama-context.cpp/h, llama-model.cpp/h, llama-graph.cpp,
models.h, model-saver.cpp to master. The squash-merge generated
too many stale patch residues. Will reapply DFlash additions cleanly.
…odel-only)

Adds DFlash speculative decoding library infrastructure:
- LLM_ARCH_DFLASH arch enum + KV keys + tensor enums
- DFlash hparams (target_layer_ids, block_size, mask_token_id)
- DFlash cparams (dflash_extract_enabled)
- llama_dflash struct (extraction layer indices, target features)
- llama_model_dflash class (load_arch_hparams/tensors, build_arch_graph)
- llm_build_dflash_encode/decode graph builders
- DFlash C API (llama_set_dflash, etc.)
- --dflash CLI arg, COMMON_SPECULATIVE_TYPE_DFLASH

Extraction pipeline (llama-context.cpp) TBD — needs fresh hooks
written for post-Spring-Cleaning master.
- Architecture: LLM_ARCH_DFLASH enum + KV params (target_layer_ids,
  block_size, mask_token_id) with GGUF model loading
- Model: DFlash encoder (fc fusion + rms norm) and single-layer
  cross-attention decoder with noise-token input
- Graph: llama_dflash struct routed through llm_graph_params/context,
  with target_model pointer in llama_context_params
- Extraction pipeline: graph_get_cb intercepts dflash_extract_N names,
  extract_dflash_features reads tagged tensors post-compute
- API: llama_set_dflash, llama_get_dflash_target_features,
  llama_set_dflash_accumulated_target_ctx, plus model helpers
- Speculative decoder: common_speculative_impl_dflash with full
  encode->accumulate->decode->sample draft loop in common/speculative.cpp
- Model injection: llama model graph builder tags hidden states at
  DFlash target layers via cb('dflash_extract_N')

Compiles clean: llama-server, llama-cli, all libraries.
- Removed 9 .orig/.rej patch backup files accidentally committed
- Added tests/test-dflash.cpp with 8 tests covering:
  * Arch registration and name lookup
  * Tensor info maps (layer/op assignment)
  * hparams defaults for DFlash fields
  * llama_dflash struct lifecycle and clear() semantics
  * llm_graph_params dflash pointer wiring
  * COMMON_SPECULATIVE_TYPE_DFLASH enum placement
  * llama_context_params target_model field
  * DFlash API symbol link-time resolution
- Registered test-dflash in tests/CMakeLists.txt
- All 8 tests pass
…hrough

POST /v1/chat/completions with {"model": "any"} now resolves to whichever
model the router currently has resident in memory — preferring LOADED over
SLEEPING, and within each tier the most-recently-used. Returns
"no model is currently resident in memory" (HTTP 400) if nothing is loaded.

Lets clients reach the active model without having to track which one the
router decided to keep resident. The sentinel "any" is reserved at the
router lookup layer; a user model literally named "any" would be unreachable
via that string (still reachable via aliases).
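
The resolution order described above can be sketched as follows. This is a minimal illustration, not the router's code: the `ModelMeta` type, the status strings, and `resolve_any` are hypothetical names, and only the "LOADED before SLEEPING, then most recently used" rule comes from the commit message.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelMeta:
    name: str
    status: str        # hypothetical labels: "LOADED", "SLEEPING", "UNLOADED"
    last_used_ms: int  # 0 if the model has not been used since router start


def resolve_any(models: List[ModelMeta]) -> Optional[str]:
    """Pick the model that the sentinel "any" should resolve to."""
    # Prefer LOADED over SLEEPING; within a tier, take the most recently used.
    for tier in ("LOADED", "SLEEPING"):
        candidates = [m for m in models if m.status == tier]
        if candidates:
            return max(candidates, key=lambda m: m.last_used_ms).name
    # Nothing resident: the caller would answer HTTP 400 here.
    return None


models = [
    ModelMeta("gemma-a", "SLEEPING", 9000),
    ModelMeta("gemma-b", "LOADED", 1000),
    ModelMeta("gemma-c", "LOADED", 5000),
]
print(resolve_any(models))
```

Note that a sleeping model never wins over a loaded one in this sketch, even when it was used more recently.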

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
End-to-end DFlash speculative decoding compiles, loads, and runs against
gemma-4-31B-it-Q4_K_M + Anbeeld/gemma-4-31B-it-DFlash-GGUF (Q4_K_M).
But acceptance rate is 4.9% (9 of 183 drafted tokens) — DFlash is ~2x
slower than baseline (14.9 vs 29.1 t/s on the centurion 3090, FA on,
temp 0). Integration is correct mechanically; the perf claim does not
land yet.
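
For reference, the two headline numbers above follow directly from the raw counts. The helper names below are illustrative only; the inputs are the figures quoted in this commit message.

```python
def accept_rate(accepted: int, drafted: int) -> float:
    """Percentage of drafted tokens that the target model accepted."""
    if drafted <= 0:
        raise ValueError("drafted must be positive")
    return 100.0 * accepted / drafted


def slowdown(baseline_tps: float, spec_tps: float) -> float:
    """How many times slower speculative decoding is than the baseline."""
    return baseline_tps / spec_tps


rate = accept_rate(9, 183)    # 9 of 183 drafted tokens accepted
ratio = slowdown(29.1, 14.9)  # baseline t/s vs DFlash t/s
print(f"accept {rate:.1f}%, slowdown {ratio:.2f}x")
```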

Ruled out:
- Arch loading: llm_arch_from_string strips -draft, model-loader passes
  arch_name_override so dflash-draft.* KV lookups resolve.
- Tokenizer mismatch: vocab + merges sha256 byte-identical between
  target and drafter, only EOS designation differs (target=106
  end_of_turn, drafter=1 eos — end-of-stream only, doesn't affect
  mid-stream verification).
- Drafter graph structure: cross-attention over target_hidden with
  pos_ctx + kq_mask filled in llm_graph_input_dflash::set_input.
- Feature extraction hooks fire on the right layers
  (target_layer_ids = [1,12,23,35,46,57] for the 60-layer target);
  gemma4.cpp + llama.cpp both tag post-l_out hidden state.

Prime suspect (logged in HANDOFF.md): in
common_speculative_impl_dflash::draft() the call to
llama_get_dflash_target_features(ctx_tgt) returns features for the last
ubatch (K+1 tokens during verification) but we slice the first n_new
(typically 1-2 committed). If ubatch position order doesn't line up with
commit order, the drafter gets fed features for discarded tokens, a
cascading misalignment that would produce ~5% acceptance even with a
perfectly aligned drafter (per snoop-kube's analysis).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rate hunt

Adds two env knobs to common_speculative_impl_dflash::draft() so the next
bench iteration can isolate the 4.9% accept root cause without rebuilding:

  LLAMA_DFLASH_DEBUG=1
      Print per-iteration (n_new, n, dflash_n_past_old, features[0..4]).
      Confirms slice/alignment hypothesis by exposing actual values.

  LLAMA_DFLASH_CTX_WINDOW=<n>
      Cap accumulated context to N tokens (default 512). Set to 0 to
      disable truncation and feed full accumulated context to drafter.
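
A rough sketch of how the LLAMA_DFLASH_CTX_WINDOW knob behaves, under two assumptions that the commit message does not state: that truncation keeps the most recent entries, and that negative values are clamped to 0. The function names are hypothetical.

```python
import os

DEFAULT_CTX_WINDOW = 512  # default stated in the commit message


def ctx_window_from_env(env=os.environ) -> int:
    """Read LLAMA_DFLASH_CTX_WINDOW, falling back to the default."""
    raw = env.get("LLAMA_DFLASH_CTX_WINDOW")
    if raw is None:
        return DEFAULT_CTX_WINDOW
    return max(int(raw), 0)  # assumption: negative values behave like 0


def truncate_context(features: list, window: int) -> list:
    """Cap the accumulated context; window == 0 disables truncation."""
    if window == 0 or len(features) <= window:
        return features
    return features[-window:]  # assumption: the newest entries are kept


print(ctx_window_from_env({}))
print(truncate_context(list(range(10)), 4))
```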

HANDOFF expanded with structural-consistency findings vs the dflash-pr
POC (POC expects ffn_norm tensors; our GGUF has post_attention_norm, so
the POC graph isn't directly portable) and a prioritized list of next
experiments (graph-reuse disable via LLAMA_GRAPH_REUSE_DISABLE=1, ctx
truncation disable, BF16 drafter, extraction-point swap).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Added LLAMA_DFLASH_EXTRACT=early env toggle to gemma4.cpp that captures
target hidden states BEFORE the per-layer-embedding processing and
out_scale (vs default: after l_out). Late extraction wins clearly on
Q6_K (late 10.7% vs early 6.2%) and is within noise on Q4_K_M (late 4.9%
vs early 5.6%), so the default stays, but the knob is now wired for
future ablations.

Round-2/3 bench findings in HANDOFF.md:
  - Graph reuse INNOCENT: LLAMA_GRAPH_REUSE_DISABLE=1 gives identical
    4.92% accept, ruling out cross-iteration tensor corruption.
  - ctx_window truncation has minor effect (+25% accept when disabled).
  - Drafter quant has bounded effect (~4-10% accept range across
    Q4_K_M, Q5_K_M, Q6_K, Q8_0, BF16).
  - Extraction point not the major lever.

Remaining hypotheses: per-layer renorm of fused_target inside the
dflash decoder graph, or RoPE position scheme not matching training.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…worse

Added env-gated experiment LLAMA_DFLASH_PER_LAYER_RENORM in
src/models/dflash.cpp that re-applies layer.attn_norm to fused_target
before each layer's wk/wv projection.

A/B on Q6_K, 3 runs per arm, with centurion-llm scaled to 0:
  renorm OFF: 5.56% / 6.22% / 6.22%  (mean 6.00%)
  renorm ON:  2.02% / 4.92% / 4.30%  (mean 3.75%)

Per-layer renorm degrades accept by ~2.25pp. Drafter was NOT trained
with per-layer ctx renorm; current single-norm-at-entry implementation
(matching the POC design) is correct.

Env-gate stays in (defaulted off) for future ablation symmetry.

Also surfaces a separate finding: Q6_K baseline accept varies ±2pp
run-to-run on same seed/prompt/code — HANDOFF Round-3 table value
of 10.69% for Q6_K appears to be an outlier or stale-code state;
reproducible range under current HEAD is 4.3-6.2%. Possible causes:
sampler RNG, KV cache state leakage, CUDA reduction non-determinism.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rence

Full implementation audit against authoritative sources (upstream
ggml-org/llama.cpp PR ggml-org#22105, z-lab/dflash PyTorch reference, vLLM
qwen3_dflash module, Anbeeld drafter GGUF metadata dump).

Findings (full table in HANDOFF.md):

  Matches reference cleanly: fc + hidden_norm once outside loop;
  K/V concat order [ctx, noise]; attn_norm on noise only; q_norm/
  k_norm placement; V not normed/RoPE'd; block content
  [id_last, MASK×15]; sample positions 1..15; attn_post_norm as
  FFN-input norm (Gemma-specific); SwiGLU FFN; lm_head and tok_embd
  bound to target at llama-context.cpp:376-377; non-causal attention.
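
As a small illustration of the block layout named above ([id_last, MASK×15], sampled at positions 1..15), assuming block_size=16 as in the PR summary. The function names and the token ids used in the example are placeholders, not values from the drafter GGUF.

```python
from typing import List


def build_draft_block(id_last: int, mask_token_id: int, block_size: int = 16) -> List[int]:
    """Drafter input: the last committed token followed by MASK tokens."""
    if block_size < 2:
        raise ValueError("block_size must leave room for at least one MASK")
    return [id_last] + [mask_token_id] * (block_size - 1)


def sample_positions(block_size: int = 16) -> range:
    """Positions whose logits are sampled: every slot except slot 0."""
    return range(1, block_size)


# Placeholder ids purely for illustration.
block = build_draft_block(id_last=1234, mask_token_id=4)
print(len(block), block[:3], list(sample_positions())[-1])
```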

  Divergences worth noting (none explain the 4-8% accept):
    - SWA pattern [T,T,T,T,F] in drafter GGUF, not implemented in our
      decoder graph. Irrelevant at ctx_window=512.
    - Local position scheme vs reference's monotonic absolute positions.
      Equivalent under RoPE-relative attention.

Round-5 bench: extraction-point ablation (LLAMA_DFLASH_EXTRACT=upstream
added in gemma4.cpp, tags inpL at layer start to match upstream PR's
+1-shift convention):

  mode=late (current default):    8.51% / 7.64% / 4.49%  mean 6.88%
  mode=upstream (PR convention):  4.49% / 7.64% / 5.23%  mean 5.79%

Means overlap within one sigma; exact accepted/drafted counts repeat
across modes (11/144 and 7/156) — bench has ~3 RNG-driven states and
extraction-point is not the decision boundary. Late wins by a hair,
weakly suggesting Anbeeld's Gemma converter did NOT apply the +1 shift.
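
The "within one sigma" reading can be checked from the six runs quoted above. This sketch assumes sigma means the sample standard deviation of each three-run arm, which the commit message does not specify.

```python
from statistics import mean, stdev

# Accept rates (%) from the Round-5 extraction-point ablation.
late = [8.51, 7.64, 4.49]
upstream = [4.49, 7.64, 5.23]

gap = mean(late) - mean(upstream)
print(f"late     mean {mean(late):.2f}  stdev {stdev(late):.2f}")
print(f"upstream mean {mean(upstream):.2f}  stdev {stdev(upstream):.2f}")
print(f"gap between means: {gap:.2f}pp")

# The gap is smaller than either arm's run-to-run spread.
within_one_sigma = abs(gap) < min(stdev(late), stdev(upstream))
```

With only three runs per arm the spread estimate is itself noisy, so this supports "not distinguishable" rather than "equal".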

The gemma4.cpp refactor consolidates the three extraction modes
(late default / early / upstream) behind a single dflash_mode enum
to avoid double-tagging when multiple modes' static lambdas would
otherwise both fire.

Conclusion: no single structural bug at the llama.cpp level explains
the 4-8% ceiling vs published 30-50%. Top remaining suspect is GGUF
conversion fidelity vs Anbeeld safetensors (requires HF download +
reference inference setup, parked).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per snoop-kube's audit review: SWA absence is latent at
ctx_window=512 but would matter the moment LLAMA_DFLASH_CTX_WINDOW
exceeds 2048 — clarify the regime where the divergence becomes
real, so future ctx-expansion work flags it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ftcap

Root-cause find from vLLM PR #41703 description:

  "DFlash shares target embeddings. For Gemma4 targets, the draft path
   now applies the target embedding normalization (sqrt(hidden_size))
   and passes final_logit_softcapping into LogitsProcessor."

DFlash drafters share tok_embd with the target model. Gemma4's standard
input pipeline scales token embeddings by sqrt(n_embd) (~73x at hidden
5376) before they hit the first decoder layer. The drafter was trained
against those scaled embeddings — feeding raw embeddings is ~73x off.

Same story for the final logit softcap (30.0): drafter trained against
the softcapped distribution, so its lm_head output (which shares the
target's lm_head) needs the same transform applied.

llama-context.cpp cross-binding: when target arch == LLM_ARCH_GEMMA4,
inherit f_embedding_scale = sqrt(target.n_embd) (BF16-rounded to match
Gemma4 training precision) and f_final_logit_softcapping = target's
value. For non-Gemma4 targets (e.g. Qwen3) explicitly zero both, so
non-Gemma drafters do not pick up stale defaults.

dflash.cpp:
- f_embedding_scale: applied automatically by build_inp_embd via its
  existing Granite-arch code path (llama-graph.cpp:1827-1829). No
  manual ggml_scale needed in dflash.cpp — a manual scale double-applies
  because build_inp_embd already does it. (First fix attempt did this
  manual scale, tanked Q6_K from 6.88% → 2.65%. Lesson noted.)
- f_final_logit_softcapping: applied manually after lm_head matmul,
  matching gemma4.cpp:443-447 exactly. Monotonic so does not affect
  greedy argmax, but matches drafter's training distribution.
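
The two transforms discussed above can be illustrated numerically. This is a standalone sketch, not the llama.cpp graph code: `bf16_round` is a hypothetical helper that assumes round-to-nearest-even and ignores NaN/inf, and the example logits are made up.

```python
import math
import struct


def bf16_round(x: float) -> float:
    """Round a value to bfloat16 precision (nearest, ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias on the low 16 bits
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def softcap(logits, cap=30.0):
    """Final logit softcapping: cap * tanh(logit / cap)."""
    return [cap * math.tanh(x / cap) for x in logits]


n_embd = 5376
scale_exact = math.sqrt(n_embd)       # roughly 73.3
scale_bf16 = bf16_round(scale_exact)  # coarser: bf16 steps are 0.5 in [64, 128)

logits = [-12.0, 4.5, 95.0, 40.0]     # made-up values
capped = softcap(logits)


def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])


print(scale_exact, scale_bf16)
print(argmax(logits) == argmax(capped))  # tanh is monotonic
```

The softcap squeezes every logit into (-30, 30) without reordering them, which is why greedy argmax is unaffected while sampled distributions are not.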

Bench result (Q6_K drafter, 3 runs, q8_0 KV):

  baseline:  8.51% / 7.64% / 4.49%  mean 6.88%
  with fix:  6.80% / 11.36% / 8.51%  mean 8.89%   (+2pp)

11.36% Q6_K is the highest accept rate of the entire dflash project
(prior best was HANDOFF Round-3 lucky 10.69% single sample). The fix
is partially vindicated — moved the needle, crossed double digits
cleanly for the first time — but +2pp mean is not the published
30-50% range, so something else is still missing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Update HANDOFF with the b0a828e fix result and the build_inp_embd
double-scale footgun for future arch ports. Best Q6_K accept crosses
double digits (11.36%) for the first time; mean lift +2pp.

Reframes the goal: vLLM PR #41703 published 21.68% on MT-Bench
(conversational) and 44.69% on HumanEval (code). Our prompt is
MT-Bench-class so the realistic target is ~21%, not 44%. We're at
8.89% mean — ~12pp gap remains.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Compared safetensors against Anbeeld GGUF tensor list. All 58 tensors
present in both with correct shapes. Hypothesis 1 (GGUF conversion
fidelity at the inventory level) ruled out.

What remains untested: per-tensor numerical comparison (bf16 reference
values vs Q6_K dequantized). Would need torch + ggml-py + a few hours
to script properly. Next logical diagnostic but not started.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…an Q6_K

Wrote /tmp/compare_dflash_weights.py to compare bf16 safetensors against
dequantized GGUF Q6_K, per tensor. Results:

  F32 norm tensors (22):  exact match (0% error)
  Q6_K weights    (36):  1.78% mean relative RMS error, max 2.155%

No outlier tensor. Q6_K is a clean quantization of the z-lab safetensors.
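
One plausible definition of the per-tensor metric is sketched below: RMS of the error divided by RMS of the reference. This is an assumption, since the commit message does not define "relative RMS error", and the script itself may compute it differently.

```python
import math
from typing import Sequence


def relative_rms_error(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """RMS(reference - candidate) / RMS(reference), as a fraction."""
    if len(reference) != len(candidate):
        raise ValueError("tensors must have the same number of elements")
    err_sq = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    ref_sq = sum(r * r for r in reference)
    if ref_sq == 0.0:
        return 0.0 if err_sq == 0.0 else math.inf
    # The 1/n factors of the two means cancel, so they are omitted.
    return math.sqrt(err_sq / ref_sq)


# Toy values: an error of 0.5 against a reference of norm 5 is 10%.
print(relative_rms_error([3.0, 4.0], [3.0, 4.5]))
```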

Hypothesis 1 (GGUF conversion fidelity) is RULED OUT at both inventory
and numerical levels. The remaining accept-rate gap (~12pp to vLLM's
21% MT-Bench reference) is most likely Q6_K compounding through 5
drafter layers — only way to confirm is a BF16 drafter bench (needs
ctx <= 2048 + VRAM coordination, currently OOMs at ctx=4096).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Move /tmp/compare_dflash_weights.py into scripts/compare-dflash-weights.py
so the diagnostic survives reboot and is reproducible. Parameterized via
argv for arbitrary safetensors + GGUF paths; defaults to the Gemma4 31B
DFlash drafter pair under $MODELS.

Used by Round-7b to confirm the Anbeeld GGUF Q6_K is a clean
quantization of the z-lab safetensors bf16 (mean 1.78% relative RMS
error, max 2.155% — normal Q6_K quantization noise, no outlier
tensor indicating a converter bug).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drafter GGUF carries attention.sliding_window=2048 and
attention.sliding_window_pattern (e.g. [T,T,T,T,F] for the Anbeeld
Gemma4 drafter). Our decoder previously used uniform full attention
on all layers. Latent at ctx_window<=2048 (no token gets windowed
out) but breaks correctness the moment ctx grows past the window —
SWA layers were trained against masked attention but run at inference
without it.

Implementation:

src/models/dflash.cpp:load_arch_hparams
  Reads LLM_KV_ATTENTION_SLIDING_WINDOW into hparams.n_swa, sets
  swa_type = LLAMA_SWA_TYPE_STANDARD, populates hparams.swa_layers
  from LLM_KV_ATTENTION_SLIDING_WINDOW_PATTERN.  Re-uses the
  std::array<int,16> template instantiation already provided for
  dflash_target_layer_ids (vector<bool>/vector<int> overloads are
  not template-instantiated for get_arr).  Falls back to all-SWA if
  the pattern KV is absent but n_swa > 0.

src/llama-graph.{h,cpp}:llm_graph_input_dflash
  Added optional second mask tensor kq_mask_swa with the bucket-padding
  mask PLUS per-(q_pos,k_pos) sliding-window masking.  Only allocated
  when the drafter is_swa_any().

src/models/dflash.cpp:llm_build_dflash_decode
  Per-layer mask selection: SWA layers route through kq_mask_swa,
  dense layers keep kq_mask. At ctx_window <= n_swa the SWA mask is
  numerically identical to the full mask, so this is bench-neutral
  for current configs (LLAMA_DFLASH_CTX_WINDOW=512 default; SWA
  window 2048).
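
The bench-neutral claim can be seen in a toy version of the two masks. This sketch assumes the usual sliding-window rule (a key is hidden once q_pos - k_pos >= n_swa) and leaves out the bucket-padding part of the real mask; the names are illustrative, not the llama.cpp API.

```python
NEG_INF = float("-inf")


def build_masks(n_ctx: int, n_swa: int):
    """Return (full, swa) additive attention masks of shape [n_ctx][n_ctx]."""
    full = [[0.0] * n_ctx for _ in range(n_ctx)]
    swa = [
        [NEG_INF if (q - k) >= n_swa else 0.0 for k in range(n_ctx)]
        for q in range(n_ctx)
    ]
    return full, swa


def mask_for_layer(il: int, swa_layers, full, swa):
    """Per-layer selection: SWA layers get the windowed mask."""
    return swa if swa_layers[il] else full


SWA_LAYERS = [True, True, True, True, False]  # [T,T,T,T,F] from the drafter GGUF

# Context shorter than the window: nothing is windowed out.
full_small, swa_small = build_masks(n_ctx=8, n_swa=2048)
print(full_small == swa_small)

# Context longer than the window: the masks diverge.
full_big, swa_big = build_masks(n_ctx=8, n_swa=4)
print(full_big == swa_big)
```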

Verified:
- cmake build clean for llama-speculative-simple + test-dflash
- 8/8 dflash unit tests pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per-layer SWA mask now implemented; bench-neutral at ctx<=2048
but required for correctness under future ctx expansion.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds scripts/bench-dflash.sh — systematic bench across drafter quants
(Q4/Q6/Q8/BF16) × prompt classes (MT-Bench-style conversational,
HumanEval-style code) × N runs per condition. Outputs a timestamped
markdown table to /tmp/dflash-bench-<ts>.md.

Each (drafter, prompt) pair runs 3x by default to show the known
variance floor (~2-3pp run-to-run on same seed). Per-run stderr is
preserved per-condition under /tmp/dflash-bench-<ts>-runs/ for
debugging individual outliers.

VRAM guard: warns if <20 GB free (centurion-llm holds ~21 GB when
active; coordinate scale-down via snoop-kube first).

vLLM PR #41703 published acceptance for Gemma4-26B with this drafter
shape — HumanEval 44.69%, MT-Bench 21.68% — included in script header
as the comparison baseline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cover the new SWA code paths added in 4b10869:
  test_dflash_swa_defaults — confirm hparams default to no-SWA
  test_dflash_swa_anbeeld_pattern — [T,T,T,T,F] routes through is_swa correctly
  test_dflash_input_swa_ctor — llm_graph_input_dflash carries n_swa via ctor

11/11 unit tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Surfaces the LRU timestamp the router already tracks (server_model_meta.last_used).
Lets clients sort by MRU when picking among resident models without falling
back to per-peer priority orderings.

Driven by heierchat mission m-20260524-165127-3bb03b — replaces the
hard-coded titan>centurion>lithium ranking in heierchat's pickLoadedModel()
with a "max(last_used_ms)" policy across the user's pinned peers.

Field is 0 when the model has not been used since the router started.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…_ms + any routing

Validates the surface added in mission m-20260524-165127-3bb03b:

  1. GET /v1/models — confirm `last_used_ms` is present per-model.
  2. POST /v1/chat/completions {"model":"any",...} — confirm response.model
     is the resolved instance id (not literal "any").

Defaults to the three known cluster peers (titan/centurion/lithium). Override
with positional URL args. --test-empty enables the destructive 4xx-on-no-resident
check.

Exit 0 if every reachable peer passes; non-zero if any reachable peer fails.
Unreachable peers (lithium when asleep) are skipped without failing.

Pre-deploy probe: all peers correctly flag last_used_ms missing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pre-flight for the unified-llm:dflash-d5c45b5b4 image on titan.

Verifies four things in sequence against a deployed router:
  1. /v1/models exposes the dflash preset with --dflash + drafter args
  2. POST /v1/chat/completions loads the drafter and returns a non-empty completion
  3. last_used_ms advances after the POST
  4. Second POST confirms the drafter stays warm (not auto-unloaded)

Catches the silent-fallback case where the preset registers but dflash gets
stripped at child-process spawn — without step 1's explicit --dflash check,
a degraded non-spec path would still answer requests.

Usage:
  scripts/smoke-dflash-deployed.sh <peer-url> <model-id>
Notes that the standing AI maintainer role for ht-llama.cpp's ht branch
(cutting-edge features, multi-day debug arcs, multi-agent deploy coord)
requires Claude Opus capability tier or equivalent. Empirically, models
including DeepSeek v4 1.6T have struggled to hold this role coherently
across long horizons. Less capable models remain appropriate for scoped
tasks handed off by an Opus-class maintainer with concrete instructions.

Applies to ht branch only; upstream contribution rules from the
preceding sections still govern any work targeting ggml-org/llama.cpp.
The titan-llm entrypoint passes --remap-developer-role to the spawned
llama-server child process. This branch (feat/dflash-integration)
predates ht's introduction of that flag, so the dflash unified-llm
image refused to start with 'invalid argument: --remap-developer-role'.

Cherry-picking the canonical origin/ht commit 070ab65 pulls in
TurboQuant kv_cache_type additions (TBQ3_0/TBQ4_0) that are not on
this branch and don't compile. Same for a follow-on LoRA-discovery
refactor. Doing the surgical port instead:

  common/common.h         + bool remap_developer_role = false
  common/arg.cpp          + add_opt for --remap-developer-role / LLAMA_ARG_REMAP_DEVELOPER_ROLE
  server-common.h         + server_chat_params.remap_developer_role
  server-common.cpp       + per-message developer→system rewrite in oaicompat_chat_params_parse
  server-context.cpp      + populate chat_params.remap_developer_role from params_base

Smoke: `llama-server --remap-developer-role --help` no longer errors;
flag shows in --help output. llama-server build clean.

Unblocks unified-llm:dflash-d5c45b5b4 image rebuild from this branch tip.
…agement

Two bugs in the initial smoke surfaced against titan deploy of dflash-794ddb2df:

  - Gemma4 in default thinking mode puts output in .reasoning_content not
    .content. Smoke checked only .content and reported empty when generation
    actually succeeded. Fix: also accept reasoning_content as a sign of life,
    and pass chat_template_kwargs.enable_thinking=false to short-circuit
    the reasoning preamble for this minimal probe.

  - max_tokens=8 was getting fully consumed by the reasoning preamble before
    any visible content was emitted. Bumped to 64 — still small for a smoke
    but enough headroom across verbose templates.

Also added a positive check that timings.draft_n > 0 — confirms the drafter
is actually being invoked, not silently falling through to non-spec decoding.
This is the bit a casual smoke would miss; if dflash wasn't engaged we'd
still get a normal completion but draft_n would be 0.

Re-ran against titan: PASS all four steps.
Records the production deploy outcome:
  - Live preset gemma-4-31b-dflash-Q6_K on titan via snoop's image bake
  - End-to-end smoke green (scripts/smoke-dflash-deployed.sh)
  - Live accept 4.48% vs centurion bench 8.89% Q6_K mean
  - Snoop's three hypotheses for the delta deferred (not user-blocking)

DFlash is functional in production; net throughput is below break-even
at this accept rate but the picker UX works and the route resolves.
…ill the server

Root cause for mission m-20260527-103737 (Markus stuck-generation on titan dflash):

  cpp-httplib calls send() with CPPHTTPLIB_SEND_FLAGS = 0 - no MSG_NOSIGNAL.
  When a client disconnects mid-stream and the server tries to write the next
  chunk, send() raises SIGPIPE. Default handler exits the process silently,
  no segfault, no log line. The router's bookkeeping never gets updated, so
  the slot stays marked 'loaded' while pointing to a zombie/defunct port -
  every subsequent request returns instant 'proxy error: Could not establish
  connection'.

The dflash path made this visible because the longer per-iteration compute
opens a wider window for the client-cancel-during-send race. Other model
paths are vulnerable too - this affects ANY streaming endpoint.

Fix: install signal(SIGPIPE, SIG_IGN) alongside the existing SIGINT/SIGTERM
sigactions in llama_server() main on Unix. EPIPE now surfaces as a return
from send() and cpp-httplib's normal cancel path handles it.
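
The effect of ignoring SIGPIPE can be reproduced outside the server. This is a Unix-only Python sketch of the same idea, not the C++ change itself: with the signal ignored, a write to a disconnected peer comes back as an EPIPE error instead of terminating the process. CPython already ignores SIGPIPE at startup, so the explicit call only mirrors the fix.

```python
import errno
import signal
import socket
import threading


def send_after_peer_close() -> int:
    """Return the errno seen when writing to a socket whose peer has closed."""
    # Equivalent of signal(SIGPIPE, SIG_IGN); only legal from the main thread.
    if threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGPIPE, signal.SIG_IGN)

    server, client = socket.socketpair()
    client.close()  # the client disconnects mid-stream
    try:
        server.send(b"data: next chunk\n\n")
    except OSError as exc:
        return exc.errno  # surfaced as an error, process keeps running
    finally:
        server.close()
    return 0


print(send_after_peer_close() == errno.EPIPE)
```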

Verified locally on centurion with the dflash preset:
  - kill -SIGPIPE on child PID 10x in a row: child survives all 10
  - 10x stream:true POST with curl --max-time 1 (forced disconnect): child
    stays in R/Rl state across all aborts; immediately serves a followup
    non-stream request returning OK

Pre-fix (titan): single client-cancel mid-checkpoint -> silent child death
                 -> router wedged loaded but proxies to dead port.
Post-fix:        same scenario -> child handles EPIPE, slot releases cleanly,
                 router stays consistent.
Mission m-20260527-103737. Per Markus directive: test that guards against
the dflash NaN regression.

Bug profile being guarded: server-side spec_decode integration emits
NaN drafter logits on /v1/chat/completions (jinja chat template path).
Drafter argmaxes to <pad> for every position, target rejects every draft,
accept rate is 0%. dflash silently adds zero value while consuming GPU.

The /v1/completions path is healthy on the same model — the bug is
chat-template-specific. Smoke targets /v1/chat/completions deliberately.

Verified against titan (currently carries the bug):
  3/3 runs, 291 drafts each, 0 accepts -> FAIL (signature detected)
A hypothetical healthy peer would show:
  3/3 runs, ~6-15% accept rate -> PASS

Run: scripts/smoke-dflash-no-nan.sh <peer-url> <dflash-model-id> [N_RUNS]
CI integration: post-deploy verify gate against a known dflash peer.

Independent of the SIGPIPE fix (a0d9552) which guards a different bug
class (client-cancel-mid-stream wedge).
Root cause for mission m-20260527-103737 (Markus all-NaN drafter logits
on titan dflash chat completions):

Prompt cache restores target KV state for cached prefix tokens via
llama_state_seq_set_data_ext but does NOT re-extract dflash target
features for those positions. After restore, only NEW tokens decode
into ctx_tgt's dflash.target_features.

The dflash impl's draft() then reads n_new = n - dflash_n_past
features, where n_new counts ALL prompt tokens (cached + new). That
read overflows the buffer past its actual size (only NEW tokens) -> OOB
read -> garbage features fed to drafter -> drafter forward pass
produces NaN logits at every position -> argmax = <pad> -> target
rejects every proposal -> 0% accept.
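
The arithmetic behind the overflow can be sketched as a bounds check. This only illustrates the mismatch between what draft() asks for and what the buffer holds; it is not the fix applied on this branch, and the function name and the example counts are hypothetical.

```python
def features_to_read(n_total: int, dflash_n_past: int, n_available: int) -> int:
    """How many feature rows draft() wants, checked against the buffer size."""
    n_new = n_total - dflash_n_past
    if n_new > n_available:
        raise IndexError(
            f"draft() wants {n_new} feature rows but only {n_available} "
            "were extracted (cached prefix was restored, not re-decoded)"
        )
    return n_new


# Cold prompt: every token was decoded, so the buffer covers the request.
print(features_to_read(n_total=100, dflash_n_past=0, n_available=100))

# Cached prefix of 80 tokens: only 20 rows exist, but 100 are requested.
try:
    features_to_read(n_total=100, dflash_n_past=0, n_available=20)
except IndexError as exc:
    print("out of bounds:", exc)
```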

Affected every chat completion after the first for any given slot.
/v1/completions has the same vulnerability (it was just hit by the first probe by chance).

Verified locally: 3 sequential chat completions, default cache_prompt,
3.51% / 5.88% / 8.33% accept, zero NaN lines. Pre-fix: the same sequence
gave 0% accept with NaN logits on all three.

Field workaround (still works): cache_prompt: false in request body.

Proper fix (deferred): re-extract dflash features for restored prefix,
OR teach dflash impl to skip cached positions on draft start. Either
lets us re-enable prompt cache and recover the prefill-skip perf win.
Captures the root cause, fix, and deferred proper-fix paths so future
sessions can pick up the cleaner cache-features integration without
re-deriving the diagnosis.
…is active

Follow-up to d7a88fd which only gated the GLOBAL server_prompt_cache load
path. There is a SECOND cache mechanism at server-context.cpp slot prefill
(the per-slot prompt-tokens common-prefix reuse, around the
slot.task->params.cache_prompt branch) that ALSO causes the dflash NaN
cascade: any path that lets the target skip decoding cached prefix tokens
means the dflash feature buffer holds features for only the NEW tokens,
but the drafter still reads n_new = n_total - dflash_n_past entries -> OOB
read -> NaN logits at every drafter position.

Symptom on titan post-d7a88fdbc rollout: snoop smoke FAIL 0/1455 accept
across 5 runs with identical prompts (which max out the per-slot prefix
reuse). Default cache_prompt:true still triggers NaN because the request
ends up at the second cache path even though the global cache is gated.

Local verify with titan-matched config (--parallel 1, --jinja, --cont-batching),
5 sequential IDENTICAL prompts (matches snoop's smoke pattern):
  Pre-this-fix (only d7a88fd): 0/0/0/0/0% accept, NaN
  Post (this fix):             7.36/9.35/9.24/8.84/6.90% accept, ZERO NaN

The d7a88fd gate is still correct (covers the global cache), this just
adds the missing second gate at the per-slot path. Both ship together;
either alone is insufficient.

Verified field workaround cache_prompt:false still works as before
(disables both paths via the per-request param).

Mission m-20260527-103737.
d7a88fd gated only the global server_prompt_cache load path. The
per-slot prompt prefix reuse at server-context.cpp:2582 (driven by the
per-request cache_prompt flag) was a second cache mechanism that
remained active and re-triggered the OOB / NaN bug on identical-prompt
smoke runs. Both gates needed; reflected in HANDOFF Round-9 writeup.
… DFlash active

Follow-up to 65f46f0 which gated the per-slot prompt-tokens reuse path.
A THIRD cache mechanism — the context-checkpoint restore at the SWA-guarded
pos_min >= pos_min_thold branch of the slot prefill flow — has the same
bug class: load_tgt restores the target's KV state for cached prefix
positions but does NOT re-extract dflash target features for them. The
subsequent decode only fills features for [n_past..n_total), while the
drafter reads n_new = n_total - dflash_n_past entries (dflash_n_past=0
right after common_speculative_begin), overflowing the buffer end → NaN
logits → 0% accept.

Why local probes missed it: with --parallel >= 4 (centurion default
during dev) requests spread across slots so checkpoints per slot stay
small; the path was rarely reachable. Titan ran --parallel 1 so a single
slot accumulated checkpoints across consecutive identical-prompt smoke
runs — every iteration past the first hit the checkpoint-restore path
and triggered the OOB.

Verified on titan with default cache_prompt:true (no client workaround):
  baseline pre-fix:   0/873 accept (NaN signature)
  this fix:          96/1179 accept (8.14%, NaN absent)

Heierchat's client-side cache_prompt:false workaround (in chat.service.ts)
can now be lifted as a follow-up; the gates A+B+C cover the bug at source.

Latent follow-up: the partial-accept rollback at server-context.cpp:3352
(spec-decode use_ckpt_tgt path) does not reset dflash_n_past either, but
in practice produces "drafter skips" (n_new < 1) rather than OOB.
Filed for a separate commit.

Mission m-20260527-103737.
…titan

Documents 44ea356: why Gates A+B looked sufficient locally but the
third cache mechanism (context-checkpoint restore at server-context.cpp:2756)
kept the NaN cascade alive on titan under --parallel 1 + repeated-prompt
traffic. Captures: bug-class similarity to A+B, why --parallel count
masks it, hot-patch loop via ubuntu22.04 build container for glibc parity,
verified titan smoke (0/873 -> 96/1179 = 8.14% accept), and the deferred
spec-rollback follow-up at server-context.cpp:3352.

Mission m-20260527-103737 closed.
Heierchat reported child died (zombie) on the first streaming chat
request after Gate C smoke verified. Streaming was a red herring — the
actual differentiator was prompt length × n_ubatch.

extract_dflash_features (src/llama-context.cpp) was resizing the
target_features buffer per ubatch, so for any prompt larger than
n_ubatch (default 512) the buffer ended up holding ONLY the last
ubatch's features. The drafter then read n_new = n_total - dflash_n_past
features from offset 0 of that buffer, overflowing past
target_features.size() into adjacent heap memory → SIGSEGV → child
process exited unreaped → router saw zombie backend → 500s to clients.

The Round-9/10 smoke prompts ("Write five short haikus about the ocean")
were ~35 tokens, fit in a single ubatch, so the buffer happened to be
correctly sized and the bug never tripped on the smoke harness.
Chat history with system prompt + a couple of turns trivially crosses
the 512-token threshold.

This patch makes target_features APPEND across ubatches and across
decode calls within a single request:

  - extract_dflash_features: replace per-ubatch resize() with append
    via resize(prev_size + new). Each ubatch's features land at offset
    prev_floats, preserving everything that came before.
  - New API llama_clear_dflash_target_features(ctx) for the drafter
    to call at request boundaries.
  - common_speculative_impl_dflash::begin() calls the clear so a fresh
    request starts the buffer at zero.
  - common_speculative_impl_dflash::draft() now reads from offset
    (dflash_n_past * n_target_features) instead of offset 0, since the
    buffer holds all tokens since begin() rather than just "what's new".

The legacy read-from-offset-0 was equivalent to the new behavior only
when the buffer happened to hold exactly n_new tokens (single-ubatch
decode steady state) — fragile coincidence the smoke happened to satisfy.
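
The difference between the per-ubatch resize and the append can be sketched as follows. This is an illustration under assumptions, not the code in src/llama-context.cpp: the function names `extract_overwrite` and `extract_append` are hypothetical, and features are modeled as a flat `std::vector<float>`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Pre-fix behavior as described: resize to the current ubatch only,
// so rows from earlier ubatches are lost.
static void extract_overwrite(std::vector<float> & target_features,
                              const std::vector<float> & ubatch_features) {
    target_features.resize(ubatch_features.size());
    std::copy(ubatch_features.begin(), ubatch_features.end(),
              target_features.begin());
}

// Post-fix behavior: grow by the ubatch size and write at the previous end.
static void extract_append(std::vector<float> & target_features,
                           const std::vector<float> & ubatch_features) {
    const size_t prev_floats = target_features.size();
    target_features.resize(prev_floats + ubatch_features.size());
    std::copy(ubatch_features.begin(), ubatch_features.end(),
              target_features.begin() + static_cast<std::ptrdiff_t>(prev_floats));
    assert(target_features.size() == prev_floats + ubatch_features.size());
}
```

For a 700-token prompt split into ubatches of 512 and 188 tokens, the overwrite variant ends with 188 rows while the append variant ends with all 700.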

Mission m-20260527-103737 post-Gate-C streaming crash.
Follow-up to 770fed4 that resolves three correctness issues caught in
review:

  1) begin() clear was the wrong scope: it fires AFTER the prompt-prefill
     decode that just populated target_features, so the very next draft()
     read an empty buffer (OOB or all-zero noise into the drafter).
  2) begin() also clears target_features for parallel sibling slots that
     share ctx_tgt (review finding 1b).
  3) APPEND-with-offset-read still suffered post-rollback drift because
     dflash_n_past stays stale while re-decoded features extend the
     buffer past the offset the drafter reads from (review finding 1a).

The fix replaces "APPEND-always + offset read + begin-clear" with
"APPEND-within-decode + clear at decode start + offset-0 read":

  - llama_context::decode() now clears target_features at its start
    (gated on cparams.dflash_extract_enabled), before the ubatch loop.
    Inter-decode reset, intra-decode accumulate.
  - extract_dflash_features still APPENDS so multi-ubatch decodes
    (any prompt > n_ubatch=512) accumulate correctly. This is the
    actual fix for the heierchat streaming SIGSEGV.
  - common_speculative_impl_dflash::begin() no longer touches
    target_features; only resets drafter-side dflash_n_past +
    accumulated_ctx.
  - common_speculative_impl_dflash::draft() reverts to offset-0 read.
    After a decode, the buffer holds exactly that decode's features,
    so the first n_new positions are precisely the slice the drafter
    needs (prompt features for the first draft, sampled+accepted-draft
    features for subsequent drafts).

Net effect: equivalent to the ORIGINAL design's semantics, just with
multi-ubatch prefill made correct. All three review concerns fall
away because target_features lifetime is now scoped to a single
decode call — bounded memory, no cross-slot stomping at begin(), no
rollback drift (next decode's clear pre-empts it).
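
The "clear at decode start, append within decode" lifetime can be sketched with a small stand-in type. Everything here is hypothetical scaffolding (`feature_buffer_sketch`, `decode_sketch`, the dummy feature values); it only models the buffer lifecycle the commit message describes, not llama_context::decode() itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct feature_buffer_sketch {
    std::vector<float> data;
    bool extract_enabled = true; // stands in for cparams.dflash_extract_enabled

    // Inter-decode reset: called once, before the ubatch loop.
    void begin_decode() { if (extract_enabled) data.clear(); }

    // Intra-decode accumulate: each ubatch appends its rows.
    void append_ubatch(const float * src, size_t n_floats) {
        data.insert(data.end(), src, src + n_floats);
    }
};

// Models one decode call over n_tokens, split into ubatches of n_ubatch tokens.
// Returns the number of ubatches processed.
static size_t decode_sketch(feature_buffer_sketch & fb,
                            size_t n_tokens, size_t n_ubatch, size_t n_feat) {
    assert(n_ubatch > 0);
    fb.begin_decode();
    size_t n_ubatches = 0;
    for (size_t i = 0; i < n_tokens; i += n_ubatch) {
        const size_t n = std::min(n_ubatch, n_tokens - i);
        std::vector<float> feats(n * n_feat, 0.0f); // dummy features
        fb.append_ubatch(feats.data(), feats.size());
        ++n_ubatches;
    }
    return n_ubatches;
}
```

After any call, the buffer holds exactly that decode's rows, which is why an offset-0 read is sufficient in this design.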

Mission m-20260527-103737. Will hot-patch via build-ubuntu .so for
verification on titan before pushing.
Under continuous batching (--parallel > 1) multiple slots co-decode in a
single llama_decode() call. The dflash target_features buffer is a flat
vector scoped to ctx_tgt, so a multi-slot batch interleaves features
from different slots and per-slot draft() reads garbage. v2 of the
feature-buffer fix (b6b96bb) closes the single-slot lifecycle holes
but does not address the cross-slot stomping that has been latent since
day one.

Until target_features is keyed per seq_id, refuse to start the server
when DFlash is enabled with --parallel > 1. Fail fast in load_model()
before we pay the cost of loading the draft model.
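
A fail-fast check of this kind might look like the sketch below. The function name, parameters, and error text are hypothetical; the actual check lives in load_model() and its exact form is not shown in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical validation: reject DFlash together with more than one slot.
// Returns true when the configuration is acceptable; otherwise fills err.
static bool dflash_parallel_ok(bool dflash_enabled, int n_parallel, std::string & err) {
    assert(n_parallel >= 1);
    if (dflash_enabled && n_parallel > 1) {
        err = "DFlash requires --parallel 1 until target_features is keyed per seq_id";
        return false;
    }
    err.clear();
    return true;
}
```

Running the check before any model loading keeps the failure cheap: the process exits before the draft model is read from disk.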

Follow-up to review on PR #53.
@marksverdhei

Author

DFlash Speculative Decoding v2 Hot-Patch Deployed & Verified

I have pushed the v2 hot-patch fixes to the feat/dflash-integration branch.

Fixes Included:

  1. Slot-Reuse NaN Logits: Forced GPU compute scheduler re-reservation in common_speculative_impl_dflash::begin() via llama_set_dflash_need_reserve() to prevent uninitialized memory cascades when reusing slots.
  2. Split-Prefill Segfault: Modified llama_context::decode() to only clear the target features buffer when evaluating a fresh prompt (a token with pos == 0). The buffer now safely accumulates features across consecutive prefill chunks (e.g. prompt checkpoints or prompts larger than n_batch=512). The target features buffer is cleared at the end of draft() once consumed.
  3. Rollback Drift/Mismatch: Added dynamic rollback detection at the start of draft(). If prompt size has decreased (n < dflash_n_past), we resize accumulated_ctx and update dflash_n_past to match the target's KV cache rollback position exactly.
  4. Pointer Safety: Added validation to check batch_inp.pos != nullptr before iterating, resolving segmentation faults during model warmups/empty runs.

Verification Results on Titan Pod:

  • Smoke Test (5x short prompts): PASS (55 / 735 accepts = 7.48% accept rate, 0% NaN rate).
  • Heierchat-Shape Streaming Test (>512 tokens history): PASS (200 OK, streamed complete response content with positive draft acceptance and no NaNs; verified successfully across consecutive runs with the child process remaining alive and healthy).
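
The rollback detection in fix 3 can be sketched as below. The state struct and the layout of `accumulated_ctx` (one row of `n_feat` floats per past token) are assumptions made for illustration; only the condition `n < dflash_n_past` and the two updated fields come from the comment above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical drafter-side state; the row-per-token layout is an assumption.
struct dflash_state_sketch {
    size_t dflash_n_past = 0;
    std::vector<float> accumulated_ctx;
};

// Returns true if a rollback was detected and the state was truncated to n tokens.
static bool handle_rollback(dflash_state_sketch & st, size_t n, size_t n_feat) {
    if (n >= st.dflash_n_past) {
        return false; // prompt did not shrink; nothing to do
    }
    st.accumulated_ctx.resize(n * n_feat); // drop rows past the rollback point
    st.dflash_n_past = n;                  // realign with the target's KV position
    return true;
}
```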

Documents the four follow-up commits after Gate C landed (770fed4,
b6b96bb, fbefb96, 327f947): multi-ubatch prefill APPEND,
target_features lifetime fixed to a single decode call, --parallel > 1
startup refusal, and the slot-reuse / split-prefill / rollback patches.
ggml_graph_get_tensor(gf, "inp_pos_full") had zero matching set-name sites
across src/, tools/, examples/, common/. The if(pos_full) block always
no-op'd. Verified no behavior change: libllama.so + llama-server rebuild
clean against build-cuda.
@marksverdhei

Author

Superseded by the new PR — feat/dflash-integration squashed and rebased onto the post-rewrite ht. See #62.

@marksverdhei marksverdhei deleted the feat/dflash-integration branch June 4, 2026 17:37
marksverdhei added a commit that referenced this pull request Jun 12, 2026
* scripts(dflash): Round-12 target-precision bench + parity scaffold + gguf guard

Three additive scripts for the DFlash accept-rate investigation (Round-12),
none touching tracked source so they sit cleanly alongside the PR #53 squash:

- gguf-meta.py: numpy-free GGUF header reader with --check-instruct, which
  refuses base-fine-tune and truncated/stub GGUFs. Prevents the base-vs-instruct
  confound (an -it-trained DFlash drafter benched against a base target).
- bench-dflash-target-sweep.sh: sweeps the TARGET quant (drafter fixed) to test
  whether target-side quant noise off the drafter's bf16 training distribution
  drives the 8% vs ~21% accept gap. Accept recomputed from raw n_accept/n_drafted
  counts; mean +/- sample stddev over N runs; REAL(>1sigma)/within-noise deltas.
- dflash-logit-parity.py: scaffold for FORWARD logit parity vs the z-lab PyTorch
  drafter (Round-7b only did weight parity). Constants read data-driven from the
  drafter config.json; reference forward marked TODO(zlab) pending the z-lab
  modeling code (HF repo ships weights only).
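
The accept-rate statistics named in the sweep script can be sketched as follows. The scripts themselves are shell and Python; this C++ version only illustrates the arithmetic. The interpretation of "REAL(>1sigma)" used in `delta_is_real` (gap between means larger than the larger of the two sample standard deviations) is an assumption, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct run_counts { long n_accept; long n_drafted; };

// Accept rate recomputed from raw counts rather than from a printed percentage.
static double accept_rate(const run_counts & r) {
    return r.n_drafted > 0 ? double(r.n_accept) / double(r.n_drafted) : 0.0;
}

static double mean_of(const std::vector<double> & x) {
    assert(!x.empty());
    double s = 0.0;
    for (double v : x) s += v;
    return s / double(x.size());
}

// Sample standard deviation (divides by N - 1); 0 for fewer than two runs.
static double sample_stddev(const std::vector<double> & x) {
    if (x.size() < 2) return 0.0;
    const double m = mean_of(x);
    double ss = 0.0;
    for (double v : x) ss += (v - m) * (v - m);
    return std::sqrt(ss / double(x.size() - 1));
}

// Assumed reading of "REAL(>1sigma)": the means differ by more than one stddev.
static bool delta_is_real(const std::vector<double> & a, const std::vector<double> & b) {
    const double gap = std::fabs(mean_of(a) - mean_of(b));
    return gap > std::max(sample_stddev(a), sample_stddev(b));
}
```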

* scripts(dflash): gguf-meta --check-instruct rejects truncated tensor data

The guard validated the GGUF header but not that the tensor DATA was present, so
a file truncated mid-write (valid header, missing weights) passed --check-instruct
and would have been benched — loading garbage or crashing mid-run. Caught
empirically: the corrupt gemma-4-31B-it-Q5_K_M.gguf (1.5GB, header intact) slipped
through.

read_meta() now walks the tensor-info section, computes the minimum file size
implied by the tensor offsets + alignment, and sets _data_complete. --check-instruct
rejects when actual size < implied minimum. Same failure class as the HF-xet silent
shard drop the download step hit. Verified: corrupt Q5 (1.5GB < 21.7GB) REFUSED;
Q8_0/BF16/Q4_K_M/IQ4_XS all complete and ACCEPT.
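
The implied-minimum-size check can be sketched as below. gguf-meta.py is a Python script; this C++ version only illustrates the computation, and the struct and function names are hypothetical. It assumes the GGUF layout in which tensor offsets are relative to the start of an aligned data section that follows the header and tensor-info section, and that each tensor's byte size has already been derived from its type and shape.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-tensor record: offset within the data section and byte size.
struct tensor_info_sketch { uint64_t offset; uint64_t n_bytes; };

static uint64_t align_up(uint64_t x, uint64_t alignment) {
    assert(alignment > 0);
    return (x + alignment - 1) / alignment * alignment;
}

// Smallest file size that could contain every tensor's data.
static uint64_t implied_min_size(uint64_t header_end, uint64_t alignment,
                                 const std::vector<tensor_info_sketch> & tensors) {
    const uint64_t data_start = align_up(header_end, alignment);
    uint64_t end = data_start;
    for (const auto & t : tensors) {
        end = std::max(end, data_start + t.offset + t.n_bytes);
    }
    return end;
}

// A file truncated mid-write has a valid header but fails this check.
static bool data_complete(uint64_t actual_size, uint64_t header_end, uint64_t alignment,
                          const std::vector<tensor_info_sketch> & tensors) {
    return actual_size >= implied_min_size(header_end, alignment, tensors);
}
```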