feat(dflash): complete DFlash speculative decoding integration by marksverdhei · Pull Request #53 · heiervang-technologies/ht-llama.cpp

marksverdhei · 2026-05-22T10:06:28Z

Summary

DFlash speculative decoding (upstream PR ggml-org#22105 by ruixiang63) integrated for Gemma4-31B target + Anbeeld/gemma-4-31B-it-DFlash-GGUF drafter.

Status: functional but underperforming. Best Q6_K acceptance is 11.36% (mean 8.89% over 3 runs) on conversational prompts. Reference acceptance per vLLM PR #41703 on Gemma4-26B with the same drafter shape: MT-Bench 21.68%, HumanEval 44.69%. A ~12pp gap to the MT-Bench reference remains; narrowing down the cause is blocked on a BF16 drafter bench.

Hold for merge until acceptance reaches the net-speedup break-even point (~25% accept at block_size=16). The PR is preserved as a long-running working branch.
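The break-even figure can be reasoned about with a very rough throughput model. The sketch below is an illustration only: it assumes every iteration drafts block_size - 1 tokens, that the accept rate is accepted/drafted, and that one draft-and-verify iteration costs `iter_cost` baseline decode steps. `iter_cost` is a hypothetical parameter that would have to be measured; it is not a number from this PR.

```python
def tokens_per_iteration(accept_rate: float, block_size: int = 16) -> float:
    """Expected tokens committed per draft+verify iteration.

    Assumes block_size - 1 drafted tokens per iteration, plus the one token
    the target always contributes on its own.
    """
    drafted = block_size - 1
    return 1.0 + accept_rate * drafted


def net_speedup(accept_rate: float, iter_cost: float, block_size: int = 16) -> float:
    """Throughput relative to plain decoding (> 1.0 means a net win).

    iter_cost is hypothetical: the cost of one draft+verify iteration,
    expressed in units of one baseline decode step.
    """
    return tokens_per_iteration(accept_rate, block_size) / iter_cost


def break_even_accept(iter_cost: float, block_size: int = 16) -> float:
    """Accept rate at which net_speedup() equals exactly 1.0."""
    return (iter_cost - 1.0) / (block_size - 1)
```

Under this model, a 25% break-even at block_size=16 would correspond to an iteration cost of about 4.75 baseline decode steps; a cheaper iteration lowers the bar, a more expensive one raises it.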

What's in this branch

Core integration (e354dd747..449638691):

LLM_ARCH_DFLASH + KV namespace (target_layer_ids, block_size, mask_token_id, n_target_features)
DFlash encoder + decoder graph in src/models/dflash.cpp
Three-stage feature extraction pipeline (tag → capture → copy) via dflash_extract_N cb hooks
common_speculative_impl_dflash with full encode→accumulate→decode→sample loop
Public API: llama_set_dflash, llama_get_dflash_target_features, llama_set_dflash_accumulated_target_ctx, model helpers
Server flag: --dflash for /v1/chat/completions
model: "any" resolves to most-recently-used resident model on the router (c468706cd)

Root-cause fix (b0a828e8e):

Drafters share tok_embd + lm_head with the target. For Gemma4 targets, the drafter must inherit sqrt(n_embd) noise embedding normalization and final_logit_softcapping = 30.0 (the transforms Gemma4's pipeline applies around those shared weights). Per vLLM PR #41703.
llama-context.cpp cross-binding sets f_embedding_scale and f_final_logit_softcapping when target arch is LLM_ARCH_GEMMA4.
Watch out: build_inp_embd auto-applies f_embedding_scale (Granite-arch path) — do NOT also add a manual ggml_scale in the drafter graph, or you double-scale.
Lift: Q6_K mean 6.88% → 8.89% (+2pp), best run 11.36%.
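The two inherited transforms are small enough to sketch in plain Python. This is an illustration, not the ggml graph code: the helper names are hypothetical, the hidden size 5376 is taken from the commit log further down, the softcap uses the usual cap * tanh(x / cap) form, and the BF16 rounding of the scale factor is deliberately left out.

```python
import math

N_EMBD = 5376    # Gemma4-31B hidden size quoted in the commit log
SOFTCAP = 30.0   # final_logit_softcapping from the PR description


def scale_embedding(embd, n_embd=N_EMBD):
    """Scale a raw token embedding by sqrt(n_embd). Must be applied once only."""
    s = math.sqrt(n_embd)
    return [x * s for x in embd]


def softcap(logits, cap=SOFTCAP):
    """Squash logits into (-cap, cap) via cap * tanh(logit / cap)."""
    return [cap * math.tanh(x / cap) for x in logits]


def argmax(xs):
    """Index of the largest element."""
    return max(range(len(xs)), key=lambda i: xs[i])
```

Because tanh is strictly increasing, `argmax(softcap(logits))` matches `argmax(logits)`, which is why the softcap does not change greedy drafting. Calling `scale_embedding` twice multiplies by n_embd instead of sqrt(n_embd), which is the double-scale mistake described above.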

Correctness fix (4b10869a7):

Per-layer SWA mask in drafter decoder. Drafter GGUF carries attention.sliding_window=2048 and sliding_window_pattern=[T,T,T,T,F]. Bench-neutral at ctx≤2048, but required for correct output once the context grows past the window.

Audit conclusions (HANDOFF.md on branch)

Tested and ruled out:

Feature-slice off-by-one (Round-2)
Extraction-point off-by-one (Round-3, A/B within noise)
Per-layer renorm of fused_target (Round-4, made accept WORSE)
Structural divergence vs upstream PR ggml-org/llama.cpp#22105 ("[Speculative decoding] feat: add DFlash support") / z-lab PyTorch / vLLM qwen3_dflash (Round-5)
GGUF tensor inventory miss (Round-7, 58/58 tensors map cleanly)
Q6_K conversion fidelity (Round-7b, max 2.155% relative RMS error vs bf16 safetensors)

Tooling added in this branch

scripts/bench-dflash.sh — systematic Q4/Q6/Q8/BF16 × MT-Bench/HumanEval × 3 runs with VRAM guard
scripts/compare-dflash-weights.py — per-tensor safetensors↔GGUF numerical comparison
tests/test-dflash.cpp — 8 unit tests (arch registration, hparams, graph params, API symbols, etc.)

Tests

All 8 DFlash unit tests pass (./build-dflash/bin/test-dflash)
Build clean for llama-cli, llama-server, llama-speculative-simple

Known gaps (not blockers for this PR, captured in HANDOFF)

Acceptance gap to vLLM reference (~12pp) — likely Q6_K quant compounding through 5 drafter layers. BF16 drafter test pending VRAM coordination.
Multi-seq support: common_speculative_impl_dflash::draft() processes only seq_id=0.
gemma4.cpp DFlash extraction has three env-gated modes (LLAMA_DFLASH_EXTRACT={late,early,upstream}) — late wins by a hair, kept as default. Other env knobs documented in HANDOFF.

Reference

Upstream PR ggml-org/llama.cpp#22105 ("[Speculative decoding] feat: add DFlash support")
vLLM PR vllm-project/vllm#41703 ("[Spec Decode] Fix Gemma4 DFlash batched verification"): Gemma4-specific fixes, the source of our embed-scale + softcap fix
z-lab/dflash PyTorch reference

🤖 Generated with Claude Code

- Add LLM_ARCH_DFLASH arch enum, KV keys, tensor enums
- Add DFlash hparams (target_layer_ids, block_size, mask_token_id)
- Add DFlash cparams (dflash_extract_enabled)
- Add llama_dflash struct + graph context fields
- Add DFlash C API (llama_set_dflash, llama_get_dflash_target_features, etc.)
- Add DFlash extraction pipeline in llama-context (set_dflash, extract_dflash_features, graph_get_cb hook)
- Add DFlash graph type handling, position tensor fill, decoder context init
- Add llama_model_dflash class (load_arch_hparams/tensors, build_arch_graph)
- Add llm_build_dflash_encode/decode graph builders
- Add DFlash model file (src/models/dflash.cpp)
- Add --dflash CLI arg, DFLASH speculative type
- Wire DFlash in server-context, speculative-simple, convert_hf_to_gguf

Source: upstream PR ggml-org#22105 by ruixiang63 (ggml-org/llama.cpp). Adapted for post-Spring-Cleaning refactor master. Still WIP: llama-model-loader arch registration and some model wiring pending.

Resetting llama-context.cpp/h, llama-model.cpp/h, llama-graph.cpp, models.h, model-saver.cpp to master. The squash-merge generated too many stale patch residues. Will reapply DFlash additions cleanly.

…odel-only)

Adds DFlash speculative decoding library infrastructure:

- LLM_ARCH_DFLASH arch enum + KV keys + tensor enums
- DFlash hparams (target_layer_ids, block_size, mask_token_id)
- DFlash cparams (dflash_extract_enabled)
- llama_dflash struct (extraction layer indices, target features)
- llama_model_dflash class (load_arch_hparams/tensors, build_arch_graph)
- llm_build_dflash_encode/decode graph builders
- DFlash C API (llama_set_dflash, etc.)
- --dflash CLI arg, COMMON_SPECULATIVE_TYPE_DFLASH

Extraction pipeline (llama-context.cpp) TBD: needs fresh hooks written for post-Spring-Cleaning master.

- Architecture: LLM_ARCH_DFLASH enum + KV params (target_layer_ids, block_size, mask_token_id) with GGUF model loading
- Model: DFlash encoder (fc fusion + rms norm) and single-layer cross-attention decoder with noise-token input
- Graph: llama_dflash struct routed through llm_graph_params/context, with target_model pointer in llama_context_params
- Extraction pipeline: graph_get_cb intercepts dflash_extract_N names, extract_dflash_features reads tagged tensors post-compute
- API: llama_set_dflash, llama_get_dflash_target_features, llama_set_dflash_accumulated_target_ctx, plus model helpers
- Speculative decoder: common_speculative_impl_dflash with full encode->accumulate->decode->sample draft loop in common/speculative.cpp
- Model injection: llama model graph builder tags hidden states at DFlash target layers via cb('dflash_extract_N')

Compiles clean: llama-server, llama-cli, all libraries.

- Removed 9 .orig/.rej patch backup files accidentally committed
- Added tests/test-dflash.cpp with 8 tests covering:
  * Arch registration and name lookup
  * Tensor info maps (layer/op assignment)
  * hparams defaults for DFlash fields
  * llama_dflash struct lifecycle and clear() semantics
  * llm_graph_params dflash pointer wiring
  * COMMON_SPECULATIVE_TYPE_DFLASH enum placement
  * llama_context_params target_model field
  * DFlash API symbol link-time resolution
- Registered test-dflash in tests/CMakeLists.txt
- All 8 tests pass

…hrough POST /v1/chat/completions with {"model": "any"} now resolves to whichever model the router currently has resident in memory — preferring LOADED over SLEEPING, and within each tier the most-recently-used. Returns "no model is currently resident in memory" (HTTP 400) if nothing is loaded. Lets clients reach the active model without having to track which one the router decided to keep resident. The sentinel "any" is reserved at the router lookup layer; a user model literally named "any" would be unreachable via that string (still reachable via aliases). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
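The resolution rule for the "any" sentinel (tier first, then recency) can be sketched as follows. This is a hypothetical Python model of the policy described above, not the router's C++ code: `ModelEntry`, `NoResidentModel`, and `resolve_any` are invented names, and the status strings simply mirror the LOADED/SLEEPING wording of the commit message.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelEntry:
    name: str
    status: str        # "LOADED", "SLEEPING", or anything else (not resident)
    last_used_ms: int  # 0 if the model has not been used since router start


class NoResidentModel(Exception):
    """Raised when nothing is resident; the router maps this case to HTTP 400."""


def resolve_any(models: List[ModelEntry]) -> str:
    """Pick the model that the sentinel "any" should resolve to."""
    # Prefer the LOADED tier over SLEEPING; within a tier, most recently used wins.
    for tier in ("LOADED", "SLEEPING"):
        candidates = [m for m in models if m.status == tier]
        if candidates:
            return max(candidates, key=lambda m: m.last_used_ms).name
    raise NoResidentModel("no model is currently resident in memory")
```

Note that a LOADED model always beats a SLEEPING one, even if the sleeping model was used more recently.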

End-to-end DFlash speculative decoding compiles, loads, and runs against gemma-4-31B-it-Q4_K_M + Anbeeld/gemma-4-31B-it-DFlash-GGUF (Q4_K_M). But acceptance rate is 4.9% (9 of 183 drafted tokens), so DFlash is ~2x slower than baseline (14.9 vs 29.1 t/s on the centurion 3090, FA on, temp 0). Integration is correct mechanically; the perf claim does not land yet.

Ruled out:
- Arch loading: llm_arch_from_string strips -draft, model-loader passes arch_name_override so dflash-draft.* KV lookups resolve.
- Tokenizer mismatch: vocab + merges sha256 byte-identical between target and drafter, only EOS designation differs (target=106 end_of_turn, drafter=1 eos; end-of-stream only, doesn't affect mid-stream verification).
- Drafter graph structure: cross-attention over target_hidden with pos_ctx + kq_mask filled in llm_graph_input_dflash::set_input.
- Feature extraction hooks fire on the right layers (target_layer_ids = [1,12,23,35,46,57] for the 60-layer target); gemma4.cpp + llama.cpp both tag post-l_out hidden state.

Prime suspect (logged in HANDOFF.md): in common_speculative_impl_dflash::draft() the call to llama_get_dflash_target_features(ctx_tgt) returns features for the last ubatch (K+1 tokens during verification) but we slice the first n_new (typically 1-2 committed). If ubatch position order doesn't line up with commit order, the drafter gets fed features for discarded tokens, a cascading misalignment that would produce ~5% acceptance even with a perfectly aligned drafter (per snoop-kube's analysis).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…rate hunt

Adds two env knobs to common_speculative_impl_dflash::draft() so the next bench iteration can isolate the 4.9% accept root cause without rebuilding:

LLAMA_DFLASH_DEBUG=1
  Print per-iteration (n_new, n, dflash_n_past_old, features[0..4]). Confirms slice/alignment hypothesis by exposing actual values.

LLAMA_DFLASH_CTX_WINDOW=<n>
  Cap accumulated context to N tokens (default 512). Set to 0 to disable truncation and feed full accumulated context to drafter.

HANDOFF expanded with structural-consistency findings vs the dflash-pr POC (POC expects ffn_norm tensors; our GGUF has post_attention_norm, so the POC graph isn't directly portable) and a prioritized list of next experiments (graph-reuse disable via LLAMA_GRAPH_REUSE_DISABLE=1, ctx truncation disable, BF16 drafter, extraction-point swap).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Added LLAMA_DFLASH_EXTRACT=early env toggle to gemma4.cpp that captures target hidden states BEFORE the per-layer-embedding processing and out_scale (vs default: after l_out). Late extraction empirically wins on Q6_K and is roughly level on Q4_K_M (Q6_K: late 10.7% vs early 6.2%; Q4_K_M: late 4.9% vs early 5.6%), so the default stays, but the knob is now wired for future ablations.

Round-2/3 bench findings in HANDOFF.md:
- Graph reuse INNOCENT: LLAMA_GRAPH_REUSE_DISABLE=1 gives identical 4.92% accept, ruling out cross-iteration tensor corruption.
- ctx_window truncation has minor effect (+25% accept when disabled).
- Drafter quant has bounded effect (~4-10% accept range across Q4_K_M, Q5_K_M, Q6_K, Q8_0, BF16).
- Extraction point not the major lever.

Remaining hypotheses: per-layer renorm of fused_target inside the dflash decoder graph, or RoPE position scheme not matching training.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…worse

Added env-gated experiment LLAMA_DFLASH_PER_LAYER_RENORM in src/models/dflash.cpp that re-applies layer.attn_norm to fused_target before each layer's wk/wv projection.

3x3 A/B on Q6_K with centurion-llm scaled to 0:

renorm OFF: 5.56% / 6.22% / 6.22% (mean 6.00%)
renorm ON:  2.02% / 4.92% / 4.30% (mean 3.75%)

Per-layer renorm degrades accept by ~2.25pp. Drafter was NOT trained with per-layer ctx renorm; current single-norm-at-entry implementation (matching the POC design) is correct. Env-gate stays in (defaulted off) for future ablation symmetry.

Also surfaces a separate finding: Q6_K baseline accept varies ±2pp run-to-run on same seed/prompt/code. The HANDOFF Round-3 table value of 10.69% for Q6_K appears to be an outlier or stale-code state; reproducible range under current HEAD is 4.3-6.2%. Possible causes: sampler RNG, KV cache state leakage, CUDA reduction non-determinism.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…rence

Full implementation audit against authoritative sources (upstream ggml-org/llama.cpp PR ggml-org#22105, z-lab/dflash PyTorch reference, vLLM qwen3_dflash module, Anbeeld drafter GGUF metadata dump). Findings (full table in HANDOFF.md):

Matches reference cleanly: fc + hidden_norm once outside loop; K/V concat order [ctx, noise]; attn_norm on noise only; q_norm/k_norm placement; V not normed/RoPE'd; block content [id_last, MASK×15]; sample positions 1..15; attn_post_norm as FFN-input norm (Gemma-specific); SwiGLU FFN; lm_head and tok_embd bound to target at llama-context.cpp:376-377; non-causal attention.

Divergences worth noting (none explain the 4-8% accept):
- SWA pattern [T,T,T,T,F] in drafter GGUF, not implemented in our decoder graph. Irrelevant at ctx_window=512.
- Local position scheme vs reference's monotonic absolute positions. Equivalent under RoPE-relative attention.

Round-5 bench: extraction-point ablation (LLAMA_DFLASH_EXTRACT=upstream added in gemma4.cpp, tags inpL at layer start to match upstream PR's +1-shift convention):

mode=late (current default):    8.51% / 7.64% / 4.49%   mean 6.88%
mode=upstream (PR convention):  4.49% / 7.64% / 5.23%   mean 5.79%

Means overlap within one sigma; exact accepted/drafted counts repeat across modes (11/144 and 7/156), so the bench has ~3 RNG-driven states and extraction-point is not the decision boundary. Late wins by a hair, weakly suggesting Anbeeld's Gemma converter did NOT apply the +1 shift.

The gemma4.cpp refactor consolidates the three extraction modes (late default / early / upstream) behind a single dflash_mode enum to avoid double-tagging when multiple modes' static lambdas would otherwise both fire.

Conclusion: no single structural bug at the llama.cpp level explains the 4-8% ceiling vs published 30-50%. Top remaining suspect is GGUF conversion fidelity vs Anbeeld safetensors (requires HF download + reference inference setup, parked).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
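The "within one sigma" reading of the Round-5 numbers can be checked with the standard library. A small sketch, using only the six accept rates quoted in the commit message above; the helper names are made up for illustration, and with three runs per mode the sample standard deviation is only a coarse noise estimate.

```python
from statistics import mean, stdev

# Accept rates (%) from the Round-5 extraction-point ablation.
late = [8.51, 7.64, 4.49]
upstream = [4.49, 7.64, 5.23]


def summarize(runs):
    """Return (mean, sample standard deviation), rounded to 2 decimals."""
    return round(mean(runs), 2), round(stdev(runs), 2)


def within_one_sigma(a, b):
    """True if the two means differ by no more than the larger sample stdev."""
    return abs(mean(a) - mean(b)) <= max(stdev(a), stdev(b))
```

Here `within_one_sigma(late, upstream)` evaluates to True: the means differ by roughly one percentage point while the run-to-run spread is about two.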

Per snoop-kube's audit review: SWA absence is latent at ctx_window=512 but would matter the moment LLAMA_DFLASH_CTX_WINDOW exceeds 2048 — clarify the regime where the divergence becomes real, so future ctx-expansion work flags it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ftcap

Root-cause find from vLLM PR #41703 description: "DFlash shares target embeddings. For Gemma4 targets, the draft path now applies the target embedding normalization (sqrt(hidden_size)) and passes final_logit_softcapping into LogitsProcessor."

DFlash drafters share tok_embd with the target model. Gemma4's standard input pipeline scales token embeddings by sqrt(n_embd) (~73x at hidden 5376) before they hit the first decoder layer. The drafter was trained against those scaled embeddings, so feeding raw embeddings is ~73x off. Same story for the final logit softcap (30.0): the drafter was trained against the softcapped distribution, so its lm_head output (which shares the target's lm_head) needs the same transform applied.

llama-context.cpp cross-binding: when target arch == LLM_ARCH_GEMMA4, inherit f_embedding_scale = sqrt(target.n_embd) (BF16-rounded to match Gemma4 training precision) and f_final_logit_softcapping = target's value. For non-Gemma4 targets (e.g. Qwen3) explicitly zero both, so non-Gemma drafters do not pick up stale defaults.

dflash.cpp:
- f_embedding_scale: applied automatically by build_inp_embd via its existing Granite-arch code path (llama-graph.cpp:1827-1829). No manual ggml_scale needed in dflash.cpp; a manual scale double-applies because build_inp_embd already does it. (First fix attempt did this manual scale, tanked Q6_K from 6.88% → 2.65%. Lesson noted.)
- f_final_logit_softcapping: applied manually after lm_head matmul, matching gemma4.cpp:443-447 exactly. Monotonic so does not affect greedy argmax, but matches drafter's training distribution.

Bench result (Q6_K drafter, 3 runs, q8_0 KV):

baseline:  8.51% / 7.64% / 4.49%    mean 6.88%
with fix:  6.80% / 11.36% / 8.51%   mean 8.89% (+2pp)

11.36% Q6_K is the highest accept rate of the entire dflash project (prior best was HANDOFF Round-3 lucky 10.69% single sample).

The fix is partially vindicated: it moved the needle and crossed double digits cleanly for the first time. But +2pp mean is not the published 30-50% range, so something else is still missing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Update HANDOFF with the b0a828e fix result and the build_inp_embd double-scale footgun for future arch ports. Best Q6_K accept crosses double digits (11.36%) for the first time; mean lift +2pp. Reframes the goal: vLLM PR #41703 published 21.68% on MT-Bench (conversational) and 44.69% on HumanEval (code). Our prompt is MT-Bench-class so the realistic target is ~21%, not 44%. We're at 8.89% mean, so a ~12pp gap remains. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Compared safetensors against Anbeeld GGUF tensor list. All 58 tensors present in both with correct shapes. Hypothesis 1 (GGUF conversion fidelity at the inventory level) ruled out. What remains untested: per-tensor numerical comparison (bf16 reference values vs Q6_K dequantized). Would need torch + ggml-py + a few hours to script properly. Next logical diagnostic but not started. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…an Q6_K

Wrote /tmp/compare_dflash_weights.py to compare bf16 safetensors against dequantized GGUF Q6_K, per tensor. Results:

F32 norm tensors (22): exact match (0% error)
Q6_K weights (36):     1.78% mean relative RMS error, max 2.155%

No outlier tensor. Q6_K is a clean quantization of the z-lab safetensors. Hypothesis 1 (GGUF conversion fidelity) is RULED OUT at both inventory and numerical levels.

The remaining accept-rate gap (~12pp to vLLM's 21% MT-Bench reference) is most likely Q6_K compounding through 5 drafter layers. The only way to confirm is a BF16 drafter bench (needs ctx <= 2048 + VRAM coordination, currently OOMs at ctx=4096).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
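For readers unfamiliar with the metric, one common definition of relative RMS error is sketched below. This is an assumption about what the comparison computes, not a copy of the script: the function name is hypothetical, and it works on flat Python lists rather than on real safetensors or GGUF tensors.

```python
import math


def relative_rms_error(ref, approx):
    """RMS of (approx - ref) divided by RMS of ref, over flat value lists.

    ref would hold the bf16 reference values and approx the dequantized
    values of the same tensor (both assumed already loaded and flattened).
    """
    if len(ref) != len(approx):
        raise ValueError("shape mismatch")
    err = math.sqrt(sum((a - r) ** 2 for r, a in zip(ref, approx)) / len(ref))
    norm = math.sqrt(sum(r * r for r in ref) / len(ref))
    # An all-zero reference has no meaningful relative error; report 0.
    return err / norm if norm > 0 else 0.0
```

Multiplying the result by 100 gives a percentage comparable to the per-tensor figures reported above; an outlier tensor would show up as a value far above the rest.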

Move /tmp/compare_dflash_weights.py into scripts/compare-dflash-weights.py so the diagnostic survives reboot and is reproducible. Parameterized via argv for arbitrary safetensors + GGUF paths; defaults to the Gemma4 31B DFlash drafter pair under \$MODELS. Used by Round-7b to confirm the Anbeeld GGUF Q6_K is a clean quantization of the z-lab safetensors bf16 (mean 1.78% relative RMS error, max 2.155% — normal Q6_K quantization noise, no outlier tensor indicating a converter bug). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Drafter GGUF carries attention.sliding_window=2048 and attention.sliding_window_pattern (e.g. [T,T,T,T,F] for the Anbeeld Gemma4 drafter). Our decoder previously used uniform full attention on all layers. Latent at ctx_window<=2048 (no token gets windowed out) but breaks correctness the moment ctx grows past the window: SWA layers were trained against masked attention but run at inference without it.

Implementation:

src/models/dflash.cpp:load_arch_hparams
  Reads LLM_KV_ATTENTION_SLIDING_WINDOW into hparams.n_swa, sets swa_type = LLAMA_SWA_TYPE_STANDARD, populates hparams.swa_layers from LLM_KV_ATTENTION_SLIDING_WINDOW_PATTERN. Re-uses the std::array<int,16> template instantiation already provided for dflash_target_layer_ids (vector<bool>/vector<int> overloads are not template-instantiated for get_arr). Falls back to all-SWA if the pattern KV is absent but n_swa > 0.

src/llama-graph.{h,cpp}:llm_graph_input_dflash
  Added optional second mask tensor kq_mask_swa with the bucket-padding mask PLUS per-(q_pos,k_pos) sliding-window masking. Only allocated when the drafter is_swa_any().

src/models/dflash.cpp:llm_build_dflash_decode
  Per-layer mask selection: SWA layers route through kq_mask_swa, dense layers keep kq_mask.

At ctx_window <= n_swa the SWA mask is numerically identical to the full mask, so this is bench-neutral for current configs (LLAMA_DFLASH_CTX_WINDOW=512 default; SWA window 2048).

Verified:
- cmake build clean for llama-speculative-simple + test-dflash
- 8/8 dflash unit tests pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
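The bench-neutrality claim follows directly from how a sliding-window mask is built. The sketch below is a simplified Python model, not the llm_graph_input_dflash code: it assumes a key is windowed out when q_pos - k_pos >= n_swa, ignores the bucket-padding part of the mask, and uses hypothetical helper names.

```python
def build_masks(q_pos, k_pos, n_swa):
    """Return (full_mask, swa_mask) as lists of rows; True means the key is visible.

    Assumption: under SWA a key is hidden from a query when q - k >= n_swa.
    """
    full = [[True for _ in k_pos] for _ in q_pos]
    swa = [[(q - k) < n_swa for k in k_pos] for q in q_pos]
    return full, swa


def pick_mask(layer, swa_layers, full, swa):
    """Per-layer selection: SWA layers get the windowed mask, dense layers the full one."""
    return swa if swa_layers[layer] else full
```

With 512 context positions plus a 16-token block and n_swa=2048, no query is ever 2048 or more positions away from a key, so both masks are identical. Shrinking n_swa below the context length makes the oldest keys disappear on the SWA layers only.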

Per-layer SWA mask now implemented; bench-neutral at ctx<=2048 but required for correct output under future ctx expansion. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds scripts/bench-dflash.sh — systematic bench across drafter quants (Q4/Q6/Q8/BF16) × prompt classes (MT-Bench-style conversational, HumanEval-style code) × N runs per condition. Outputs a timestamped markdown table to /tmp/dflash-bench-<ts>.md. Each (drafter, prompt) pair runs 3x by default to show the known variance floor (~2-3pp run-to-run on same seed). Per-run stderr is preserved per-condition under /tmp/dflash-bench-<ts>-runs/ for debugging individual outliers. VRAM guard: warns if <20 GB free (centurion-llm holds ~21 GB when active; coordinate scale-down via snoop-kube first). vLLM PR #41703 published acceptance for Gemma4-26B with this drafter shape — HumanEval 44.69%, MT-Bench 21.68% — included in script header as the comparison baseline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Cover the new SWA code paths added in 4b10869: test_dflash_swa_defaults — confirm hparams default to no-SWA test_dflash_swa_anbeeld_pattern — [T,T,T,T,F] routes through is_swa correctly test_dflash_input_swa_ctor — llm_graph_input_dflash carries n_swa via ctor 11/11 unit tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Surfaces the LRU timestamp the router already tracks (server_model_meta.last_used). Lets clients sort by MRU when picking among resident models without falling back to per-peer priority orderings. Driven by heierchat mission m-20260524-165127-3bb03b — replaces the hard-coded titan>centurion>lithium ranking in heierchat's pickLoadedModel() with a "max(last_used_ms)" policy across the user's pinned peers. Field is 0 when the model has not been used since the router started. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…_ms + any routing

Validates the surface added in mission m-20260524-165127-3bb03b:

1. GET /v1/models: confirm `last_used_ms` is present per-model.
2. POST /v1/chat/completions {"model":"any",...}: confirm response.model is the resolved instance id (not literal "any").

Defaults to the three known cluster peers (titan/centurion/lithium). Override with positional URL args. --test-empty enables the destructive 4xx-on-no-resident check.

Exit 0 if every reachable peer passes; non-zero if any reachable peer fails. Unreachable peers (lithium when asleep) are skipped without failing.

Pre-deploy probe: all peers correctly flag last_used_ms missing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Pre-flight for the unified-llm:dflash-d5c45b5b4 image on titan. Verifies four things in sequence against a deployed router:

1. /v1/models exposes the dflash preset with --dflash + drafter args
2. POST /v1/chat/completions loads the drafter and returns a non-empty completion
3. last_used_ms advances after the POST
4. Second POST confirms the drafter stays warm (not auto-unloaded)

Catches the silent-fallback case where the preset registers but dflash gets stripped at child-process spawn. Without step 1's explicit --dflash check, a degraded non-spec path would still answer requests.

Usage: scripts/smoke-dflash-deployed.sh <peer-url> <model-id>

Notes that the standing AI maintainer role for ht-llama.cpp's ht branch (cutting-edge features, multi-day debug arcs, multi-agent deploy coord) requires Claude Opus capability tier or equivalent. Empirically, models including DeepSeek v4 1.6T have struggled to hold this role coherently across long horizons. Less capable models remain appropriate for scoped tasks handed off by an Opus-class maintainer with concrete instructions. Applies to ht branch only; upstream contribution rules from the preceding sections still govern any work targeting ggml-org/llama.cpp.

The titan-llm entrypoint passes --remap-developer-role to the spawned llama-server child process. This branch (feat/dflash-integration) predates ht's introduction of that flag, so the dflash unified-llm image refused to start with 'invalid argument: --remap-developer-role'.

Cherry-picking the canonical origin/ht commit 070ab65 pulls in TurboQuant kv_cache_type additions (TBQ3_0/TBQ4_0) that are not on this branch and don't compile. Same for a follow-on LoRA-discovery refactor. Doing the surgical port instead:

common/common.h     + bool remap_developer_role = false
common/arg.cpp      + add_opt for --remap-developer-role / LLAMA_ARG_REMAP_DEVELOPER_ROLE
server-common.h     + server_chat_params.remap_developer_role
server-common.cpp   + per-message developer→system rewrite in oaicompat_chat_params_parse
server-context.cpp  + populate chat_params.remap_developer_role from params_base

Smoke: `llama-server --remap-developer-role --help` no longer errors; flag shows in --help output. llama-server build clean.

Unblocks unified-llm:dflash-d5c45b5b4 image rebuild from this branch tip.

…agement

Two bugs in the initial smoke surfaced against titan deploy of dflash-794ddb2df:

- Gemma4 in default thinking mode puts output in .reasoning_content not .content. Smoke checked only .content and reported empty when generation actually succeeded. Fix: also accept reasoning_content as a sign of life, and pass chat_template_kwargs.enable_thinking=false to short-circuit the reasoning preamble for this minimal probe.
- max_tokens=8 was getting fully consumed by the reasoning preamble before any visible content was emitted. Bumped to 64: still small for a smoke but enough headroom across verbose templates.

Also added a positive check that timings.draft_n > 0, which confirms the drafter is actually being invoked, not silently falling through to non-spec decoding. This is the bit a casual smoke would miss; if dflash wasn't engaged we'd still get a normal completion but draft_n would be 0.

Re-ran against titan: PASS all four steps.

Records the production deploy outcome: - Live preset gemma-4-31b-dflash-Q6_K on titan via snoop's image bake - End-to-end smoke green (scripts/smoke-dflash-deployed.sh) - Live accept 4.48% vs centurion bench 8.89% Q6_K mean - Snoop's three hypotheses for the delta deferred (not user-blocking) DFlash is functional in production; net throughput is below break-even at this accept rate but the picker UX works and the route resolves.

…ill the server

Root cause for mission m-20260527-103737 (Markus stuck-generation on titan dflash): cpp-httplib calls send() with CPPHTTPLIB_SEND_FLAGS = 0, i.e. no MSG_NOSIGNAL. When a client disconnects mid-stream and the server tries to write the next chunk, send() raises SIGPIPE. The default handler exits the process silently: no segfault, no log line. The router's bookkeeping never gets updated, so the slot stays marked 'loaded' while pointing to a zombie/defunct port, and every subsequent request returns instant 'proxy error: Could not establish connection'.

The dflash path made this visible because the longer per-iteration compute opens a wider window for the client-cancel-during-send race. Other model paths are vulnerable too; this affects ANY streaming endpoint.

Fix: install signal(SIGPIPE, SIG_IGN) alongside the existing SIGINT/SIGTERM sigactions in llama_server() main on Unix. EPIPE now surfaces as a return from send() and cpp-httplib's normal cancel path handles it.

Verified locally on centurion with the dflash preset:
- kill -SIGPIPE on child PID 10x in a row: child survives all 10
- 10x stream:true POST with curl --max-time 1 (forced disconnect): child stays in R/Rl state across all aborts; immediately serves a followup non-stream request returning OK

Pre-fix (titan): single client-cancel mid-checkpoint -> silent child death -> router wedged loaded but proxies to dead port.
Post-fix: same scenario -> child handles EPIPE, slot releases cleanly, router stays consistent.
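The post-fix behavior (a write to a vanished peer fails with EPIPE instead of killing the process) can be reproduced in a few lines on a Unix system. The sketch below uses Python rather than the server's C++; CPython already ignores SIGPIPE at startup, so the explicit signal.signal call is there only to mirror the fix, and the function name is hypothetical.

```python
import errno
import signal
import socket


def send_after_peer_close() -> int:
    """Write to a socket whose peer has gone away; return the errno observed."""
    # Mirror the fix: with SIGPIPE ignored, the failed write reports an error
    # to the caller instead of terminating the process.
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
    server, client = socket.socketpair()
    client.close()  # simulate a client disconnecting mid-stream
    try:
        server.send(b"data: chunk\n\n")
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        server.close()
```

On Linux the returned value is `errno.EPIPE`, which a streaming handler can treat as an ordinary cancelled request.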

Mission m-20260527-103737. Per Markus directive: test that guards against the dflash NaN regression.

Bug profile being guarded: server-side spec_decode integration emits NaN drafter logits on /v1/chat/completions (jinja chat template path). Drafter argmaxes to <pad> for every position, target rejects every draft, accept rate is 0%. dflash silently adds zero value while consuming GPU.

The /v1/completions path is healthy on the same model, so the bug is chat-template-specific. Smoke targets /v1/chat/completions deliberately.

Verified against titan (currently carries the bug): 3/3 runs, 291 drafts each, 0 accepts -> FAIL (signature detected)
A hypothetical healthy peer would show: 3/3 runs, ~6-15% accept rate -> PASS

Run: scripts/smoke-dflash-no-nan.sh <peer-url> <dflash-model-id> [N_RUNS]

CI integration: post-deploy verify gate against a known dflash peer. Independent of the SIGPIPE fix (a0d9552) which guards a different bug class (client-cancel-mid-stream wedge).
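The failure signature the smoke looks for (drafts produced, none accepted, on every run) is easy to state as a predicate. The following is a hypothetical Python restatement for illustration; the actual check lives in the shell script, and the (draft_n, accepted_n) tuple layout is an assumption.

```python
def nan_signature(runs):
    """True if every run drafted tokens but accepted none.

    runs is a list of (draft_n, accepted_n) pairs, one per smoke run.
    An empty list is not treated as a detected signature.
    """
    return bool(runs) and all(d > 0 and a == 0 for d, a in runs)
```

Three runs of 291 drafts with 0 accepts match the signature, while any run with at least one accepted token does not.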

Root cause for mission m-20260527-103737 (Markus all-NaN drafter logits on titan dflash chat completions):

Prompt cache restores target KV state for cached prefix tokens via llama_state_seq_set_data_ext but does NOT re-extract dflash target features for those positions. After restore, only NEW tokens decode into ctx_tgt's dflash.target_features. The dflash impl's draft() then reads n_new = n - dflash_n_past features, where n_new counts ALL prompt tokens (cached + new). That read runs past the end of the buffer, which only holds the NEW tokens -> OOB read -> garbage features fed to drafter -> drafter forward pass produces NaN logits at every position -> argmax = <pad> -> target rejects every proposal -> 0% accept.

Affected every chat completion after the first for any given slot. /v1/completions has the same vulnerability (the first probe just happened not to hit it).

Verified locally: 3 sequential chat completions, default cache_prompt, 3.51% / 5.88% / 8.33% accept, zero NaN lines. Pre-fix: same sequence gave 0% accept with NaN logits on all three.

Field workaround (still works): cache_prompt: false in request body.

Proper fix (deferred): re-extract dflash features for restored prefix, OR teach dflash impl to skip cached positions on draft start. Either lets us re-enable prompt cache and recover the prefill-skip perf win.
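The mismatch between n_new and the number of feature rows actually extracted is the kind of bug a bounds check turns from silent NaNs into a loud error. A hedged Python sketch follows: `slice_new_features` is a hypothetical helper, the flat one-row-per-decoded-token layout is an assumption, and taking the first n_new rows follows the slicing described in the earlier commit message.

```python
def slice_new_features(features, n_embd, n_total, dflash_n_past):
    """Return the feature rows the drafter should consume for new tokens.

    features is a flat list holding one n_embd-wide row per token that the
    target actually decoded. Raises instead of reading past the buffer.
    """
    n_new = n_total - dflash_n_past
    rows_available = len(features) // n_embd
    if n_new < 1:
        return []  # nothing new to draft from
    if n_new > rows_available:
        raise ValueError(
            f"need {n_new} feature rows but only {rows_available} were extracted "
            "(prefix restored from cache without re-extraction?)"
        )
    return features[: n_new * n_embd]
```

With a cached prefix, n_total counts the restored tokens while the buffer holds only the freshly decoded ones, so the check fires exactly in the situation described above.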

Captures the root cause, fix, and deferred proper-fix paths so future sessions can pick up the cleaner cache-features integration without re-deriving the diagnosis.

…is active

Follow-up to d7a88fd which only gated the GLOBAL server_prompt_cache load path. There is a SECOND cache mechanism at server-context.cpp slot prefill (the per-slot prompt-tokens common-prefix reuse, around the slot.task->params.cache_prompt branch) that ALSO causes the dflash NaN cascade: any path that lets the target skip decoding cached prefix tokens means the dflash feature buffer holds features for only the NEW tokens, but the drafter still reads n_new = n_total - dflash_n_past entries -> OOB read -> NaN logits at every drafter position.

Symptom on titan post-d7a88fdbc rollout: snoop smoke FAIL 0/1455 accept across 5 runs with identical prompts (which max out the per-slot prefix reuse). Default cache_prompt:true still triggers NaN because the request ends up at the second cache path even though the global cache is gated.

Local verify with titan-matched config (--parallel 1, --jinja, --cont-batching), 5 sequential IDENTICAL prompts (matches snoop's smoke pattern):

Pre-this-fix (only d7a88fd): 0/0/0/0/0% accept, NaN on every run
Post (this fix):             7.36/9.35/9.24/8.84/6.90% accept, ZERO NaN

The d7a88fd gate is still correct (covers the global cache); this just adds the missing second gate at the per-slot path. Both ship together; either alone is insufficient.

Verified field workaround cache_prompt:false still works as before (disables both paths via the per-request param).

Mission m-20260527-103737.

d7a88fd gated only the global server_prompt_cache load path. The per-slot prompt prefix reuse at server-context.cpp:2582 (driven by the per-request cache_prompt flag) was a second cache mechanism that remained active and re-triggered the OOB / NaN bug on identical-prompt smoke runs. Both gates needed; reflected in HANDOFF Round-9 writeup.

… DFlash active

Follow-up to 65f46f0, which gated the per-slot prompt-tokens reuse path. A THIRD cache mechanism, the context-checkpoint restore at the SWA-guarded pos_min >= pos_min_thold branch of the slot prefill flow, has the same bug class: load_tgt restores the target's KV state for cached prefix positions but does NOT re-extract dflash target features for them. The subsequent decode only fills features for [n_past..n_total), while the drafter reads n_new = n_total - dflash_n_past entries (dflash_n_past=0 right after common_speculative_begin), overflowing the buffer end → NaN logits → 0% accept.

Why local probes missed it: with --parallel >= 4 (centurion default during dev) requests spread across slots so checkpoints per slot stay small; the path was rarely reachable. Titan ran --parallel 1, so a single slot accumulated checkpoints across consecutive identical-prompt smoke runs: every iteration past the first hit the checkpoint-restore path and triggered the OOB.

Verified on titan with default cache_prompt:true (no client workaround):

baseline pre-fix: 0/873 accept (NaN signature)
this fix: 96/1179 accept (8.14%, NaN absent)

Heierchat's client-side cache_prompt:false workaround (in chat.service.ts) can now be lifted as a follow-up; the gates A+B+C cover the bug at source.

Latent follow-up: the partial-accept rollback at server-context.cpp:3352 (spec-decode use_ckpt_tgt path) does not reset dflash_n_past either, but in practice produces "drafter skips" (n_new < 1) rather than OOB. Filed for a separate commit.

Mission m-20260527-103737.

…titan Documents 44ea356: why Gates A+B looked sufficient locally but the third cache mechanism (context-checkpoint restore at server-context.cpp:2756) kept the NaN cascade alive on titan under --parallel 1 + repeated-prompt traffic. Captures: bug-class similarity to A+B, why --parallel count masks it, hot-patch loop via ubuntu22.04 build container for glibc parity, verified titan smoke (0/873 -> 96/1179 = 8.14% accept), and the deferred spec-rollback follow-up at server-context.cpp:3352. Mission m-20260527-103737 closed.

Heierchat reported child died (zombie) on the first streaming chat request after Gate C smoke verified. Streaming was a red herring; the actual differentiator was prompt length × n_ubatch.

extract_dflash_features (src/llama-context.cpp) was resizing the target_features buffer per ubatch, so for any prompt larger than n_ubatch (default 512) the buffer ended up holding ONLY the last ubatch's features. The drafter then read n_new = n_total - dflash_n_past features from offset 0 of that buffer, overflowing past target_features.size() into adjacent heap memory → SIGSEGV → child process exited unreaped → router saw zombie backend → 500s to clients.

The Round-9/10 smoke prompts ("Write five short haikus about the ocean") were ~35 tokens, fit in a single ubatch, so the buffer happened to be correctly sized and the bug never tripped on the smoke harness. Chat history with system prompt + a couple of turns trivially crosses the 512-token threshold.

This patch makes target_features APPEND across ubatches and across decode calls within a single request:

- extract_dflash_features: replace per-ubatch resize() with append via resize(prev_size + new). Each ubatch's features land at offset prev_floats, preserving everything that came before.
- New API llama_clear_dflash_target_features(ctx) for the drafter to call at request boundaries.
- common_speculative_impl_dflash::begin() calls the clear so a fresh request starts the buffer at zero.
- common_speculative_impl_dflash::draft() now reads from offset (dflash_n_past * n_target_features) instead of offset 0, since the buffer holds all tokens since begin() rather than just "what's new".

The legacy read-from-offset-0 was equivalent to the new behavior only when the buffer happened to hold exactly n_new tokens (single-ubatch decode steady state), a fragile coincidence the smoke happened to satisfy.

Mission m-20260527-103737 post-Gate-C streaming crash.
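The difference between the per-ubatch resize and the append can be sketched with a toy buffer. This is an illustration under stated assumptions, not the branch's extract_dflash_features: `feature_buffer`, `store_overwrite`, `store_append`, and `floats_after_prefill` are hypothetical names, and feature values are zero-filled placeholders.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for the target feature buffer (hypothetical type).
struct feature_buffer {
    std::vector<float> data;

    // Pre-fix behaviour: each ubatch resizes the buffer to its own size,
    // so only the last ubatch's rows survive.
    void store_overwrite(const std::vector<float> & rows) {
        data.resize(rows.size());
        std::copy(rows.begin(), rows.end(), data.begin());
    }

    // Post-fix behaviour: grow by the ubatch size and copy at prev_size.
    void store_append(const std::vector<float> & rows) {
        const std::size_t prev_size = data.size();
        data.resize(prev_size + rows.size());
        std::copy(rows.begin(), rows.end(), data.begin() + prev_size);
    }
};

// Number of floats held after feeding n_tokens through ubatches of n_ubatch.
std::size_t floats_after_prefill(std::size_t n_tokens, std::size_t n_ubatch,
                                 std::size_t n_feat, bool append) {
    assert(n_ubatch > 0);
    feature_buffer buf;
    for (std::size_t done = 0; done < n_tokens; done += n_ubatch) {
        const std::size_t n_cur = std::min(n_ubatch, n_tokens - done);
        const std::vector<float> rows(n_cur * n_feat, 0.0f); // placeholder features
        if (append) {
            buf.store_append(rows);
        } else {
            buf.store_overwrite(rows);
        }
    }
    return buf.data.size();
}
```

With n_ubatch = 512, a 1200-token prompt leaves only the final 176 rows in the overwrite variant and all 1200 rows in the append variant. A 35-token prompt gives the same result either way, which is why the short smoke prompts never exposed the bug.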

Follow-up to 770fed4 that resolves three correctness issues caught in review:

1) begin() clear was the wrong scope: it fires AFTER the prompt-prefill decode that just populated target_features, so the very next draft() read an empty buffer (OOB or all-zero noise into the drafter).
2) begin() also clears target_features for parallel sibling slots that share ctx_tgt (review finding 1b).
3) APPEND-with-offset-read still suffered post-rollback drift because dflash_n_past stays stale while re-decoded features extend the buffer past the offset the drafter reads from (review finding 1a).

The fix replaces "APPEND-always + offset read + begin-clear" with "APPEND-within-decode + clear at decode start + offset-0 read":

- llama_context::decode() now clears target_features at its start (gated on cparams.dflash_extract_enabled), before the ubatch loop. Inter-decode reset, intra-decode accumulate.
- extract_dflash_features still APPENDS so multi-ubatch decodes (any prompt > n_ubatch=512) accumulate correctly. This is the actual fix for the heierchat streaming SIGSEGV.
- common_speculative_impl_dflash::begin() no longer touches target_features; only resets drafter-side dflash_n_past + accumulated_ctx.
- common_speculative_impl_dflash::draft() reverts to offset-0 read. After a decode, the buffer holds exactly that decode's features, so the first n_new positions are precisely the slice the drafter needs (prompt features for the first draft, sampled+accepted-draft features for subsequent drafts).

Net effect: equivalent to the ORIGINAL design's semantics, just with multi-ubatch prefill made correct. All three review concerns fall away because target_features lifetime is now scoped to a single decode call: bounded memory, no cross-slot stomping at begin(), no rollback drift (next decode's clear pre-empts it).

Mission m-20260527-103737. Will hot-patch via build-ubuntu .so for verification on titan before pushing.
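The "clear at decode start, append within decode, read from offset 0" lifetime can be modelled in a few lines. This is a sketch under assumptions, not llama_context::decode(): `target_ctx_model` is a hypothetical type, and each feature row is filled with its token position so the contents are easy to inspect.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the target context's feature buffer lifetime.
struct target_ctx_model {
    std::size_t n_ubatch;               // tokens per ubatch
    std::size_t n_feat;                 // floats per token row
    std::vector<float> target_features; // features of the current decode only

    // One decode call covering positions [pos0, pos0 + n_tokens).
    void decode(std::size_t pos0, std::size_t n_tokens) {
        assert(n_ubatch > 0 && n_feat > 0);
        target_features.clear(); // inter-decode reset
        for (std::size_t done = 0; done < n_tokens; done += n_ubatch) {
            const std::size_t n_cur     = std::min(n_ubatch, n_tokens - done);
            const std::size_t prev_size = target_features.size();
            target_features.resize(prev_size + n_cur * n_feat); // intra-decode append
            for (std::size_t i = 0; i < n_cur; ++i) {
                // Placeholder feature value: the token's position.
                std::fill_n(target_features.begin() + prev_size + i * n_feat,
                            n_feat, static_cast<float>(pos0 + done + i));
            }
        }
    }

    std::size_t rows_held() const { return target_features.size() / n_feat; }
};
```

After decode(0, 1200) with n_ubatch = 512 the buffer holds 1200 rows starting at position 0. After a follow-up decode(1200, 3) it holds exactly 3 rows starting at position 1200. In both cases the slice the drafter needs begins at offset 0, so no stale offset can drift.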

Under continuous batching (--parallel > 1) multiple slots co-decode in a single llama_decode() call. The dflash target_features buffer is a flat vector scoped to ctx_tgt, so a multi-slot batch interleaves features from different slots and per-slot draft() reads garbage. v2 of the feature-buffer fix (b6b96bb) closes the single-slot lifecycle holes but does not address the cross-slot stomping that has been latent since day one. Until target_features is keyed per seq_id, refuse to start the server when DFlash is enabled with --parallel > 1. Fail fast in load_model() before we pay the cost of loading the draft model. Follow-up to review on PR #53.
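A fail-fast guard of this kind could look roughly like the following. This is a hedged sketch: `server_params_model` and `check_dflash_parallel` are hypothetical names rather than the server's actual parameter struct or the code added to load_model(), and the error text is invented for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical subset of the server parameters relevant to the guard.
struct server_params_model {
    bool dflash;     // DFlash speculative decoding requested
    int  n_parallel; // number of slots (--parallel)
};

// Returns an empty string when the configuration is acceptable, otherwise a
// message explaining why startup should be refused. Intended to run before
// any draft model is loaded.
std::string check_dflash_parallel(const server_params_model & p) {
    assert(p.n_parallel >= 1);
    if (p.dflash && p.n_parallel > 1) {
        return "DFlash requires --parallel 1: target_features is not keyed per seq_id";
    }
    return "";
}
```

Returning a message instead of aborting keeps the check testable. The caller would decide how to surface the refusal.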

…llback drift

marksverdhei · 2026-05-27T23:52:46Z

DFlash Speculative Decoding v2 Hot-Patch Deployed & Verified

I have pushed the v2 hot-patch fixes to the feat/dflash-integration branch.

Fixes Included:

Slot-Reuse NaN Logits: Forced GPU compute scheduler re-reservation in common_speculative_impl_dflash::begin() via llama_set_dflash_need_reserve() to prevent uninitialized memory cascades when reusing slots.
Split-Prefill Segfault: Modified llama_context::decode() to only clear the target features buffer when evaluating a fresh prompt (a token with pos == 0). The buffer now safely accumulates features across consecutive prefill chunks (e.g. prompt checkpoints or prompts larger than n_batch=512). The target features buffer is cleared at the end of draft() once consumed.
Rollback Drift/Mismatch: Added dynamic rollback detection at the start of draft(). If prompt size has decreased (n < dflash_n_past), we resize accumulated_ctx and update dflash_n_past to match the target's KV cache rollback position exactly.
Pointer Safety: Added validation to check batch_inp.pos != nullptr before iterating, resolving segmentation faults during model warmups/empty runs.
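The rollback detection described in the fixes above can be sketched as follows. This is illustrative only: `drafter_state` and `sync_after_rollback` are hypothetical names, and the sketch assumes accumulated_ctx stores one row of n_feat floats per already-consumed token.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical drafter-side state; assumes one feature row per past token.
struct drafter_state {
    int64_t dflash_n_past = 0;          // tokens the drafter has consumed
    int64_t n_feat        = 0;          // floats per token row
    std::vector<float> accumulated_ctx; // dflash_n_past * n_feat floats
};

// Called at the start of a draft with the current prompt size n.
// Returns true when a rollback was detected and the state was truncated.
bool sync_after_rollback(drafter_state & st, int64_t n) {
    assert(n >= 0 && st.n_feat > 0);
    if (n >= st.dflash_n_past) {
        return false; // no rollback: the prompt grew or stayed the same
    }
    // Drop the rows for positions the target rolled back past.
    st.accumulated_ctx.resize(static_cast<std::size_t>(n * st.n_feat));
    st.dflash_n_past = n;
    return true;
}
```

With dflash_n_past = 10 and a rollback to n = 7, the accumulated context shrinks from 10 rows to 7 and the counter follows. A later draft then computes n_new from the rolled-back position rather than a stale one.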

Verification Results on Titan Pod:

Smoke Test (5x short prompts): PASS (55 / 735 accepts = 7.48% accept rate, 0% NaN rate).
Heierchat-Shape Streaming Test (>512 tokens history): PASS (200 OK, streamed complete response content with positive draft acceptance and no NaNs; verified successfully across consecutive runs with the child process remaining alive and healthy).

Documents the four follow-up commits after Gate C landed (770fed4, b6b96bb, fbefb96, 327f947): multi-ubatch prefill APPEND, target_features lifetime fixed to a single decode call, --parallel > 1 startup refusal, and the slot-reuse / split-prefill / rollback patches.

ggml_graph_get_tensor(gf, "inp_pos_full") had zero matching set-name sites across src/, tools/, examples/, common/. The if(pos_full) block always no-op'd. Verified no behavior change: libllama.so + llama-server rebuild clean against build-cuda.

marksverdhei · 2026-06-04T15:11:40Z

Superseded by the new PR — feat/dflash-integration squashed and rebased onto the post-rewrite ht. See #62.

* scripts(dflash): Round-12 target-precision bench + parity scaffold + gguf guard

Three additive scripts for the DFlash accept-rate investigation (Round-12), none touching tracked source so they sit cleanly alongside the PR #53 squash:

- gguf-meta.py: numpy-free GGUF header reader with --check-instruct, which refuses base-fine-tune and truncated/stub GGUFs. Prevents the base-vs-instruct confound (an -it-trained DFlash drafter benched against a base target).
- bench-dflash-target-sweep.sh: sweeps the TARGET quant (drafter fixed) to test whether target-side quant noise off the drafter's bf16 training distribution drives the 8% vs ~21% accept gap. Accept recomputed from raw n_accept/n_drafted counts; mean +/- sample stddev over N runs; REAL(>1sigma)/within-noise deltas.
- dflash-logit-parity.py: scaffold for FORWARD logit parity vs the z-lab PyTorch drafter (Round-7b only did weight parity). Constants read data-driven from the drafter config.json; reference forward marked TODO(zlab) pending the z-lab modeling code (HF repo ships weights only).

* scripts(dflash): gguf-meta --check-instruct rejects truncated tensor data

The guard validated the GGUF header but not that the tensor DATA was present, so a file truncated mid-write (valid header, missing weights) passed --check-instruct and would have been benched, loading garbage or crashing mid-run. Caught empirically: the corrupt gemma-4-31B-it-Q5_K_M.gguf (1.5GB, header intact) slipped through.

read_meta() now walks the tensor-info section, computes the minimum file size implied by the tensor offsets + alignment, and sets _data_complete. --check-instruct rejects when actual size < implied minimum. Same failure class as the HF-xet silent shard drop the download step hit.

Verified: corrupt Q5 (1.5GB < 21.7GB) REFUSED; Q8_0/BF16/Q4_K_M/IQ4_XS all complete and ACCEPT.
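The truncation check can be sketched in C++ as below. This is not the gguf-meta.py implementation: `tensor_info_model`, `implied_min_file_size`, and `data_complete` are hypothetical names. It assumes the usual GGUF layout, where tensor offsets are relative to a data section that starts at the header end rounded up to the alignment, and that each tensor's byte size is already known.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-tensor record: offset relative to the data section start,
// plus the tensor's size in bytes (derived from shape and type elsewhere).
struct tensor_info_model {
    uint64_t offset;
    uint64_t n_bytes;
};

// Round x up to the next multiple of a.
uint64_t align_up(uint64_t x, uint64_t a) {
    assert(a > 0);
    return (x + a - 1) / a * a;
}

// Smallest file size that can contain every tensor's data.
uint64_t implied_min_file_size(uint64_t header_end, uint64_t alignment,
                               const std::vector<tensor_info_model> & tensors) {
    const uint64_t data_start = align_up(header_end, alignment);
    uint64_t data_end = 0; // furthest byte any tensor reaches, relative to data_start
    for (const tensor_info_model & t : tensors) {
        data_end = std::max(data_end, t.offset + t.n_bytes);
    }
    return data_start + data_end;
}

// True when the file on disk is large enough to hold all tensor data.
bool data_complete(uint64_t actual_size, uint64_t header_end, uint64_t alignment,
                   const std::vector<tensor_info_model> & tensors) {
    return actual_size >= implied_min_file_size(header_end, alignment, tensors);
}
```

A file whose header parses cleanly but whose size falls below this implied minimum is missing weight data, which is the case the guard now refuses.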

marksverdhei and others added 30 commits May 22, 2026 03:47

fix(dflash): restore clean master files, reapply DFlash surgically

b7762d0

Resetting llama-context.cpp/h, llama-model.cpp/h, llama-graph.cpp, models.h, model-saver.cpp to master. The squash-merge generated too many stale patch residues. Will reapply DFlash additions cleanly.

docs(dflash): mark SWA divergence resolved (4b10869)

364db5c

Per-layer SWA mask now implemented; bench-neutral at ctx<=2048 but correctness-correct for future ctx expansion. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

marksverdhei added 10 commits May 27, 2026 15:25

docs(dflash): Round-9 — prompt-cache + DFlash NaN bug + fix (d7a88fd)

ed4ec44

Captures the root cause, fix, and deferred proper-fix paths so future sessions can pick up the cleaner cache-features integration without re-deriving the diagnosis.

fix(dflash): resolve slot-reuse NaNs, split-prefill segfaults, and ro…

327f947

…llback drift

marksverdhei added 2 commits May 31, 2026 01:34

marksverdhei mentioned this pull request Jun 4, 2026

sync(master): absorb 544 upstream commits — per-arch refactor + Gemma4 12B #59

Merged

marksverdhei force-pushed the ht branch from 3f6cc57 to 5b83d69 Compare June 4, 2026 14:40

marksverdhei mentioned this pull request Jun 4, 2026

feat(dflash): integrate DFlash block-diffusion speculative decoder (rebased on post-rewrite ht) #62

Merged

marksverdhei closed this Jun 4, 2026

marksverdhei deleted the feat/dflash-integration branch June 4, 2026 17:37

marksverdhei wants to merge 42 commits into ht from feat/dflash-integration
