feat(dflash): integrate DFlash block-diffusion speculative decoder (rebased on post-rewrite ht) by marksverdhei · Pull Request #62 · heiervang-technologies/ht-llama.cpp

marksverdhei · 2026-06-04T15:11:38Z

Summary

Replaces #53. The previous `feat/dflash-integration` branch (155 commits across 11 Round-N iterations) is collapsed into a single feature commit and rebased onto the post-rewrite `ht` (commit 5b83d69, which sits on top of upstream master 0066404 with the per-arch class hierarchy now baseline).

What landed

One commit, 35 files changed, +2819/-27.

| Surface | Notes |
| --- | --- |
| `src/models/dflash.cpp` (new, 301 LOC) | `llama_model_dflash`: block-diffusion drafter, per-arch class |
| `src/llama-arch.{cpp,h}`, `src/llama-model{.cpp,.h,-loader.cpp}` | `LLM_ARCH_DFLASH` registered + tensor-load hooks + `std::array<int,5>` instantiation for noise schedule |
| `src/llama-graph.{cpp,h}` | `llm_graph_input_dflash` ubatch wiring, SWA-aware kq mask |
| `src/llama-cparams.h`, `src/llama-hparams.h`, `src/llama-context.h`, `src/llama.cpp` | cparams flag + ctx wiring + dispatch entry |
| `src/models/{gemma4,llama,models}.{cpp,h}` | per-arch hooks so existing targets cooperate with the drafter |
| `common/speculative.{cpp,h}` | `COMMON_SPECULATIVE_TYPE_DFLASH`; `has_draft_simple` auto-enable is now dflash/eagle3/mtp-aware |
| `tools/server` | router integration: `last_used_ms`, `pick_any_resident`'s dflash flag, server-task/queue/server.cpp glue. `--parallel 1` gate kept per current Round-11 status (#107 tracks the per-seq_id path-B unblock for parallel >1) |
| `tests/test-dflash.cpp` + `scripts/*` + `HANDOFF.md` | Smoke harness, bench, weight-compare, Round-11 handoff |
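
To make the "SWA-aware kq mask" row more concrete, here is a minimal sketch of how such a mask could be built for one draft block. It is not the code in `src/models/dflash.cpp`: the helper name `build_block_mask`, the layout (context positions followed by `n_block` noise positions), the window rule, and the choice to let noise tokens attend to each other in both directions are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from the PR.
// Rows are the n_block noise (draft) positions; columns are the n_ctx context
// positions followed by the n_block noise positions. true = may attend.
std::vector<std::vector<bool>> build_block_mask(size_t n_ctx, size_t n_block,
                                                bool is_swa, size_t n_swa) {
    assert(n_block > 0);
    std::vector<std::vector<bool>> mask(
        n_block, std::vector<bool>(n_ctx + n_block, false));
    for (size_t i = 0; i < n_block; ++i) {
        const size_t q_pos = n_ctx + i; // absolute position of this noise token
        for (size_t k = 0; k < n_ctx; ++k) {
            // Assumed window rule: on SWA layers a context key is visible only
            // if it lies fewer than n_swa positions behind the query.
            mask[i][k] = !is_swa || (q_pos - k < n_swa);
        }
        for (size_t j = 0; j < n_block; ++j) {
            // Assumption: noise tokens of one block see each other in both
            // directions, because the block is denoised jointly.
            mask[i][n_ctx + j] = true;
        }
    }
    return mask;
}
```

Under these assumptions, a full-attention layer would call the helper with `is_swa = false` and a sliding-window layer with `is_swa = true`, which is roughly the kind of per-layer routing the `is_swa_impl` flag suggests.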

Conflict resolutions vs the previous attempt

`swa_layers` → `is_swa_impl` rename (upstream restructured `llama_hparams`).
Single `remap_developer_role` in `server_chat_params` (both branches independently added it).
AGENTS.md kept the downstream-focused resource list and the AI-maintainer-tier note; dropped the upstream anti-AI policy block (doesn't apply to this fork).
print_mask is now templated upstream; kept the templated signature and dropped dflash's pre-template duplicate.

Verified

✅ `cmake -B build -DGGML_CPU=ON -DLLAMA_BUILD_APP=ON` configures clean.
✅ `cmake --build build --target llama-server` builds end-to-end (100%).
✅ One harmless `-Wswitch` warning for `COMMON_SPECULATIVE_TYPE_DFLASH` in an unrelated switch — pre-existing on the dflash branch.
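
For readers unfamiliar with `-Wswitch`: the compiler emits it when a `switch` over an enum has no `default` label and does not list every enumerator. The sketch below reproduces the situation with a stand-in enum; `spec_type` and `needs_draft_state` are hypothetical names, not the real `common_speculative_type` or any function in `common/speculative.cpp`.

```cpp
#include <cassert>

// Illustrative stand-in for common_speculative_type; the real enum has
// different members and values.
enum spec_type {
    SPEC_TYPE_NONE,
    SPEC_TYPE_DRAFT_SIMPLE,
    SPEC_TYPE_EAGLE3,
    SPEC_TYPE_MTP,
    SPEC_TYPE_DFLASH,
};

// Hypothetical function. With no default label, -Wswitch reports every
// enumerator that lacks a case, so a newly added SPEC_TYPE_DFLASH has to be
// listed explicitly.
bool needs_draft_state(spec_type t) {
    switch (t) {
        case SPEC_TYPE_NONE:
            return false;
        case SPEC_TYPE_DRAFT_SIMPLE:
        case SPEC_TYPE_EAGLE3:
        case SPEC_TYPE_MTP:
        case SPEC_TYPE_DFLASH: // remove this label and -Wswitch fires
            return true;
    }
    return false; // not reached for valid enumerators
}
```

The warning is harmless at default settings, but it becomes a build failure on any job that compiles with `-Werror=switch`.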

Follow-ups

CUDA build verification — this PR was CPU-only verified.
Round-11 lifecycle bug (Task #107): per-seq_id `target_features` for `--parallel` > 1 is still open.
Splitting into ~5 logical commits (Task #108): held. This PR is a single-commit squash for now to keep the rebase tractable; it can be split later if Markus prefers a 5-commit narrative.

Adds DFlash speculative decoding as a per-arch model class:

* src/models/dflash.cpp (new): `llama_model_dflash`, a block-diffusion drafter that proposes N tokens per step against the target model's context-conditional embeddings. SWA-aware attention mask, n_block noise tokens layered against ctx tokens, per-layer is_swa_impl routing.
* src/llama-arch.{cpp,h}: LLM_ARCH_DFLASH registered (NeoX rope).
* src/llama-graph.{cpp,h}: `llm_graph_input_dflash` carries target hidden + masks, sets them on the graph each ubatch.
* src/llama-model{.cpp,.h,-loader.cpp}: arch dispatch, tensor loader hooks, std::array<int,5> instantiation for the noise schedule.
* src/llama-cparams.h + llama-hparams.h + llama-context.h + llama.cpp: cparams flag + ctx wiring + arch-dispatch entry.
* src/models/{gemma4,llama,models}.*: hooks so the existing target archs cooperate with the dflash drafter.
* common/speculative.cpp + speculative.h: COMMON_SPECULATIVE_TYPE_DFLASH config; auto-enable logic that gates 'draft' type on actual draft-model path presence while leaving room for dflash / eagle3 / mtp.
* tools/server: router-aware integration (last_used_ms field, pick_any_resident wired with dflash flag), server-task.h/cpp + queue + server.cpp glue, --parallel 1 gate left in place per current Round-11 status.
* tests/test-dflash.cpp + scripts/* + HANDOFF.md: smoke harness, bench, weight-compare, regression artifacts.

Squashed from 42 commits on the feat/dflash-integration branch (the previous Round-11 lifecycle iterations). Rebased onto the post-rewrite ht baseline; conflicts resolved against the per-arch class hierarchy that's now upstream-stock (renamed swa_layers → is_swa_impl, single remap_developer_role definition in server_chat_params, has_draft_simple auto-enable kept dflash-aware).
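
The auto-enable rule described in the commit message (plain 'draft' only when a draft-model path is actually present, other drafter types left alone) could look roughly like the sketch below. Every name here (`spec_type`, `spec_params`, `resolve_spec_type`) is hypothetical and chosen for illustration; the actual `has_draft_simple` logic in `common/speculative.cpp` may be structured differently.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins, not the types used in common/speculative.h.
enum spec_type { SPEC_NONE, SPEC_DRAFT, SPEC_DFLASH, SPEC_EAGLE3, SPEC_MTP };

struct spec_params {
    spec_type   requested = SPEC_NONE; // type explicitly selected, if any
    std::string draft_model_path;      // empty when no draft model was given
};

// Sketch of the gating rule under the assumptions above.
spec_type resolve_spec_type(const spec_params & p) {
    // An explicit dflash / eagle3 / mtp selection is never overridden.
    if (p.requested == SPEC_DFLASH || p.requested == SPEC_EAGLE3 ||
        p.requested == SPEC_MTP) {
        return p.requested;
    }
    // Plain draft decoding is enabled only when a draft model path exists.
    return p.draft_model_path.empty() ? SPEC_NONE : SPEC_DRAFT;
}
```

The point of gating on the path rather than on a flag alone is that a configuration with no draft model cannot silently fall into the simple-draft code path.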

marksverdhei · 2026-06-04T16:03:32Z

CI status (self-update from maintainer)

Pushed three CI fixes since the initial PR open:

-Werror=switch on arm64: common/speculative.cpp:1528 switch over common_speculative_type was missing COMMON_SPECULATIVE_TYPE_DFLASH — added to the same case bucket as DRAFT_SIMPLE/EAGLE3/MTP.
hparams.swa_layers → hparams.is_swa_impl rename also needed in tests/test-dflash.cpp (had only caught it in src/models/dflash.cpp).
test-llama-archs dflash skip: added LLM_ARCH_DFLASH to the existing arch_supported() skip list. DFlash is a drafter, not a target model, so the synthetic-GGUF roundtrip test can't supply dflash.target_layer_ids and the loader rightly throws. Same pattern as the existing LLM_ARCH_GEMMA4 and LLM_ARCH_DEEPSEEK2OCR skips.
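
The interaction between the loader and the skip list can be sketched as follows. This is an illustration only: `arch_id`, `load_dflash_sketch`, and `arch_supported_sketch` are hypothetical names, and the real `arch_supported()` in the test and the real loader check may be written quite differently. The only details taken from the PR are the metadata key `dflash.target_layer_ids` and the three skipped architectures.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the llm_arch enum.
enum arch_id { ARCH_LLAMA, ARCH_GEMMA4, ARCH_DEEPSEEK2OCR, ARCH_DFLASH };

using metadata = std::map<std::string, std::string>;

// Sketch of a drafter loader that refuses to load without the key that
// describes which target layers it reads from.
void load_dflash_sketch(const metadata & kv) {
    if (kv.find("dflash.target_layer_ids") == kv.end()) {
        throw std::runtime_error("missing key: dflash.target_layer_ids");
    }
}

// Sketch of a skip list: architectures the synthetic roundtrip test cannot
// construct are reported as unsupported and skipped by the test.
bool arch_supported_sketch(arch_id arch) {
    static const std::array<arch_id, 3> skipped = {
        {ARCH_GEMMA4, ARCH_DEEPSEEK2OCR, ARCH_DFLASH}};
    return std::find(skipped.begin(), skipped.end(), arch) == skipped.end();
}
```

Skipping in the test, rather than relaxing the loader, keeps the loader strict for real drafter files while avoiding a guaranteed failure on synthetic ones.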

Pre-existing failure not fixed in this PR

test-quantize-fns aborts with *** stack smashing detected *** on ubuntu-24.04-arm. This was already failing on the master-sync PR #59 (same job name), i.e. it's a TurboQ-on-arm64 regression introduced before this PR. Worth a separate task to track.

CI rerunning on 18d4e37b9.

The Backend & quantization table omitted two HT-specific speculative decoding features that have shipped to ht:

- DFlash (LLM_ARCH_DFLASH, --spec-type dflash, custom CUDA kernels for partial-accept feature extraction): landed via PR #62 (b0daec5), integrates the z-lab DFlash block-diffusion drafter against Gemma4 31B targets.
- Gemma4 MTP (gemma4-assistant arch + --spec-type draft-mtp): vendored via PR #93 (4c09765) ahead of upstream PR ggml-org#23398 merge so the gemma-4-12b-qat-mtp preset can ship on titan. Marked with Tracked-upstream=ggml-org#23398 since it retires when that PR merges and flows through a normal master sync.

Found during a §7 documentation freshness sweep. The inventory exists to be authoritative ("consult it before assuming a behaviour is upstream stock" per AGENTS.md), so omissions defeat the purpose. Docs-only, no code touched.

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>

marksverdhei mentioned this pull request Jun 4, 2026

feat(dflash): complete DFlash speculative decoding integration #53

Closed

marksverdhei force-pushed the sync/dflash-rebase branch 2 times, most recently from 4c6873c to 88ad52a, on June 4, 2026 15:47

marksverdhei force-pushed the sync/dflash-rebase branch from 88ad52a to 18d4e37 on June 4, 2026 16:01

This was referenced Jun 4, 2026

fix(tests): TBQ block-size + tolerances after 128-block migration #63

Merged

fix(server): portable exit-code on subprocess-alive=false path (windows build break) #64

Merged

marksverdhei merged commit b0daec5 into ht Jun 4, 2026
1 of 9 checks passed

marksverdhei deleted the sync/dflash-rebase branch June 4, 2026 17:01

marksverdhei mentioned this pull request Jun 7, 2026

docs(readme): inventory DFlash + Gemma4 MTP under HT Fork Changes #96

Merged

3 tasks
