feat(dflash): integrate DFlash block-diffusion speculative decoder (rebased on post-rewrite ht) #62
Merged
4c6873c to 88ad52a
Adds DFlash speculative decoding as a per-arch model class:
* src/models/dflash.cpp (new): `llama_model_dflash` — block-diffusion
drafter that proposes N tokens per step against the target model's
context-conditional embeddings. SWA-aware attention mask, n_block
noise tokens layered against ctx tokens, per-layer is_swa_impl
routing.
* src/llama-arch.{cpp,h}: LLM_ARCH_DFLASH registered (NeoX rope).
* src/llama-graph.{cpp,h}: `llm_graph_input_dflash` carries target
hidden + masks, sets them on the graph each ubatch.
* src/llama-model{.cpp,.h,-loader.cpp}: arch dispatch, tensor loader
hooks, std::array<int,5> instantiation for the noise schedule.
* src/llama-cparams.h + llama-hparams.h + llama-context.h +
llama.cpp: cparams flag + ctx wiring + arch-dispatch entry.
* src/models/{gemma4,llama,models}.* : hooks so the existing target
archs cooperate with the dflash drafter.
* common/speculative.cpp + speculative.h: COMMON_SPECULATIVE_TYPE_DFLASH
config; auto-enable logic that gates 'draft' type on actual draft-
model path presence while leaving room for dflash / eagle3 / mtp.
* tools/server: router-aware integration (last_used_ms field,
pick_any_resident wired with dflash flag), server-task.h/cpp + queue
+ server.cpp glue, --parallel 1 gate left in place per current
Round-11 status.
* tests/test-dflash.cpp + scripts/* + HANDOFF.md: smoke harness,
bench, weight-compare, regression artifacts.
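For readers unfamiliar with how a block of N drafted tokens is consumed, the following is a minimal sketch of greedy verification as used in speculative decoding in general. It is not taken from this PR: `verify_block_greedy`, `verify_result` and `token_id` are hypothetical names, and it assumes the target model's greedy choice is already available for every drafted position plus one extra position.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using token_id = std::int32_t; // hypothetical alias, not the llama.cpp type

// Outcome of checking one drafted block against the target model.
struct verify_result {
    std::size_t           n_accepted; // drafted tokens the target agreed with
    std::vector<token_id> emitted;    // accepted prefix plus one target token
};

// draft:  the N tokens proposed by the drafter for this step.
// target: the target's greedy picks; target[i] is its choice for position i
//         given the context followed by draft[0..i). It holds N + 1 entries,
//         the last being the token after a fully accepted block.
inline verify_result verify_block_greedy(const std::vector<token_id> & draft,
                                         const std::vector<token_id> & target) {
    assert(target.size() == draft.size() + 1);

    verify_result res{0, {}};

    // Accept the longest prefix on which drafter and target agree.
    while (res.n_accepted < draft.size() &&
           draft[res.n_accepted] == target[res.n_accepted]) {
        res.emitted.push_back(draft[res.n_accepted]);
        ++res.n_accepted;
    }

    // At the first disagreement (or after a full accept) emit the target's own
    // token, so every step produces at least one token.
    res.emitted.push_back(target[res.n_accepted]);
    return res;
}
```

With a draft of `{5, 7, 9, 11}` and target picks `{5, 7, 8, 3, 2}`, the first two tokens are accepted and the step emits `{5, 7, 8}`. A partial accept like this is the case the drafter has to recover from on the next step.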
Squashed from 42 commits on the feat/dflash-integration branch (the
previous Round-11 lifecycle iterations). Rebased onto the post-rewrite
ht baseline; conflicts resolved against the per-arch class hierarchy
that's now upstream-stock (renamed swa_layers → is_swa_impl, single
remap_developer_role definition in server_chat_params, has_draft_simple
auto-enable kept dflash-aware).
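The auto-enable behaviour described above (plain 'draft' gated on a draft-model path, other types left alone) might look roughly like the sketch below. Everything here is an assumption for illustration: `spec_type`, `spec_config` and `resolve_spec_type` are hypothetical stand-ins, not the actual declarations in common/speculative.h.

```cpp
#include <string>

// Hypothetical stand-in for the speculative type enum.
enum class spec_type { none, draft, dflash, eagle3, mtp };

// Hypothetical stand-in for the relevant configuration fields.
struct spec_config {
    spec_type   requested = spec_type::none; // type chosen explicitly, if any
    std::string draft_model_path;            // empty when no draft model given
};

// Decide which speculative type ends up active.
inline spec_type resolve_spec_type(const spec_config & cfg) {
    const bool has_draft_model = !cfg.draft_model_path.empty();

    switch (cfg.requested) {
        case spec_type::none:
        case spec_type::draft:
            // Plain draft speculation is enabled only when a draft model
            // path is actually present, whether requested or auto-detected.
            return has_draft_model ? spec_type::draft : spec_type::none;
        default:
            // dflash / eagle3 / mtp were requested explicitly, so the
            // auto-enable logic must not downgrade them to plain 'draft'.
            return cfg.requested;
    }
}
```

The point of the design is that an explicit non-draft request passes through untouched even when a draft-model path is set, which is one way to read "kept dflash-aware".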
88ad52a to 18d4e37
Author
CI status (self-update from maintainer)
Pushed two CI fixes since the initial PR open:
Pre-existing failure not fixed in this PR
CI rerunning on |
This was referenced Jun 4, 2026
marksverdhei added a commit that referenced this pull request Jun 12, 2026
The Backend & quantization table omitted two HT-specific speculative decoding features that have shipped to ht:

- DFlash (LLM_ARCH_DFLASH, --spec-type dflash, custom CUDA kernels for partial-accept feature extraction): landed via PR #62 (b0daec5), integrates the z-lab DFlash block-diffusion drafter against Gemma4 31B targets.
- Gemma4 MTP (gemma4-assistant arch + --spec-type draft-mtp): vendored via PR #93 (4c09765) ahead of upstream PR ggml-org#23398 merge so the gemma-4-12b-qat-mtp preset can ship on titan. Marked with Tracked-upstream=ggml-org#23398 since it retires when that PR merges and flows through a normal master sync.

Found during a §7 documentation freshness sweep. The inventory exists to be authoritative ("consult it before assuming a behaviour is upstream stock" per AGENTS.md), so omissions defeat the purpose.

Docs-only, no code touched.

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
Summary
Replaces #53. The previous `feat/dflash-integration` branch (155 commits across 11 Round-N iterations) is collapsed into a single feature commit and rebased onto the post-rewrite `ht` (commit 5b83d69, which sits on top of upstream master 0066404 with the per-arch class hierarchy now baseline).
What landed
One commit, 35 files changed, +2819/-27.
Conflict resolutions vs the previous attempt
Verified
Follow-ups