feat(dflash): integrate DFlash block-diffusion speculative decoder (rebased on post-rewrite ht)#62

Merged
marksverdhei merged 1 commit into `ht` from `sync/dflash-rebase` on Jun 4, 2026
@marksverdhei

Summary

Replaces #53. The previous `feat/dflash-integration` branch (155 commits across 11 Round-N iterations) is collapsed into a single feature commit and rebased onto the post-rewrite `ht` (commit 5b83d69, which sits on top of upstream master 0066404 with the per-arch class hierarchy now baseline).

What landed

One commit, 35 files changed, +2819/-27.

| Surface | Notes |
|---|---|
| `src/models/dflash.cpp` (new, 301 LOC) | `llama_model_dflash`: block-diffusion drafter, per-arch class |
| `src/llama-arch.{cpp,h}`, `src/llama-model{.cpp,.h,-loader.cpp}` | `LLM_ARCH_DFLASH` registered + tensor-load hooks + `std::array<int,5>` instantiation for the noise schedule |
| `src/llama-graph.{cpp,h}` | `llm_graph_input_dflash` ubatch wiring, SWA-aware kq mask |
| `src/llama-cparams.h`, `src/llama-hparams.h`, `src/llama-context.h`, `src/llama.cpp` | cparams flag + ctx wiring + dispatch entry |
| `src/models/{gemma4,llama,models}.{cpp,h}` | per-arch hooks so existing targets cooperate with the drafter |
| `common/speculative.{cpp,h}` | `COMMON_SPECULATIVE_TYPE_DFLASH`; `has_draft_simple` auto-enable is now dflash/eagle3/mtp-aware |
| `tools/server` | router integration: `last_used_ms`, `pick_any_resident`'s dflash flag, server-task/queue/server.cpp glue. `--parallel 1` gate kept per current Round-11 status (#107 tracks the per-seq_id path-B unblock for parallel >1) |
| `tests/test-dflash.cpp` + `scripts/*` + `HANDOFF.md` | smoke harness, bench, weight-compare, Round-11 handoff |
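
For readers unfamiliar with the "SWA-aware kq mask" entry above, here is a minimal sketch of a causal mask that applies a sliding window only on layers flagged as SWA. It is not the code from `src/llama-graph.cpp`: `can_attend`, `build_kq_mask`, and the convention that the queries occupy the last `n_q` positions are all assumptions made for illustration, and the sketch does not model how the drafter's noise tokens are masked against each other.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the fork's API): may a query at q_pos see a key at k_pos?
// n_swa == 0 means "no window", i.e. plain causal attention.
static bool can_attend(int32_t q_pos, int32_t k_pos, int32_t n_swa) {
    if (k_pos > q_pos) {
        return false; // causal: never look at future positions
    }
    if (n_swa > 0 && q_pos - k_pos >= n_swa) {
        return false; // sliding window: the key has fallen out of the window
    }
    return true;
}

// Builds a row-major n_q x n_kv mask (1 = visible, 0 = masked).
// Assumption: the n_q queries sit at the last n_q of the n_kv positions.
static std::vector<uint8_t> build_kq_mask(int32_t n_q, int32_t n_kv,
                                          bool layer_is_swa, int32_t n_swa) {
    assert(n_q >= 0 && n_q <= n_kv);
    std::vector<uint8_t> mask(size_t(n_q) * size_t(n_kv), 0);
    const int32_t window = layer_is_swa ? n_swa : 0; // per-layer routing
    for (int32_t i = 0; i < n_q; ++i) {
        const int32_t q_pos = n_kv - n_q + i;
        for (int32_t j = 0; j < n_kv; ++j) {
            mask[size_t(i) * size_t(n_kv) + size_t(j)] =
                can_attend(q_pos, j, window) ? 1 : 0;
        }
    }
    return mask;
}
```

Calling `build_kq_mask` once per layer with that layer's SWA flag gives a windowed mask on SWA layers and a plain causal mask elsewhere.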

Conflict resolutions vs the previous attempt

  • `swa_layers` → `is_swa_impl` rename (upstream restructured `llama_hparams`).
  • Single `remap_developer_role` in `server_chat_params` (both branches independently added it).
  • AGENTS.md kept the downstream-focused resource list and the AI-maintainer-tier note; dropped the upstream anti-AI policy block (doesn't apply to this fork).
  • print_mask is now templated upstream; kept the templated signature and dropped dflash's pre-template duplicate.

Verified

  • ✅ `cmake -B build -DGGML_CPU=ON -DLLAMA_BUILD_APP=ON` configures clean.
  • ✅ `cmake --build build --target llama-server` builds end-to-end (100%).
  • ✅ One harmless `-Wswitch` warning for `COMMON_SPECULATIVE_TYPE_DFLASH` in an unrelated switch — pre-existing on the dflash branch.

Follow-ups

Adds DFlash speculative decoding as a per-arch model class:

* src/models/dflash.cpp (new): `llama_model_dflash` — block-diffusion
  drafter that proposes N tokens per step against the target model's
  context-conditional embeddings. SWA-aware attention mask, n_block
  noise tokens layered against ctx tokens, per-layer is_swa_impl
  routing.
* src/llama-arch.{cpp,h}: LLM_ARCH_DFLASH registered (NeoX rope).
* src/llama-graph.{cpp,h}: `llm_graph_input_dflash` carries target
  hidden + masks, sets them on the graph each ubatch.
* src/llama-model{.cpp,.h,-loader.cpp}: arch dispatch, tensor loader
  hooks, std::array<int,5> instantiation for the noise schedule.
* src/llama-cparams.h + llama-hparams.h + llama-context.h +
  llama.cpp: cparams flag + ctx wiring + arch-dispatch entry.
* src/models/{gemma4,llama,models}.* : hooks so the existing target
  archs cooperate with the dflash drafter.
* common/speculative.cpp + speculative.h: COMMON_SPECULATIVE_TYPE_DFLASH
  config; auto-enable logic that gates 'draft' type on actual draft-
  model path presence while leaving room for dflash / eagle3 / mtp.
* tools/server: router-aware integration (last_used_ms field,
  pick_any_resident wired with dflash flag), server-task.h/cpp + queue
  + server.cpp glue, --parallel 1 gate left in place per current
  Round-11 status.
* tests/test-dflash.cpp + scripts/* + HANDOFF.md: smoke harness,
  bench, weight-compare, regression artifacts.
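
As background for "proposes N tokens per step", the sketch below shows the generic greedy verification step used in speculative decoding: the target model checks the drafted tokens and the longest matching prefix is kept. This is the textbook scheme under greedy sampling, not necessarily what `common/speculative.cpp` does for DFlash; `token_id` and `accept_draft` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using token_id = int32_t; // illustrative alias, not a llama.cpp type

// draft:  the N tokens proposed by the drafter for this step.
// target: the target model's greedy pick at each drafted position, plus one
//         extra pick after the last drafted token (N + 1 entries in total).
// Returns the tokens to append to the sequence: the accepted prefix of the
// draft followed by one token chosen by the target.
static std::vector<token_id> accept_draft(const std::vector<token_id> & draft,
                                          const std::vector<token_id> & target) {
    assert(target.size() == draft.size() + 1);
    std::vector<token_id> out;
    size_t i = 0;
    while (i < draft.size() && draft[i] == target[i]) {
        out.push_back(draft[i]); // target agrees, keep the drafted token
        ++i;
    }
    // Either the target's correction at the first mismatch, or the "bonus"
    // token after a fully accepted draft.
    out.push_back(target[i]);
    return out;
}
```

Each step therefore yields between 1 and N + 1 tokens for a single target-model pass, which is where the speedup comes from.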

Squashed from 42 commits on the feat/dflash-integration branch (the
previous Round-11 lifecycle iterations). Rebased onto the post-rewrite
ht baseline; conflicts resolved against the per-arch class hierarchy
that's now upstream-stock (renamed swa_layers → is_swa_impl, single
remap_developer_role definition in server_chat_params, has_draft_simple
auto-enable kept dflash-aware).
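
One way to read the `has_draft_simple` auto-enable rule is sketched below: an explicitly requested dflash/eagle3/mtp type is left alone, and the plain 'draft' type is only selected when a draft-model path is actually present. The enum, struct, and function names are hypothetical stand-ins, and the exact precedence in `common/speculative.cpp` may differ.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the real config types in common/speculative.h.
enum class spec_type { none, draft_simple, eagle3, mtp, dflash };

struct spec_params {
    spec_type   type = spec_type::none;
    std::string draft_model_path; // empty when no draft model was given
};

// Types that, in this sketch, are never overridden by the auto-enable logic.
static bool keeps_requested_type(spec_type t) {
    return t == spec_type::dflash || t == spec_type::eagle3 || t == spec_type::mtp;
}

static spec_type resolve_spec_type(const spec_params & p) {
    if (keeps_requested_type(p.type)) {
        return p.type;
    }
    // 'draft' is gated on an actual draft-model path being present.
    return p.draft_model_path.empty() ? spec_type::none : spec_type::draft_simple;
}
```

Under these assumptions, passing only a draft-model path enables the simple draft mode, while selecting dflash never gets silently replaced by it.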
@marksverdhei

Author

CI status (self-update from maintainer)

Pushed three CI fixes since the initial PR open:

  1. `-Werror=switch` on arm64: the switch over `common_speculative_type` at `common/speculative.cpp:1528` was missing `COMMON_SPECULATIVE_TYPE_DFLASH`. Added it to the same case bucket as DRAFT_SIMPLE/EAGLE3/MTP.
  2. The `hparams.swa_layers` → `hparams.is_swa_impl` rename was also needed in `tests/test-dflash.cpp` (had only caught it in `src/models/dflash.cpp`).
  3. `test-llama-archs` dflash skip: added `LLM_ARCH_DFLASH` to the existing `arch_supported()` skip list. DFlash is a drafter, not a target model: the synthetic-GGUF roundtrip test can't supply `dflash.target_layer_ids`, so the loader rightly throws. Same pattern as the existing `LLM_ARCH_GEMMA4` and `LLM_ARCH_DEEPSEEK2OCR` skips.
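
To illustrate the first fix, the sketch below shows why a new enumerator breaks a `-Werror=switch` build and how adding it to an existing case bucket resolves it. The enum and `uses_draft_context` are hypothetical; what the real switch at `common/speculative.cpp:1528` computes is not shown in this PR.

```cpp
#include <cassert>

// Hypothetical mirror of the enum; the real common_speculative_type differs.
enum spec_type_sketch {
    SPEC_NONE,
    SPEC_DRAFT_SIMPLE,
    SPEC_EAGLE3,
    SPEC_MTP,
    SPEC_DFLASH, // newly added enumerator
};

static bool uses_draft_context(spec_type_sketch t) {
    // No `default:` label on purpose: with -Wswitch (enabled by -Wall in GCC
    // and Clang) the compiler reports every enumerator the switch does not
    // handle, and -Werror=switch promotes that report to a build error.
    switch (t) {
        case SPEC_NONE:
            return false;
        case SPEC_DRAFT_SIMPLE:
        case SPEC_EAGLE3:
        case SPEC_MTP:
        case SPEC_DFLASH: // deleting this label reproduces the diagnostic
            return true;
    }
    return false; // only reached for values outside the enum
}
```

Leaving out `default:` is a deliberate trade-off: every future enumerator has to be placed in a bucket explicitly, so the compiler catches the omission before it becomes a silent fallthrough.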

Pre-existing failure not fixed in this PR

`test-quantize-fns` aborts with `*** stack smashing detected ***` on ubuntu-24.04-arm. This was already failing on the master-sync PR #59 (same job name), i.e. it's a TurboQ-on-arm64 regression introduced before this PR. Worth a separate task to track.

CI rerunning on 18d4e37b9.

@marksverdhei marksverdhei merged commit b0daec5 into ht Jun 4, 2026
1 of 9 checks passed
@marksverdhei marksverdhei deleted the sync/dflash-rebase branch June 4, 2026 17:01
marksverdhei added a commit that referenced this pull request Jun 12, 2026
The Backend & quantization table omitted two HT-specific speculative
decoding features that have shipped to ht:

- DFlash (LLM_ARCH_DFLASH, --spec-type dflash, custom CUDA kernels for
  partial-accept feature extraction) — landed via PR #62 (b0daec5),
  integrates the z-lab DFlash block-diffusion drafter against Gemma4
  31B targets.

- Gemma4 MTP (gemma4-assistant arch + --spec-type draft-mtp) — vendored
  via PR #93 (4c09765) ahead of upstream PR ggml-org#23398
  merge so the gemma-4-12b-qat-mtp preset can ship on titan. Marked
  with Tracked-upstream=ggml-org#23398 since it retires when that PR merges and
  flows through a normal master sync.

Found during a §7 documentation freshness sweep — the inventory exists
to be authoritative ("consult it before assuming a behaviour is
upstream stock" per AGENTS.md), so omissions defeat the purpose.

Docs-only, no code touched.

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>