DFlash (Block Diffusion for Flash Speculative Decoding) #21569

wsbagnsv1 · 2026-04-07T16:34:06Z

wsbagnsv1
Apr 7, 2026

Just saw this new repo on r/LocalLLaMA and it seems promising. They (z-lab) claim a decent speedup even over Eagle3, which is still a draft PR here. The idea is basically to use a diffusion draft model instead of an autoregressive one to speed things up. The stats vary, of course, but for Qwen3.5-35B-A3B, for example, they claim a 2-2.8x speedup over normal token generation, with an acceptance length that mostly beats even MTP. Other models do even better at times, like Qwen3.5 9B with up to 3.5x, as can be seen in their repo and on Hugging Face. I'm not sure whether the devs here think it's worth it right now, with Eagle3 already having a draft PR, so I've created this as a discussion just to bring it up (;
The repo is available here
and the already available draft models can be found here
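
For readers new to the terms used above, here is a minimal sketch of the draft-and-verify step that speculative decoding relies on, assuming greedy decoding. It is not code from DFlash or llama.cpp: `accept_draft_block` and `toy_target` are hypothetical names made up for illustration, and the drafted block is just a list of token IDs, however the drafter (autoregressive or diffusion-based) produced it. A real implementation would score all drafted positions in one batched target forward pass; the per-position calls here are only for readability.

```python
from typing import Callable, List, Sequence

Token = int


def accept_draft_block(
    context: Sequence[Token],
    draft_block: Sequence[Token],
    target_next_token: Callable[[Sequence[Token]], Token],
) -> List[Token]:
    """Greedy verification of one drafted block (illustrative only).

    Drafted tokens are kept until the first one that disagrees with the
    target model. The target's own token is then appended, so every
    verification step yields at least one new token.
    """
    accepted: List[Token] = []
    for drafted in draft_block:
        expected = target_next_token(list(context) + accepted)
        if drafted != expected:
            # First mismatch: take the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(drafted)
    # Whole block matched: the target still contributes one extra token.
    accepted.append(target_next_token(list(context) + accepted))
    return accepted


def toy_target(tokens: Sequence[Token]) -> Token:
    """Hypothetical stand-in for the target model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


partial = accept_draft_block([1, 2, 3], [4, 5, 9, 7], toy_target)  # [4, 5, 6]
full = accept_draft_block([1, 2, 3], [4, 5, 6, 7], toy_target)     # [4, 5, 6, 7, 8]
```

The length of the returned list is the per-step acceptance length: the more drafted tokens the target agrees with, the more tokens are produced per target pass, which is where the claimed speedup comes from.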

jschoch · 2026-04-07T16:48:07Z

jschoch
Apr 7, 2026

There are a bunch of MTP PRs, but they don't seem to be making much progress and seem to be Eagle3 focused. I'm also watching this diffusion draft approach closely.

1 reply

wsbagnsv1 Apr 7, 2026
Author

Yeah, that's why I didn't want to crowd the issues tab with another draft feature request, although this one seems pretty neat (;

ruixiang63 · 2026-04-19T03:10:31Z

ruixiang63
Apr 19, 2026

For anyone interested in DFlash, please take a look at this PR: #22105. It is now coming to llama.cpp :)

6 replies

ruixiang63 Apr 19, 2026

I’ve added DFlash support for Qwen3.5/3.6 MoE. However, please note that performance is currently not optimal. This is mainly due to the combination of the MoE architecture and the hybrid structure used in Qwen3.5/3.6 MoE, which is not yet well supported by llama.cpp. For more details on the performance limitations and underlying issues, please refer to my PR: #22105

lym000000 Apr 19, 2026

@ruixiang63 Thanks for the update!

Performance issue: noted on the MoE + hybrid architecture limitations with llama.cpp. Will keep an eye on #22105 for progress.

Suggestion: saw that the ggml team has been working on refactoring the codebase (e.g., CMake glob changes and bias tensor renaming). It might be easier for the ggml team if the PR is rebased on the latest upstream master to avoid conflicts with those structural changes.

lym000000 Apr 20, 2026

I don’t see it mentioned here yet, but there's also DDTree for accelerated speculative decoding on top of DFlash.

Page: https://liranringel.github.io/ddtree/
Code: https://github.com/liranringel/ddtree
Paper: https://arxiv.org/abs/2604.12989

ruixiang63 Apr 20, 2026

Yeah, thanks. My thought is to get the Eagle3 PR #18039 merged first, since the API refactoring also needs to happen there. My DFlash PR is based on Eagle3, so once that’s merged, I’ll rebase the DFlash PR on top of it.

SunYong0821 Apr 27, 2026

> Yeah, thanks. My thought is to get the Eagle3 PR #18039 merged first, since the API refactoring also needs to happen there. My DFlash PR is based on Eagle3, so once that’s merged, I’ll rebase the DFlash PR on top of it.

Does it also work for dense models, like Qwen3.6 27B?
