Conversation
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

/rerun-test test/registered/spec/dflash/test_dflash.py

✅

What optimizations were made on top of PR #20547? PCG?
I rewrote the fused KV helper, added some new Triton ops, removed some syncs, etc. PCG already existed; I did not add it.
|
I am investigating accept-length degradations for both the v1 and v2 paths that appear in this PR but not in #20547.
|
The accept-length degradation has been fixed; it was a RoPE config handling issue introduced when the transformers version got bumped.
|
So I realized we can carry reserved KV allocation metadata through the overlap draft state and let next-step prep use the prepared allocation watermark; that gets rid of a scheduling bubble, which helps low concurrency a lot. For correctness, scheduler-output processing applies the request watermark monotonically later. prior v2 baseline: decoupling:
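A minimal sketch of the monotonic-watermark idea described above. All names here (`OverlapDraftState`, `kv_alloc_watermark`, `apply_watermark`) are hypothetical illustrations, not the PR's actual code:

```python
from dataclasses import dataclass


@dataclass
class OverlapDraftState:
    """Hypothetical state carried across overlapped draft steps.

    kv_alloc_watermark tracks the highest KV-cache allocation already
    reserved for this request, so next-step prep can proceed without
    waiting for scheduler-output processing to catch up.
    """

    kv_alloc_watermark: int = 0


def apply_watermark(state: OverlapDraftState, reported: int) -> int:
    """Apply a reported watermark monotonically: it may only move forward.

    Scheduler-output processing can run after next-step prep has already
    consumed the reserved allocation, so a stale (smaller) report must
    never roll the watermark back.
    """
    state.kv_alloc_watermark = max(state.kv_alloc_watermark, reported)
    return state.kv_alloc_watermark
```

The `max` is what makes late, out-of-order scheduler updates safe: they can only confirm or extend the reservation, never shrink it.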
|
@dcw02 I encountered an error:
|
@dcw02 Great work! How do Dflash and Eagle3 stack up against each other in terms of performance? Do you have any current data on this? |
|
@ggg-s you can either disable radix cache or set |
|
@liusy58 In my testing, DFlash is faster for my use cases, but both can be very good depending on how well you train the draft models.
|
@tugot17 yes, we can merge it after spec v2 dflash is merged. thanks for your contribution! |
|
@dcw02 Thank you for your reply. Can we chat on slack? |
|
@dcw02 |
|
@dcw02 Could you please resolve these merge conflicts? |
|
@liusy58 fixed merge conflicts
force-pushed from 8ae7dd3 to 9893ef8
|
I will put up separate PRs for the draft SWA layers and Gemma 4 support so they can be merged in first for v1.
…roject#23000 Cherry-picked the two files needed for smcsd's DFlash direct-load path:
- python/sglang/srt/models/dflash.py (DFlashDraftModel + DFlashDecoderLayer)
- python/sglang/srt/speculative/dflash_utils.py (helpers used by the model)

Copied from sglang upstream PR refs/pull/23000/head, which is the canonical implementation of DFlash speculative decoding referenced by checkpoints like z-lab/Qwen3.6-27B-DFlash. Adding the model class to our branch lets smcsd's _init_dflash_direct load DFlash drafts directly via sglang's class registry instead of transformers' trust_remote_code (which would 404 on dflash.py). The other DFlash files in PR sgl-project#23000 (dflash_worker, dflash_info, dflash_accept_bonus, etc.) are sglang-side speculative decoding scaffolding not used by smcsd's SMC-DFlash worker.
|
hi @dcw02 Is the current PR compatible with DFLASH + FlashInfer + mixed batches? |
I haven't tested that myself, so I'm unsure.
|
hi @dcw02 Can the current PCG be used? |
class TestDFlashServerSpecV2(TestDFlashServerBase):
    spec_v2 = True
|
|
@unittest.skip
qq: why do we need to skip this?
@@ -26,6 +28,8 @@ class TestDFlashServerBase(CustomTestCase, MatchedStopMixin, GSM8KMixin):
    attention_backend = "flashinfer"
qq: Does dflash only support flashinfer?
@@ -110,6 +97,23 @@ def _lazy_init_buf(self, draft_input: EagleDraftInput):
        device=self.device,
    )
|
|
if self.spec_algo.is_dflash():
nit: I prefer adding a more general function (something like need_topk) instead of checking whether it's dflash here. What do you think?
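A sketch of the reviewer's suggestion. `need_topk` is the reviewer's proposed name; the `SpecAlgo` enum and its members here are assumptions for illustration, not sglang's actual types:

```python
from enum import Enum, auto


class SpecAlgo(Enum):
    """Hypothetical enum of speculative decoding algorithms."""

    EAGLE = auto()
    EAGLE3 = auto()
    DFLASH = auto()

    def is_dflash(self) -> bool:
        return self is SpecAlgo.DFLASH

    def need_topk(self) -> bool:
        # Capability check instead of an algorithm-identity check:
        # a future algorithm that also needs the top-k buffers only
        # has to be added here, not at every call site.
        return self in (SpecAlgo.DFLASH,)
```

The call site then reads `if self.spec_algo.need_topk():` instead of `if self.spec_algo.is_dflash():`, which states *why* the branch exists rather than *which* algorithm triggers it.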
logger.warning(
    "Overlap scheduler is disabled when using DFLASH speculative decoding (spec v2 is not supported yet)."
)
if envs.SGLANG_ENABLE_SPEC_V2.get():
spec v2 is enabled by default; the logic here may need to change.
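One hedged way to restructure the check so it behaves correctly when the flag defaults to on. `resolve_spec_version` and the env-reading shape are hypothetical sketches, not sglang's actual `envs` API; the assumption is that `SGLANG_ENABLE_SPEC_V2` unset means enabled:

```python
import os


def resolve_spec_version(algo_is_dflash: bool, env=None) -> int:
    """Pick the spec version, downgrading DFlash to v1 until v2 lands.

    Assumes SGLANG_ENABLE_SPEC_V2 defaults to enabled when unset, so a
    plain "is the flag set?" check no longer identifies the default path;
    only an explicit "0" disables v2.
    """
    env = os.environ if env is None else env
    want_v2 = env.get("SGLANG_ENABLE_SPEC_V2", "1") != "0"
    if want_v2 and algo_is_dflash:
        # DFlash does not support spec v2 / overlap yet: actively fall
        # back to v1 rather than only warning when the flag was set.
        return 1
    return 2 if want_v2 else 1
```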
Motivation
Add spec v2 to DFlash
Benchmarks
Run on a GCP b200:8 node using a GSM8K sweep script: Qwen3-8B target, z-lab/Qwen3-8B-DFlash-b16 draft model, trtllm_mha target attention, fa4 draft attention, piecewise CUDA graphs on.
v1 performance
v2 performance

This spec v2 version also brings in some extra optimizations compared to #20547, which brought bs1 performance from 900 -> 1075 tok/s and bs32 from 12,300 -> 13,000 tok/s.

Benchmarking is done with this script using the command
SGLANG_ENABLE_SPEC_V2=1 SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 python benchmark/dflash/bench_dflash_gsm8k_sweep.py --skip-baseline --tp-sizes 1 --concurrencies 1,32 --attention-backends trtllm_mha --speculative-draft-attention-backend fa4
on 1x B200.

I removed mamba memory calculations, to add later once I figure out the best way to do that.
SGLANG_ENABLE_SPEC_V2=1 SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 python benchmark/dflash/bench_dflash_gsm8k_sweep.py --skip-baseline --tp-sizes 1 --concurrencies 1,32 --attention-backends trtllm_mha --speculative-draft-attention-backend fa4on 1xB200i removed mamba memory calculations to add later once i figure out the best way to do that