
[Feature] Add spec v2 (overlap scheduling) to DFlash speculative decoding support #20547

Closed
dcw02 wants to merge 74 commits into sgl-project:main from modal-labs:dflash_v2

Conversation

@dcw02
Collaborator

@dcw02 dcw02 commented Mar 13, 2026

Motivation

Adds the spec v2 path for DFlash. This should be merged after #16818.

TLDR
B200, GSM8K, Qwen3-8B, TP size 1, concurrency 32, max new tokens 2k, greedy decoding:
9,688.26 tok/s -> 12,360.49 tok/s

Modifications

Adds v2 worker and related files
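
For intuition, here is a minimal, framework-free sketch of the overlap-scheduling idea: while the GPU verifies step N, the CPU already plans step N+1, so per-step CPU overhead hides behind GPU execution. This is not SGLang's actual v2 worker; plan_draft and verify_on_gpu are hypothetical stand-ins.

import time
from concurrent.futures import ThreadPoolExecutor

def plan_draft(step):
    # CPU-side work: build draft inputs / attention metadata for `step`.
    time.sleep(0.002)
    return f"plan-{step}"

def verify_on_gpu(plan):
    # GPU-side work: target forward pass + token verification.
    time.sleep(0.005)
    return f"result-{plan}"

def run_overlapped(num_steps):
    results = []
    with ThreadPoolExecutor(max_workers=1) as planner:
        next_plan = planner.submit(plan_draft, 0)  # plan step 0 up front
        for step in range(num_steps):
            plan = next_plan.result()  # wait for the current step's plan
            if step + 1 < num_steps:
                # Kick off planning for step+1 *before* verifying this step,
                # so planning overlaps with (is hidden behind) verification.
                next_plan = planner.submit(plan_draft, step + 1)
            results.append(verify_on_gpu(plan))
    return results

print(run_overlapped(4))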

Accuracy and Benchmarks

Tested on a GCP B200 machine.

Commands:

# regular v1
python benchmark/dflash/bench_dflash_gsm8k_sweep.py --tp-sizes 1 --concurrencies 32 --attention-backends trtllm_mha --speculative-draft-attention-backend fa4 --page-size 64 --skip-baseline

# overlap scheduling (spec v2)
SGLANG_ENABLE_SPEC_V2=1 SGLANG_ENABLE_DFLASH_SPEC_V2=1 SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 python benchmark/dflash/bench_dflash_gsm8k_sweep.py --tp-sizes 1 --concurrencies 32 --attention-backends trtllm_mha --speculative-draft-attention-backend fa4 --page-size 64 --skip-baseline

v1 performance

=== DFLASH GSM8K Sweep Summary ===
target_model=Qwen/Qwen3-8B
draft_model=z-lab/Qwen3-8B-DFlash-b16
max_new_tokens=2048
sampling=temperature:0.0, top_p:1.0, top_k:1
attention_backends=trtllm_mha
speculative_draft_attention_backend=fa4
speculative_dflash_draft_window_size=None
tp_sizes=1
concurrencies=32
questions_per_concurrency_base=128
device_sm=100
skip_baseline=True

=== Backend: trtllm_mha ===

Baseline output tok/s
tp\conc   32
-------  ---
      1  N/A

Baseline accuracy
tp\conc   32
-------  ---
      1  N/A

DFLASH output tok/s
tp\conc        32
-------  --------
      1  9,688.26

DFLASH accuracy
tp\conc     32
-------  -----
      1  0.850

Speedup (DFLASH / baseline)
tp\conc   32
-------  ---
      1  N/A

DFLASH acceptance length (mean spec_accept_length)
tp\conc     32
-------  -----
      1  6.470

overlap scheduling (spec v2) performance

=== DFLASH GSM8K Sweep Summary ===
target_model=Qwen/Qwen3-8B
draft_model=z-lab/Qwen3-8B-DFlash-b16
max_new_tokens=2048
sampling=temperature:0.0, top_p:1.0, top_k:1
attention_backends=trtllm_mha
speculative_draft_attention_backend=fa4
speculative_dflash_draft_window_size=None
tp_sizes=1
concurrencies=32
questions_per_concurrency_base=128
device_sm=100
skip_baseline=True

=== Backend: trtllm_mha ===

Baseline output tok/s
tp\conc   32
-------  ---
      1  N/A

Baseline accuracy
tp\conc   32
-------  ---
      1  N/A

DFLASH output tok/s
tp\conc         32
-------  ---------
      1  12,360.49

DFLASH accuracy
tp\conc     32
-------  -----
      1  0.850

Speedup (DFLASH / baseline)
tp\conc   32
-------  ---
      1  N/A

DFLASH acceptance length (mean spec_accept_length)
tp\conc     32
-------  -----
      1  6.467
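
Since skip_baseline=True, both sweeps report N/A for the DFLASH/baseline speedup; the v2-over-v1 ratio from the two tables above works out to about 1.28x (simple arithmetic, not part of the sweep output):

# v2-over-v1 output throughput ratio from the two sweeps above
v1_tok_s, v2_tok_s = 9688.26, 12360.49
print(f"{v2_tok_s / v1_tok_s:.3f}x")  # ~1.276x, i.e. ~27.6% more output tok/s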

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@dcw02 dcw02 requested review from ch-wan and fzyzcjy as code owners April 7, 2026 23:31
@ggg-s

ggg-s commented Apr 9, 2026

@dcw02 Does it currently support PCG?

@dcw02
Collaborator Author

dcw02 commented Apr 9, 2026

@dcw02 Does it currently support PCG?

I've enabled it without issues using --enforce-piecewise-cuda-graph.
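
For reference, a hedged sketch of how that flag combines with the v2 env vars (model paths are placeholders; all flags appear elsewhere in this thread):

SGLANG_ENABLE_SPEC_V2=1 SGLANG_ENABLE_DFLASH_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path <target-model> \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path <dflash-draft-model> \
  --enforce-piecewise-cuda-graph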

@dcw02
Collaborator Author

dcw02 commented Apr 9, 2026

I'm closing this PR and reopening it soon from another branch; I have some extra improvements.

@ggg-s

ggg-s commented Apr 13, 2026

@dcw02 Have you observed any measurable performance improvements after enabling --enforce-piecewise-cuda-graph?

@moehanabi

@dcw02 Hi! Thanks for your great work!
I found a lower accept length than spec v1. Do you get the same result?

@dcw02
Collaborator Author

dcw02 commented Apr 14, 2026

@dcw02 Hi! Thanks for your great work! I found a lower accept length than spec v1. Do you get the same result?

I haven't found a lower accept length than spec v1 in greedy decoding; there might be some minute differences, since we have some extra optimizations. If your requests use temperature > 0, there will also be run-to-run accept length differences.
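
To illustrate the last point, a minimal sketch of generic speculative verification (not DFlash's exact implementation): under greedy decoding the accept length is deterministic, while with temperature > 0 each draft token x is accepted with probability min(1, p(x)/q(x)), so the accept length varies run to run.

import random

def greedy_accept_len(draft_tokens, target_argmax):
    # Greedy: accept draft tokens while they match the target's argmax.
    # Deterministic: same inputs always give the same accept length.
    n = 0
    for d, t in zip(draft_tokens, target_argmax):
        if d != t:
            break
        n += 1
    return n

def sampled_accept_len(draft_tokens, p_target, q_draft, rng):
    # Temperature > 0: accept token x with probability min(1, p(x)/q(x)),
    # so the accept length is a random variable even for a fixed prompt.
    # (Real speculative sampling also resamples a bonus token on rejection,
    # omitted here.)
    n = 0
    for x in draft_tokens:
        if rng.random() < min(1.0, p_target[x] / q_draft[x]):
            n += 1
        else:
            break
    return n

p = {0: 0.5, 1: 0.3, 2: 0.2}  # toy target distribution
q = {0: 0.4, 1: 0.4, 2: 0.2}  # toy draft distribution
rng = random.Random(0)
print(greedy_accept_len([0, 1, 2], [0, 1, 0]))                       # always 2
print([sampled_accept_len([0, 1, 0], p, q, rng) for _ in range(5)])  # varies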

@dcw02
Collaborator Author

dcw02 commented Apr 14, 2026

@dcw02 Have you observed any measurable performance improvements after enabling --enforce-piecewise-cuda-graph?

Yes, it improves performance, some models more than others. It especially improves Qwen3.5.

@moehanabi

@dcw02 Hi! Thanks for your great work! I found a lower accept length than spec v1. Do you get the same result?

I haven't found a lower accept length than spec v1 in greedy decoding; there might be some minute differences, since we have some extra optimizations. If your requests use temperature > 0, there will also be run-to-run accept length differences.

Thanks for your reply.
That's strange, because I tested with temperature = 0 and on your dflash_v2_experimental branch.

@dcw02
Collaborator Author

dcw02 commented Apr 15, 2026

@dcw02 Hi! Thanks for your great work! I found a lower accept length than spec v1. Do you get the same result?

I haven't found a lower accept length than spec v1 in greedy decoding; there might be some minute differences, since we have some extra optimizations. If your requests use temperature > 0, there will also be run-to-run accept length differences.

Thanks for your reply. That's strange, because I tested with temperature = 0 and on your dflash_v2_experimental branch.

Can you give me a repro script?

@moehanabi

moehanabi commented Apr 16, 2026

@dcw02 Hi! Thanks for your great work! I found a lower accept length than spec v1. Do you get the same result?

I haven't found a lower accept length than spec v1 in greedy decoding; there might be some minute differences, since we have some extra optimizations. If your requests use temperature > 0, there will also be run-to-run accept length differences.

Thanks for your reply. That's strange, because I tested with temperature = 0 and on your dflash_v2_experimental branch.

Can you give me a repro script?

export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path "/workspace/data/model/qwen3-30b-a3b-fp8" \
  --port "8001" --host "0.0.0.0" \
  --tp-size "1" --dp-size "1" --base-gpu-id "0" \
  --mem-fraction-static "0.7" \
  --attention-backend "fa3" --sampling-backend "flashinfer" \
  --chunked-prefill-size "-1" --max-prefill-tokens "16384" \
  --max-running-requests 128 --stream-interval 1 --enable-metrics \
  --speculative-algorithm DFLASH \
  --speculative-num-steps 15 --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 16 \
  --speculative-draft-model-path /workspace/data/model/qwen3_dflash \
  --enforce-piecewise-cuda-graph --piecewise-cuda-graph-max-tokens 8000 \
  --context-length 65535 \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 32}' \
  --nccl-port 8400 --cuda-graph-max-bs 128

at this commit

request:

{"messages": [{"content": "prompt", "role": "user"}], "stream": true, "temperature": 0, "top_p": 0.9, "penalty_score": 1.1, "chat_template_kwargs": {"enable_thinking": false}, "max_tokens": 512, "stream_options": {"include_usage": true}}

PS: I trained the DFlash model myself with SpecForge, because my use case is to generate only JSON of about 30 tokens. The following is my DFlash model config:

{
  "architectures": [
    "DFlashDraftModel"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoModel": "dflash.DFlashDraftModel"
  },
  "block_size": 16,
  "bos_token_id": 151643,
  "dflash_config": {
    "mask_token_id": 151669,
    "target_layer_ids": [1, 12, 23, 34, 45]
  },
  "dtype": "bfloat16",
  "eos_token_id": 151645,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 6144,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 131072,
  "max_window_layers": 8,
  "model_type": "qwen3",
  "num_attention_heads": 32,
  "num_hidden_layers": 8,
  "num_key_value_heads": 4,
  "num_target_layers": 48,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}

@dcw02
Collaborator Author

dcw02 commented Apr 16, 2026

@moehanabi Can you turn off piecewise CUDA graphs and retest? That feature is experimental in general, and I sometimes noticed correctness issues with it enabled in my testing.

@dcw02
Collaborator Author

dcw02 commented Apr 16, 2026

Cleaned up and moved to #23000.

@dcw02 dcw02 closed this Apr 16, 2026
@moehanabi

@moehanabi Can you turn off piecewise CUDA graphs and retest? That feature is experimental in general, and I sometimes noticed correctness issues with it enabled in my testing.

Yes... you're right. But PCG is not the only reason. I tested four configs (mean accept length):

spec v2 + PCG: 7.44
spec v1 + PCG: 8.01
spec v2: 7.89
spec v1: 8.02

@dcw02
Collaborator Author

dcw02 commented Apr 17, 2026

@moehanabi Can you turn off piecewise CUDA graphs and retest? That feature is experimental in general, and I sometimes noticed correctness issues with it enabled in my testing.

Yes... you're right. But PCG is not the only reason. I tested four configs (mean accept length):

spec v2 + PCG: 7.44, spec v1 + PCG: 8.01, spec v2: 7.89, spec v1: 8.02

Thanks, I will try to reproduce.
