
[Feature] Add spec v2 (overlap scheduling) to DFlash speculative decoding support #20547

Closed
dcw02 wants to merge 74 commits into sgl-project:main from modal-labs:dflash_v2

Conversation

@dcw02
Collaborator

@dcw02 dcw02 commented Mar 13, 2026

Motivation

Adds the spec v2 path for DFlash. This should be merged after #16818.

TLDR
B200, GSM8K, Qwen3-8B, TP size 1, concurrency 32, max new tokens 2k, greedy decoding:
9,688.26 tok/s -> 12,360.49 tok/s

Modifications

Adds v2 worker and related files
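
For intuition, here is a minimal, framework-free sketch of the overlap-scheduling idea: while the GPU verifies step N, the CPU already plans step N+1, so per-step CPU overhead hides behind GPU execution. This is not SGLang's actual v2 worker; plan_draft and verify_on_gpu are hypothetical stand-ins.

import time
from concurrent.futures import ThreadPoolExecutor

def plan_draft(step):
    # CPU-side work: build draft inputs / attention metadata for `step`.
    time.sleep(0.002)
    return f"plan-{step}"

def verify_on_gpu(plan):
    # GPU-side work: target forward pass + token verification.
    time.sleep(0.005)
    return f"result-{plan}"

def run_overlapped(num_steps):
    results = []
    with ThreadPoolExecutor(max_workers=1) as planner:
        next_plan = planner.submit(plan_draft, 0)  # plan step 0 up front
        for step in range(num_steps):
            plan = next_plan.result()  # wait for the current step's plan
            if step + 1 < num_steps:
                # Kick off planning for step+1 *before* verifying this step,
                # so planning overlaps with (is hidden behind) verification.
                next_plan = planner.submit(plan_draft, step + 1)
            results.append(verify_on_gpu(plan))
    return results

print(run_overlapped(4))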

Accuracy and Benchmarks

Tested on a GCP B200 machine.

Commands:

# regular v1
python benchmark/dflash/bench_dflash_gsm8k_sweep.py --tp-sizes 1 --concurrencies 32 --attention-backends trtllm_mha --speculative-draft-attention-backend fa4 --page-size 64 --skip-baseline

# overlap scheduling (spec v2)
SGLANG_ENABLE_SPEC_V2=1 SGLANG_ENABLE_DFLASH_SPEC_V2=1 SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 python benchmark/dflash/bench_dflash_gsm8k_sweep.py --tp-sizes 1 --concurrencies 32 --attention-backends trtllm_mha --speculative-draft-attention-backend fa4 --page-size 64 --skip-baseline

v1 performance

=== DFLASH GSM8K Sweep Summary ===
target_model=Qwen/Qwen3-8B
draft_model=z-lab/Qwen3-8B-DFlash-b16
max_new_tokens=2048
sampling=temperature:0.0, top_p:1.0, top_k:1
attention_backends=trtllm_mha
speculative_draft_attention_backend=fa4
speculative_dflash_draft_window_size=None
tp_sizes=1
concurrencies=32
questions_per_concurrency_base=128
device_sm=100
skip_baseline=True

=== Backend: trtllm_mha ===

Baseline output tok/s
tp\conc   32
-------  ---
      1  N/A

Baseline accuracy
tp\conc   32
-------  ---
      1  N/A

DFLASH output tok/s
tp\conc        32
-------  --------
      1  9,688.26

DFLASH accuracy
tp\conc     32
-------  -----
      1  0.850

Speedup (DFLASH / baseline)
tp\conc   32
-------  ---
      1  N/A

DFLASH acceptance length (mean spec_accept_length)
tp\conc     32
-------  -----
      1  6.470

overlap scheduling (spec v2) performance

=== DFLASH GSM8K Sweep Summary ===
target_model=Qwen/Qwen3-8B
draft_model=z-lab/Qwen3-8B-DFlash-b16
max_new_tokens=2048
sampling=temperature:0.0, top_p:1.0, top_k:1
attention_backends=trtllm_mha
speculative_draft_attention_backend=fa4
speculative_dflash_draft_window_size=None
tp_sizes=1
concurrencies=32
questions_per_concurrency_base=128
device_sm=100
skip_baseline=True

=== Backend: trtllm_mha ===

Baseline output tok/s
tp\conc   32
-------  ---
      1  N/A

Baseline accuracy
tp\conc   32
-------  ---
      1  N/A

DFLASH output tok/s
tp\conc         32
-------  ---------
      1  12,360.49

DFLASH accuracy
tp\conc     32
-------  -----
      1  0.850

Speedup (DFLASH / baseline)
tp\conc   32
-------  ---
      1  N/A

DFLASH acceptance length (mean spec_accept_length)
tp\conc     32
-------  -----
      1  6.467
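
Since skip_baseline=True, both sweeps report N/A for the DFLASH/baseline speedup; the v2-over-v1 ratio from the two tables above works out to about 1.28x (simple arithmetic, not part of the sweep output):

# v2-over-v1 output throughput ratio from the two sweeps above
v1_tok_s, v2_tok_s = 9688.26, 12360.49
print(f"{v2_tok_s / v1_tok_s:.3f}x")  # ~1.276x, i.e. ~27.6% more output tok/s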

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@dcw02 dcw02 requested review from ch-wan and fzyzcjy as code owners April 7, 2026 23:31
@ggg-s

ggg-s commented Apr 9, 2026

@dcw02 Does it currently support PCG?

@dcw02
Collaborator Author

dcw02 commented Apr 9, 2026

@dcw02 Does it currently support PCG?

I've enabled it without issues using --enforce-piecewise-cuda-graph.
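
For reference, a hedged sketch of how that flag combines with the v2 env vars (model paths are placeholders; all flags appear elsewhere in this thread):

SGLANG_ENABLE_SPEC_V2=1 SGLANG_ENABLE_DFLASH_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path <target-model> \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path <dflash-draft-model> \
  --enforce-piecewise-cuda-graph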

@dcw02
Collaborator Author

dcw02 commented Apr 9, 2026

I'm closing this PR and reopening it soon from another branch; I have some extra improvements.

@ggg-s

ggg-s commented Apr 13, 2026

@dcw02 Have you observed any measurable performance improvements after enabling --enforce-piecewise-cuda-graph?

@moehanabi

@dcw02 Hi! Thanks for your great work!
I found a lower accept length than spec v1. Do you get the same result?

@dcw02
Collaborator Author

dcw02 commented Apr 14, 2026

@dcw02 Hi! Thanks for your great work! I found a lower accept length than spec v1. Do you get the same result?

I haven't found a lower accept length than spec v1 in greedy decoding; there might be some minute differences, since we have some extra optimizations. If your requests use temperature > 0, there will also be run-to-run accept length differences.
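
To illustrate the last point, a minimal sketch of generic speculative verification (not DFlash's exact implementation): under greedy decoding the accept length is deterministic, while with temperature > 0 each draft token x is accepted with probability min(1, p(x)/q(x)), so the accept length varies run to run.

import random

def greedy_accept_len(draft_tokens, target_argmax):
    # Greedy: accept draft tokens while they match the target's argmax.
    # Deterministic: same inputs always give the same accept length.
    n = 0
    for d, t in zip(draft_tokens, target_argmax):
        if d != t:
            break
        n += 1
    return n

def sampled_accept_len(draft_tokens, p_target, q_draft, rng):
    # Temperature > 0: accept token x with probability min(1, p(x)/q(x)),
    # so the accept length is a random variable even for a fixed prompt.
    # (Real speculative sampling also resamples a bonus token on rejection,
    # omitted here.)
    n = 0
    for x in draft_tokens:
        if rng.random() < min(1.0, p_target[x] / q_draft[x]):
            n += 1
        else:
            break
    return n

p = {0: 0.5, 1: 0.3, 2: 0.2}  # toy target distribution
q = {0: 0.4, 1: 0.4, 2: 0.2}  # toy draft distribution
rng = random.Random(0)
print(greedy_accept_len([0, 1, 2], [0, 1, 0]))                       # always 2
print([sampled_accept_len([0, 1, 0], p, q, rng) for _ in range(5)])  # varies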

@dcw02
Collaborator Author

dcw02 commented Apr 14, 2026

@dcw02 Have you observed any measurable performance improvements after enabling --enforce-piecewise-cuda-graph?

Yes, it improves performance, some models more than others. It especially improves Qwen3.5.

@moehanabi

@dcw02 Hi! Thanks for your great work! I found a lower accept length than spec v1. Do you get the same result?

I haven't found a lower accept length than spec v1 in greedy decoding; there might be some minute differences, since we have some extra optimizations. If your requests use temperature > 0, there will also be run-to-run accept length differences.

Thanks for your reply.
That's strange, because I tested with temperature = 0 and on your dflash_v2_experimental branch.

@dcw02
Collaborator Author

dcw02 commented Apr 15, 2026

@dcw02 Hi! Thanks for your great work! I found a lower accept length than spec v1. Do you get the same result?

I haven't found a lower accept length than spec v1 in greedy decoding; there might be some minute differences, since we have some extra optimizations. If your requests use temperature > 0, there will also be run-to-run accept length differences.

Thanks for your reply. That's strange, because I tested with temperature = 0 and on your dflash_v2_experimental branch.

Can you give me a repro script?

@moehanabi

moehanabi commented Apr 16, 2026

@dcw02 Hi! Thanks for your great work! I found a lower accept length than spec v1. Do you get the same result?

I haven't found a lower accept length than spec v1 in greedy decoding; there might be some minute differences, since we have some extra optimizations. If your requests use temperature > 0, there will also be run-to-run accept length differences.

Thanks for your reply. That's strange, because I tested with temperature = 0 and on your dflash_v2_experimental branch.

Can you give me a repro script?

export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path "/workspace/data/model/qwen3-30b-a3b-fp8" \
  --port "8001" --host "0.0.0.0" \
  --tp-size "1" --dp-size "1" --base-gpu-id "0" \
  --mem-fraction-static "0.7" \
  --attention-backend "fa3" --sampling-backend "flashinfer" \
  --chunked-prefill-size "-1" --max-prefill-tokens "16384" \
  --max-running-requests 128 --stream-interval 1 --enable-metrics \
  --speculative-algorithm DFLASH \
  --speculative-num-steps 15 --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 16 \
  --speculative-draft-model-path /workspace/data/model/qwen3_dflash \
  --enforce-piecewise-cuda-graph --piecewise-cuda-graph-max-tokens 8000 \
  --context-length 65535 \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 32}' \
  --nccl-port 8400 --cuda-graph-max-bs 128

at this commit

request:

{"messages": [{"content": "prompt", "role": "user"}], "stream": true, "temperature": 0, "top_p": 0.9, "penalty_score": 1.1, "chat_template_kwargs": {"enable_thinking": false}, "max_tokens": 512, "stream_options": {"include_usage": true}}

PS: I trained the DFlash model myself with SpecForge, because my use case is to generate only JSON of about 30 tokens. The following is my DFlash model config:

{
  "architectures": [
    "DFlashDraftModel"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoModel": "dflash.DFlashDraftModel"
  },
  "block_size": 16,
  "bos_token_id": 151643,
  "dflash_config": {
    "mask_token_id": 151669,
    "target_layer_ids": [1, 12, 23, 34, 45]
  },
  "dtype": "bfloat16",
  "eos_token_id": 151645,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 6144,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 131072,
  "max_window_layers": 8,
  "model_type": "qwen3",
  "num_attention_heads": 32,
  "num_hidden_layers": 8,
  "num_key_value_heads": 4,
  "num_target_layers": 48,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}

@dcw02
Collaborator Author

dcw02 commented Apr 16, 2026

@moehanabi Can you turn off piecewise CUDA graphs and retest? That feature is experimental in general, and I sometimes noticed correctness issues with it enabled in my testing.

@dcw02
Collaborator Author

dcw02 commented Apr 16, 2026

Cleaned up and moved to #23000.

@dcw02 dcw02 closed this Apr 16, 2026
@moehanabi

@moehanabi Can you turn off piecewise CUDA graphs and retest? That feature is experimental in general, and I sometimes noticed correctness issues with it enabled in my testing.

Yes... you're right. But PCG is not the only reason. I tested four configs (mean accept length):

spec v2 + PCG: 7.44
spec v1 + PCG: 8.01
spec v2: 7.89
spec v1: 8.02

@dcw02
Collaborator Author

dcw02 commented Apr 17, 2026

@moehanabi Can you turn off piecewise CUDA graphs and retest? That feature is experimental in general, and I sometimes noticed correctness issues with it enabled in my testing.

Yes... you're right. But PCG is not the only reason. I tested four configs (mean accept length):

spec v2 + PCG: 7.44, spec v1 + PCG: 8.01, spec v2: 7.89, spec v1: 8.02

Thanks, I will try to reproduce.
