[Feature] Add spec v2 (overlap scheduling) to DFlash speculative decoding support #20547
dcw02 wants to merge 74 commits into sgl-project:main
Conversation
@dcw02 Does it currently support PCG?
I've enabled it without issues with `--enforce-piecewise-cuda-graph`.
I'm closing this PR and reopening it soon from another branch; I have some extra improvements.
@dcw02 Have you observed any measurable performance improvements after enabling --enforce-piecewise-cuda-graph?
@dcw02 Hi! Thanks for your great work!
I haven't seen a lower accept length than spec v1 under greedy decoding; there might be some minute differences since we have some extra optimizations. If your requests use temperature > 0, accept length will also differ run to run.
Yes, it improves performance, some models more than others; it especially helps qwen3.5.
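As a quick illustration of that run-to-run variance, you can send the same request twice at temperature > 0 against the OpenAI-compatible endpoint and diff the outputs. A minimal sketch; the port 8001 and endpoint are assumptions taken from the repro command below:

```python
import requests

# Minimal sketch: same prompt, two runs. At temperature > 0 the sampled
# outputs (and hence the measured accept length) can vary run to run.
# URL/port are assumptions based on the repro command below.
URL = "http://localhost:8001/v1/chat/completions"
payload = {
    "messages": [{"role": "user", "content": "prompt"}],
    "temperature": 0.7,
    "max_tokens": 128,
}
outs = [
    requests.post(URL, json=payload).json()["choices"][0]["message"]["content"]
    for _ in range(2)
]
print("identical across runs:", outs[0] == outs[1])
```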
Thanks for your reply.
Can you give me a repro script?
```bash
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path "/workspace/data/model/qwen3-30b-a3b-fp8" \
  --port "8001" --host "0.0.0.0" \
  --tp-size "1" --dp-size "1" --base-gpu-id "0" \
  --mem-fraction-static "0.7" \
  --attention-backend "fa3" --sampling-backend "flashinfer" \
  --chunked-prefill-size "-1" --max-prefill-tokens "16384" \
  --max-running-requests 128 --stream-interval 1 --enable-metrics \
  --speculative-algorithm DFLASH --speculative-num-steps 15 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 16 \
  --speculative-draft-model-path /workspace/data/model/qwen3_dflash \
  --enforce-piecewise-cuda-graph --piecewise-cuda-graph-max-tokens 8000 \
  --context-length 65535 \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 32}' \
  --nccl-port 8400 --cuda-graph-max-bs 128
```

at this commit.

Request:

```json
{"messages": [{"content": "prompt", "role": "user"}], "stream": true, "temperature": 0, "top_p": 0.9, "penalty_score": 1.1, "chat_template_kwargs": {"enable_thinking": false}, "max_tokens": 512, "stream_options": {"include_usage": true}}
```

PS: the dflash model was trained with specforge by myself, since my use case is generating only JSON of about 30 tokens. The following is my dflash model config:

```json
{
"architectures": [
"DFlashDraftModel"
],
"attention_bias": false,
"attention_dropout": 0.0,
"auto_map": {
"AutoModel": "dflash.DFlashDraftModel"
},
"block_size": 16,
"bos_token_id": 151643,
"dflash_config": {
"mask_token_id": 151669,
"target_layer_ids": [1, 12, 23, 34, 45]
},
"dtype": "bfloat16",
"eos_token_id": 151645,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 6144,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 131072,
"max_window_layers": 8,
"model_type": "qwen3",
"num_attention_heads": 32,
"num_hidden_layers": 8,
"num_key_value_heads": 4,
"num_target_layers": 48,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000,
"sliding_window": null,
"tie_word_embeddings": false,
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}
```
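As a side note, the config lines up with the launch flags above: `block_size` 16 matches `--speculative-num-draft-tokens 16`, and the five `target_layer_ids` must index into the target model's 48-layer stack (`num_target_layers`). A small sanity-check sketch; the file name `config.json` is an assumption and these checks are illustrative, not an sglang API:

```python
import json

# Illustrative sanity check for the DFlash draft config above.
with open("config.json") as f:
    cfg = json.load(f)

# block_size should match the server's --speculative-num-draft-tokens (16 here).
assert cfg["block_size"] == 16

# Each tapped target layer must be a valid index into the target model's
# layer stack (num_target_layers = 48 for this model).
ids = cfg["dflash_config"]["target_layer_ids"]
assert all(0 <= i < cfg["num_target_layers"] for i in ids)

print("block_size:", cfg["block_size"], "| target layers tapped:", ids)
```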
@moehanabi Can you turn off piecewise CUDA graphs and retest (i.e., drop `--enforce-piecewise-cuda-graph` and `--piecewise-cuda-graph-max-tokens` from the launch command)? That feature is experimental in general, and I sometimes noticed correctness issues with it enabled in my testing.
Cleaned up and moved to #23000.
Yes, you're right. spec v2 + PCG: 7.44
Thanks, I will try to reproduce.
Motivation
Add spec v2 path for DFlash. Should be merged after #16818
TLDR
B200, GSM8K, qwen3-8b, tp size 1, concurrency 32, max new tokens 2k, greedy decoding
9,688.26 tok/s -> 12,360.49 tok/s (~1.28x)
Modifications
Adds v2 worker and related files
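One way to picture the overlap: auxiliary work for the next step (e.g., draft-plan preparation) is launched on a side CUDA stream while the main stream keeps computing, and the two synchronize via an event. A toy PyTorch sketch of that pattern; `plan_draft` and `verify_target` are hypothetical stand-ins, not this PR's actual worker code:

```python
import torch

def plan_draft(x):
    # Hypothetical stand-in for preparing the next draft block's metadata.
    return x * 2

def verify_target(x):
    # Hypothetical stand-in for target-model verification of drafted tokens.
    return x + 1

compute_stream = torch.cuda.current_stream()
plan_stream = torch.cuda.Stream()
x = torch.ones(1 << 20, device="cuda")

plan_done = torch.cuda.Event()
with torch.cuda.stream(plan_stream):
    # Planning for the next step is launched on a side stream...
    next_plan = plan_draft(x)
    plan_done.record(plan_stream)

# ...so verification of the current step overlaps it on the main stream.
verified = verify_target(x)

# The main stream waits for the plan before the next draft step begins.
compute_stream.wait_event(plan_done)
torch.cuda.synchronize()
```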
Accuracy and Benchmarks
Tested on a GCP B200 machine.
Commands:
- v1 performance
- overlap scheduling (spec v2) performance
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci