
Allow piecewise CUDA graph with speculative decoding #22128

Merged

ispobock merged 12 commits into sgl-project:main from narutolhy:feat/enable-pcg-with-mtp on Apr 17, 2026

Conversation

@narutolhy (Contributor) commented Apr 5, 2026

Summary

  • Allow --enable-piecewise-cuda-graph to coexist with all speculative decoding algorithms (EAGLE/EAGLE3/NEXTN/STANDALONE/NGRAM)
  • Previously, every speculative algorithm disabled PCG ("Piecewise Cuda Graph set default" #16331 added this as a safety measure when PCG became the default, without testing spec-decode compatibility)

Motivation

PCG and speculative decoding operate on independent forward paths:

  • PCG: captures/replays graphs for prefill/extend (ForwardMode.EXTEND) with spec_info=None
  • Speculative: draft/verify uses decode CUDA graphs (ForwardMode.TARGET_VERIFY)

The restriction was added in #16331 as a conservative safety measure when PCG became default-enabled. The original PCG implementation (#10062) had no speculative restriction.

Accuracy Verification (GSM8K, 50 questions per config)

| Algorithm | Model | PCG Score | No-PCG Score | Acceptance | Status |
| --- | --- | --- | --- | --- | --- |
| EAGLE/NEXTN | Qwen3.5-35B-A3B FP8 TP2 | 0.980 | 0.980 | 3.46 | ✅ Identical |
| EAGLE3 | Qwen3-30B-A3B TP2 | 1.000 | | 4.23 | |
| STANDALONE | Qwen3-8B FP8 TP2 | 0.380 | 0.400 | 3.11 | ✅ Within noise |

STANDALONE's lower overall score (0.560 → 0.400) comes from the weak draft model (Qwen3-0.6B) and is present equally with and without PCG.

Benchmark (Qwen3.5-35B-A3B FP8, TP2, H100)

--mamba-scheduler-strategy extra_buffer for all configs.

| Config | rate=1 ITL | rate=5 TTFT | rate=5 ITL |
| --- | --- | --- | --- |
| Baseline (no overlap, no PCG, no MTP) | 21.95 ms | | |
| extra_buffer + MTP | 3.00 ms | 253 ms | 6.06 ms |
| extra_buffer + PCG + MTP (this PR) | 2.89 ms | 147 ms | 5.56 ms |

PCG adds prefill acceleration on top of MTP's decode speedup:

  • TTFT (rate=5): 253ms → 147ms (-42%)

Modification

server_args.py: Removed the blanket disable of PCG for speculative decoding. Added comments explaining why PCG and speculative decoding are compatible (independent forward paths).
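
A minimal sketch of the change, using the attribute names from the snippet quoted in the review thread below; the actual diff and comments may differ:

# python/sglang/srt/server_args.py (sketch, not the verbatim diff)

# Before this PR: any speculative algorithm force-disabled PCG.
if self.speculative_algorithm is not None:
    self.disable_piecewise_cuda_graph = True

# After this PR: the guard above is gone. PCG only captures/replays
# EXTEND-mode (prefill) graphs with spec_info=None, while speculative
# draft/verify runs through decode CUDA graphs (TARGET_VERIFY), so the
# two paths never share a captured graph.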

Test plan

  • EAGLE/NEXTN (MTP) + PCG: accuracy 0.980 (= baseline)
  • EAGLE3 + PCG: accuracy 1.000
  • STANDALONE + PCG: accuracy 0.380 (= no-PCG 0.400, within noise)
  • Added test: test/registered/piecewise_cuda_graph/test_pcg_with_mtp.py
    • TestPCGWithMTP: Qwen3.5-35B-A3B + NEXTN + PCG
    • TestPCGWithEAGLE3: Qwen3-30B-A3B + EAGLE3 + PCG
    • TestPCGWithSTANDALONE: Qwen3-8B + STANDALONE + PCG
  • CI tests

🤖 Generated with Claude Code

@gemini-code-assist (Bot) left a comment

Code Review

This pull request updates the server arguments to allow the 'NEXTN' speculative decoding algorithm to work with piecewise CUDA graphs, as they do not conflict. The review feedback suggests using a set for checking compatible algorithms to enhance code maintainability and extensibility.

Comment thread: python/sglang/srt/server_args.py (Outdated)
Comment on lines 1073 to 1077:

if (
    self.speculative_algorithm is not None
    and self.speculative_algorithm != "NEXTN"
):
    self.disable_piecewise_cuda_graph = True

Severity: medium

For better maintainability and to make it easier to add other compatible speculative decoding algorithms in the future, consider using a set for the check. This makes the intent clearer and the code more extensible.

Suggested change:

 if (
     self.speculative_algorithm is not None
-    and self.speculative_algorithm != "NEXTN"
+    and self.speculative_algorithm not in {"NEXTN"}
 ):
     self.disable_piecewise_cuda_graph = True

@narutolhy (Contributor, Author)

Closing: found that PCG + speculative decoding causes accuracy degradation (GSM8K 12.5% vs expected >75%). The original restriction was correct. PCG captures with spec_info=None but the model forward path behaves differently with speculative decoding active, causing incorrect outputs.

@narutolhy narutolhy closed this Apr 5, 2026
@narutolhy (Contributor, Author)

Reopening: the accuracy issue was caused by missing --reasoning-parser qwen3 in the test, not PCG/MTP incompatibility.

Re-tested with proper reasoning parser on a fresh machine (2xH100):

| Config | GSM8K (50q) | MTP Acceptance |
| --- | --- | --- |
| Baseline (extra_buffer) | 0.980 | N/A |
| MTP alone | 0.980 | 3.44 |
| PCG + MTP | 0.980 | 3.46 |

All three configs produce identical accuracy. PCG and MTP are fully compatible.

Updated the test to include --reasoning-parser qwen3 and thinking_mode='qwen3'.
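
A sketch of the relevant launch flags the updated test would pass, assuming sglang's usual pattern of handing an other_args list to the test server (illustrative only, not the actual test file):

other_args = [
    "--speculative-algorithm", "NEXTN",
    "--enable-piecewise-cuda-graph",
    "--reasoning-parser", "qwen3",  # the flag whose absence caused the bad scores
]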

@narutolhy narutolhy reopened this Apr 5, 2026
@narutolhy narutolhy force-pushed the feat/enable-pcg-with-mtp branch from e72639b to ae676ac on April 5, 2026 01:21
@github-actions (Bot) added the documentation (Improvements or additions to documentation) and deepseek labels on Apr 5, 2026
@narutolhy narutolhy force-pushed the feat/enable-pcg-with-mtp branch from ae676ac to 9112570 on April 5, 2026 01:21
@narutolhy (Contributor, Author)

Extended Verification — All Speculative Algorithms

Tested PCG compatibility with every available speculative algorithm:

| Algorithm | Model | PCG Score | No-PCG Score | Acceptance | Status |
| --- | --- | --- | --- | --- | --- |
| EAGLE/NEXTN (MTP) | Qwen3.5-35B-A3B FP8 TP2 | 0.980 | 0.980 | 3.46 | ✅ Identical |
| EAGLE3 | Qwen3-30B-A3B TP2 | 1.000 | | 4.23 | |
| STANDALONE | Qwen3-8B FP8 TP2 | 0.380 | 0.400 | 3.11 | ✅ Within noise (1 question / 50) |
| NGRAM | Qwen3-8B | N/A | N/A | | ⚠️ C++ compile error in v0.5.10rc0 (unrelated to PCG) |

Conclusion: PCG has zero impact on accuracy across all tested speculative algorithms. The STANDALONE score drop (0.560 → 0.400) comes from the weak draft model (Qwen3-0.6B) and is present equally with and without PCG.

@narutolhy narutolhy force-pushed the feat/enable-pcg-with-mtp branch 4 times, most recently from 9a103bf to 4b89c5e on April 5, 2026 05:00
@narutolhy narutolhy force-pushed the feat/enable-pcg-with-mtp branch from 4b89c5e to c1602df on April 5, 2026 05:06
@narutolhy narutolhy requested a review from hebiao064 as a code owner on April 5, 2026 05:06
@narutolhy narutolhy force-pushed the feat/enable-pcg-with-mtp branch 9 times, most recently from c36a8f4 to 528f957 on April 5, 2026 07:49
Commit message:

PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify uses decode CUDA graphs or eager execution

Key safety guard: PCG's can_run() now explicitly rejects TARGET_VERIFY
mode, since PCG graphs are captured with EXTEND/spec_info=None and
must not be replayed for verify batches that have different spec_info
and capture_hidden_mode.

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this
as a conservative safety measure when PCG became default-enabled).
The original PCG implementation (sgl-project#10062) had no speculative restriction.

Accuracy verification (GSM8K, 50 questions per config):
  EAGLE/NEXTN: PCG=0.970, acceptance=3.44
  EAGLE3:      PCG=1.000, acceptance=4.23 (verified on TP2)
  STANDALONE:  PCG=0.840, acceptance=3.24

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
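
A minimal sketch of that can_run() guard; the method shape and attribute names are assumptions based on the check quoted later in this thread, not the verbatim diff:

from sglang.srt.model_executor.forward_batch_info import ForwardMode

def can_run(self, forward_batch):
    # PCG graphs are captured in EXTEND mode with spec_info=None; a
    # TARGET_VERIFY batch carries a different spec_info and
    # capture_hidden_mode, so it must never replay those graphs.
    if forward_batch.forward_mode == ForwardMode.TARGET_VERIFY:
        return False
    if forward_batch.capture_hidden_mode != self.capture_hidden_mode:
        return False
    return True  # remaining batch-size/shape checks elided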
@narutolhy narutolhy force-pushed the feat/enable-pcg-with-mtp branch from 528f957 to 2015039 on April 5, 2026 07:59
@narutolhy narutolhy changed the title from "Allow piecewise CUDA graph with MTP (NEXTN) speculative decoding" to "Allow piecewise CUDA graph with speculative decoding" on Apr 5, 2026
@narutolhy (Contributor, Author)

/rerun-failed-ci

8 similar comments
@nvpohanh (Collaborator)

cc @YAMY1234 @hlu1 @nvjullin for vis

@narutolhy (Contributor, Author)

/rerun-failed-ci

1 similar comment

@cs-cat (Contributor) commented Apr 16, 2026

Hi @narutolhy, I tested this PR on Qwen3.5-27B. However, there appears to be some performance degradation.
server command:

SGLANG_ENABLE_SPEC_V2=True python -m sglang.launch_server \
    --host 0.0.0.0 --port 30001 \
    --model-path /models/Qwen3.5-27B-FP8 \
    --served-model-name qwen \
    --mem-fraction-static 0.86 \
    --context-length 262144 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --speculative-algo NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --max-running-requests 6 \
    --trust-remote-code \
    --sleep-on-idle \
    --kv-cache-dtype fp8_e4m3 \
    --enable-multimodal \
    --tp-size 2 \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
    --mamba-scheduler-strategy extra_buffer

benchmark command:

python3 -m sglang.bench_serving --backend sglang --num-prompt 256 --port 30001 --max-concurrency 1 --seed 2 --served-model-name qwen --model /models/Qwen3.5-27B-FP8

before PR:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     256       
Benchmark duration (s):                  642.09    
Total input tokens:                      90070     
Total input text tokens:                 90070     
Total generated tokens:                  50148     
Total generated tokens (retokenized):    50138     
Request throughput (req/s):              0.40      
Input token throughput (tok/s):          140.28    
Output token throughput (tok/s):         78.10     
Peak output token throughput (tok/s):    115.00    
Peak concurrent requests:                4         
Total token throughput (tok/s):          218.38    
Concurrency:                             1.00      
Accept length:                           2.98      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2507.06   
Median E2E Latency (ms):                 1616.40   
P90 E2E Latency (ms):                    5882.81   
P99 E2E Latency (ms):                    10284.57  
---------------Time to First Token----------------
Mean TTFT (ms):                          198.80    
Median TTFT (ms):                        162.42    
P99 TTFT (ms):                           818.03    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.04     
Median TPOT (ms):                        12.36     
P99 TPOT (ms):                           28.52     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           11.84     
Median ITL (ms):                         8.75      
P95 ITL (ms):                            34.76     
P99 ITL (ms):                            35.04     
Max ITL (ms):                            45.31     
==================================================

after PR:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     256       
Benchmark duration (s):                  671.34    
Total input tokens:                      90070     
Total input text tokens:                 90070     
Total generated tokens:                  50148     
Total generated tokens (retokenized):    50138     
Request throughput (req/s):              0.38      
Input token throughput (tok/s):          134.16    
Output token throughput (tok/s):         74.70     
Peak output token throughput (tok/s):    115.00    
Peak concurrent requests:                4         
Total token throughput (tok/s):          208.86    
Concurrency:                             1.00      
Accept length:                           2.98      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2621.29   
Median E2E Latency (ms):                 1735.10   
P90 E2E Latency (ms):                    6057.46   
P99 E2E Latency (ms):                    10436.05  
---------------Time to First Token----------------
Mean TTFT (ms):                          312.83    
Median TTFT (ms):                        272.58    
P99 TTFT (ms):                           1264.48   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.03     
Median TPOT (ms):                        12.37     
P99 TPOT (ms):                           28.48     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           11.84     
Median ITL (ms):                         8.75      
P95 ITL (ms):                            34.74     
P99 ITL (ms):                            35.04     
Max ITL (ms):                            45.79     
==================================================

@narutolhy (Contributor, Author)

Hi @cs-cat, thanks for testing this. I'll try to reproduce it and take a look. With piecewise CUDA graphs the first few runs tend to be slower and need a thorough warmup, which might explain the gap; I'll capture a profile to check.
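
For anyone reproducing this, a minimal warmup sketch to run before the timed benchmark; it assumes the native /generate endpoint and the port from the server command above:

import requests

# A few untimed requests let PCG finish graph capture/compilation
# before the measured run starts.
for _ in range(8):
    requests.post(
        "http://localhost:30001/generate",
        json={"text": "warmup", "sampling_params": {"max_new_tokens": 32}},
        timeout=120,
    )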

@cs-cat (Contributor) commented Apr 16, 2026

Hi @narutolhy, I tested this on a newer commit (with HiCache disabled due to a known bug) and found that this PR gives a performance improvement.
The previous tests were conducted on commit 0011d2a plus some performance patches/bugfixes ported from main/PR; the earlier degradation was likely caused by other issues or noise.

commit a4cf2ea1:
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     256       
Benchmark duration (s):                  667.91    
Total input tokens:                      90070     
Total input text tokens:                 90070     
Total generated tokens:                  50148     
Total generated tokens (retokenized):    50135     
Request throughput (req/s):              0.38      
Input token throughput (tok/s):          134.85    
Output token throughput (tok/s):         75.08     
Peak output token throughput (tok/s):    115.00    
Peak concurrent requests:                4         
Total token throughput (tok/s):          209.93    
Concurrency:                             1.00      
Accept length:                           2.96      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2608.07   
Median E2E Latency (ms):                 1758.14   
P90 E2E Latency (ms):                    6044.83   
P99 E2E Latency (ms):                    10340.22  
---------------Time to First Token----------------
Mean TTFT (ms):                          292.48    
Median TTFT (ms):                        255.93    
P99 TTFT (ms):                           1202.88   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.11     
Median TPOT (ms):                        12.37     
P99 TPOT (ms):                           32.81     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           11.88     
Median ITL (ms):                         8.73      
P95 ITL (ms):                            34.71     
P99 ITL (ms):                            34.91     
Max ITL (ms):                            53.48     
==================================================

with PR:
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     256       
Benchmark duration (s):                  639.93    
Total input tokens:                      90070     
Total input text tokens:                 90070     
Total generated tokens:                  50148     
Total generated tokens (retokenized):    50135     
Request throughput (req/s):              0.40      
Input token throughput (tok/s):          140.75    
Output token throughput (tok/s):         78.36     
Peak output token throughput (tok/s):    115.00    
Peak concurrent requests:                5         
Total token throughput (tok/s):          219.11    
Concurrency:                             1.00      
Accept length:                           2.96      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2498.69   
Median E2E Latency (ms):                 1616.84   
P90 E2E Latency (ms):                    5964.01   
P99 E2E Latency (ms):                    10239.41  
---------------Time to First Token----------------
Mean TTFT (ms):                          181.40    
Median TTFT (ms):                        151.64    
P99 TTFT (ms):                           797.77    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.13     
Median TPOT (ms):                        12.39     
P99 TPOT (ms):                           32.86     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           11.89     
Median ITL (ms):                         8.73      
P95 ITL (ms):                            34.77     
P99 ITL (ms):                            34.94     
Max ITL (ms):                            38.06     
==================================================

@narutolhy (Contributor, Author)

Hi @cs-cat, thank you again. I will try to merge it soon.

@narutolhy (Contributor, Author)

/rerun-failed-ci

@Oasis-Git (Collaborator) left a comment


A small modification comment; otherwise it looks good to me.

Comment thread: python/sglang/srt/server_args.py (Outdated)
@Oasis-Git (Collaborator)

CI Pass

@ispobock ispobock merged commit 5fa0c6a into sgl-project:main Apr 17, 2026
60 of 87 checks passed
jmamou pushed a commit to jmamou/sglang that referenced this pull request Apr 20, 2026
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request Apr 27, 2026
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Chen-0210 (Contributor) commented May 8, 2026

Hi @narutolhy, did your MTP+PCG benchmark run with --enable-return-hidden-states to enable the full capture mode? I noticed that the default value of capture_hidden_mode in PCG capture is NULL, so does that mean PCG is not actually exercised by default when MTP is enabled?

if forward_batch.capture_hidden_mode != self.capture_hidden_mode:
    return False
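
To illustrate the mismatch being asked about (the enum values follow sglang's CaptureHiddenMode naming, but treat the exact members and defaults as assumptions):

from enum import Enum, auto

class CaptureHiddenMode(Enum):
    NULL = auto()  # PCG's default at capture time: no hidden states kept
    LAST = auto()  # keep only the last token's hidden state
    FULL = auto()  # keep all hidden states, e.g. for a draft model pass

pcg_graph_mode = CaptureHiddenMode.NULL  # baked in when the graph is captured
mtp_batch_mode = CaptureHiddenMode.FULL  # what an MTP verify batch may request

# The quoted can_run() check then fails, so PCG is silently bypassed:
print(mtp_batch_mode != pcg_graph_mode)  # True -> can_run() returns False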

@narutolhy (Contributor, Author)

Hi @Chen-0210
You're right that with default flags, the MTP+PCG run isn't actually exercising PCG on the target prefill — my earlier numbers need to be re-checked under that lens. Sorry for the noise.

I'll work on a follow-up to make this actually work with MTP.


Labels

deepseek · documentation (Improvements or additions to documentation) · run-ci · speculative-decoding

9 participants