
Allow piecewise CUDA graph with speculative decoding #22128

Merged

ispobock merged 12 commits into sgl-project:main from narutolhy:feat/enable-pcg-with-mtp on Apr 17, 2026

Conversation

@narutolhy (Contributor) commented Apr 5, 2026

Summary

  • Allow --enable-piecewise-cuda-graph to coexist with all speculative decoding algorithms (EAGLE/EAGLE3/NEXTN/STANDALONE/NGRAM)
  • Previously, every speculative algorithm disabled PCG ("Piecewise Cuda Graph set default" #16331 added this as a safety measure when PCG became the default, without testing spec-decode compatibility)

Motivation

PCG and speculative decoding operate on independent forward paths:

  • PCG: captures/replays graphs for prefill/extend (ForwardMode.EXTEND) with spec_info=None
  • Speculative: draft/verify uses decode CUDA graphs (ForwardMode.TARGET_VERIFY)

The restriction was added in #16331 as a conservative safety measure when PCG became default-enabled. The original PCG implementation (#10062) had no speculative restriction.

Accuracy Verification (GSM8K, 50 questions per config)

| Algorithm | Model | PCG Score | No-PCG Score | Acceptance | Status |
| --- | --- | --- | --- | --- | --- |
| EAGLE/NEXTN | Qwen3.5-35B-A3B FP8 TP2 | 0.980 | 0.980 | 3.46 | ✅ Identical |
| EAGLE3 | Qwen3-30B-A3B TP2 | 1.000 | | 4.23 | |
| STANDALONE | Qwen3-8B FP8 TP2 | 0.380 | 0.400 | 3.11 | ✅ Within noise |

STANDALONE's lower overall score (0.560 → 0.400) comes from the weak draft model (Qwen3-0.6B) and is present equally with and without PCG.

Benchmark (Qwen3.5-35B-A3B FP8, TP2, H100)

--mamba-scheduler-strategy extra_buffer for all configs.

| Config | rate=1 ITL | rate=5 TTFT | rate=5 ITL |
| --- | --- | --- | --- |
| Baseline (no overlap, no PCG, no MTP) | 21.95 ms | | |
| extra_buffer + MTP | 3.00 ms | 253 ms | 6.06 ms |
| extra_buffer + PCG + MTP (this PR) | 2.89 ms | 147 ms | 5.56 ms |

PCG adds prefill acceleration on top of MTP's decode speedup:

  • TTFT (rate=5): 253ms → 147ms (-42%)

Modification

server_args.py: Removed the blanket disable of PCG for speculative decoding. Added comments explaining why PCG and speculative decoding are compatible (independent forward paths).
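
A minimal sketch of the change, using the attribute names from the snippet quoted in the review thread below; the actual diff and comments may differ:

# python/sglang/srt/server_args.py (sketch, not the verbatim diff)

# Before this PR: any speculative algorithm force-disabled PCG.
if self.speculative_algorithm is not None:
    self.disable_piecewise_cuda_graph = True

# After this PR: the guard above is gone. PCG only captures/replays
# EXTEND-mode (prefill) graphs with spec_info=None, while speculative
# draft/verify runs through decode CUDA graphs (TARGET_VERIFY), so the
# two paths never share a captured graph.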

Test plan

  • EAGLE/NEXTN (MTP) + PCG: accuracy 0.980 (= baseline)
  • EAGLE3 + PCG: accuracy 1.000
  • STANDALONE + PCG: accuracy 0.380 (= no-PCG 0.400, within noise)
  • Added test: test/registered/piecewise_cuda_graph/test_pcg_with_mtp.py
    • TestPCGWithMTP: Qwen3.5-35B-A3B + NEXTN + PCG
    • TestPCGWithEAGLE3: Qwen3-30B-A3B + EAGLE3 + PCG
    • TestPCGWithSTANDALONE: Qwen3-8B + STANDALONE + PCG
  • CI tests

🤖 Generated with Claude Code

@gemini-code-assist (Bot) left a comment

Code Review

This pull request updates the server arguments to allow the 'NEXTN' speculative decoding algorithm to work with piecewise CUDA graphs, as they do not conflict. The review feedback suggests using a set for checking compatible algorithms to enhance code maintainability and extensibility.

Comment thread: python/sglang/srt/server_args.py (Outdated)
Comment on lines 1073 to 1077:

if (
    self.speculative_algorithm is not None
    and self.speculative_algorithm != "NEXTN"
):
    self.disable_piecewise_cuda_graph = True

Severity: medium

For better maintainability and to make it easier to add other compatible speculative decoding algorithms in the future, consider using a set for the check. This makes the intent clearer and the code more extensible.

Suggested change:

 if (
     self.speculative_algorithm is not None
-    and self.speculative_algorithm != "NEXTN"
+    and self.speculative_algorithm not in {"NEXTN"}
 ):
     self.disable_piecewise_cuda_graph = True

@narutolhy (Contributor, Author)

Closing: found that PCG + speculative decoding causes accuracy degradation (GSM8K 12.5% vs expected >75%). The original restriction was correct. PCG captures with spec_info=None but the model forward path behaves differently with speculative decoding active, causing incorrect outputs.

@narutolhy narutolhy closed this Apr 5, 2026
@narutolhy (Contributor, Author)

Reopening: the accuracy issue was caused by missing --reasoning-parser qwen3 in the test, not PCG/MTP incompatibility.

Re-tested with proper reasoning parser on a fresh machine (2xH100):

| Config | GSM8K (50q) | MTP Acceptance |
| --- | --- | --- |
| Baseline (extra_buffer) | 0.980 | N/A |
| MTP alone | 0.980 | 3.44 |
| PCG + MTP | 0.980 | 3.46 |

All three configs produce identical accuracy. PCG and MTP are fully compatible.

Updated the test to include --reasoning-parser qwen3 and thinking_mode='qwen3'.
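
A sketch of the relevant launch flags the updated test would pass, assuming sglang's usual pattern of handing an other_args list to the test server (illustrative only, not the actual test file):

other_args = [
    "--speculative-algorithm", "NEXTN",
    "--enable-piecewise-cuda-graph",
    "--reasoning-parser", "qwen3",  # the flag whose absence caused the bad scores
]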

@narutolhy narutolhy reopened this Apr 5, 2026
@narutolhy narutolhy force-pushed the feat/enable-pcg-with-mtp branch from e72639b to ae676ac on April 5, 2026 01:21
@github-actions (Bot) added the documentation (Improvements or additions to documentation) and deepseek labels on Apr 5, 2026
@narutolhy narutolhy force-pushed the feat/enable-pcg-with-mtp branch from ae676ac to 9112570 on April 5, 2026 01:21
@narutolhy (Contributor, Author)

Extended Verification — All Speculative Algorithms

Tested PCG compatibility with every available speculative algorithm:

| Algorithm | Model | PCG Score | No-PCG Score | Acceptance | Status |
| --- | --- | --- | --- | --- | --- |
| EAGLE/NEXTN (MTP) | Qwen3.5-35B-A3B FP8 TP2 | 0.980 | 0.980 | 3.46 | ✅ Identical |
| EAGLE3 | Qwen3-30B-A3B TP2 | 1.000 | | 4.23 | |
| STANDALONE | Qwen3-8B FP8 TP2 | 0.380 | 0.400 | 3.11 | ✅ Within noise (1 question / 50) |
| NGRAM | Qwen3-8B | N/A | N/A | | ⚠️ C++ compile error in v0.5.10rc0 (unrelated to PCG) |

Conclusion: PCG has zero impact on accuracy across all tested speculative algorithms. The STANDALONE score drop (0.560 → 0.400) comes from the weak draft model (Qwen3-0.6B) and is present equally with and without PCG.

@narutolhy narutolhy force-pushed the feat/enable-pcg-with-mtp branch 4 times, most recently from 9a103bf to 4b89c5e on April 5, 2026 05:00
@narutolhy narutolhy force-pushed the feat/enable-pcg-with-mtp branch from 4b89c5e to c1602df on April 5, 2026 05:06
@narutolhy narutolhy requested a review from hebiao064 as a code owner on April 5, 2026 05:06
@narutolhy narutolhy force-pushed the feat/enable-pcg-with-mtp branch 9 times, most recently from c36a8f4 to 528f957 on April 5, 2026 07:49
Commit message:

PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify uses decode CUDA graphs or eager execution

Key safety guard: PCG's can_run() now explicitly rejects TARGET_VERIFY
mode, since PCG graphs are captured with EXTEND/spec_info=None and
must not be replayed for verify batches that have different spec_info
and capture_hidden_mode.

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this
as a conservative safety measure when PCG became default-enabled).
The original PCG implementation (sgl-project#10062) had no speculative restriction.

Accuracy verification (GSM8K, 50 questions per config):
  EAGLE/NEXTN: PCG=0.970, acceptance=3.44
  EAGLE3:      PCG=1.000, acceptance=4.23 (verified on TP2)
  STANDALONE:  PCG=0.840, acceptance=3.24

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
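
A minimal sketch of that can_run() guard; the method shape and attribute names are assumptions based on the check quoted later in this thread, not the verbatim diff:

from sglang.srt.model_executor.forward_batch_info import ForwardMode

def can_run(self, forward_batch):
    # PCG graphs are captured in EXTEND mode with spec_info=None; a
    # TARGET_VERIFY batch carries a different spec_info and
    # capture_hidden_mode, so it must never replay those graphs.
    if forward_batch.forward_mode == ForwardMode.TARGET_VERIFY:
        return False
    if forward_batch.capture_hidden_mode != self.capture_hidden_mode:
        return False
    return True  # remaining batch-size/shape checks elided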
@narutolhy narutolhy force-pushed the feat/enable-pcg-with-mtp branch from 528f957 to 2015039 on April 5, 2026 07:59
@narutolhy narutolhy changed the title from "Allow piecewise CUDA graph with MTP (NEXTN) speculative decoding" to "Allow piecewise CUDA graph with speculative decoding" on Apr 5, 2026
@narutolhy (Contributor, Author)

/rerun-failed-ci

8 similar comments
@nvpohanh (Collaborator)

cc @YAMY1234 @hlu1 @nvjullin for vis

@narutolhy (Contributor, Author)

/rerun-failed-ci

1 similar comment

@cs-cat (Contributor) commented Apr 16, 2026

Hi @narutolhy, I tested this PR on Qwen3.5-27B. However, there appears to be some performance degradation.
server command:

SGLANG_ENABLE_SPEC_V2=True python -m sglang.launch_server \
    --host 0.0.0.0 --port 30001 \
    --model-path /models/Qwen3.5-27B-FP8 \
    --served-model-name qwen \
    --mem-fraction-static 0.86 \
    --context-length 262144 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --speculative-algo NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --max-running-requests 6 \
    --trust-remote-code \
    --sleep-on-idle \
    --kv-cache-dtype fp8_e4m3 \
    --enable-multimodal \
    --tp-size 2 \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
    --mamba-scheduler-strategy extra_buffer

benchmark command:

python3 -m sglang.bench_serving --backend sglang --num-prompt 256 --port 30001 --max-concurrency 1 --seed 2 --served-model-name qwen --model /models/Qwen3.5-27B-FP8

before PR:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     256       
Benchmark duration (s):                  642.09    
Total input tokens:                      90070     
Total input text tokens:                 90070     
Total generated tokens:                  50148     
Total generated tokens (retokenized):    50138     
Request throughput (req/s):              0.40      
Input token throughput (tok/s):          140.28    
Output token throughput (tok/s):         78.10     
Peak output token throughput (tok/s):    115.00    
Peak concurrent requests:                4         
Total token throughput (tok/s):          218.38    
Concurrency:                             1.00      
Accept length:                           2.98      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2507.06   
Median E2E Latency (ms):                 1616.40   
P90 E2E Latency (ms):                    5882.81   
P99 E2E Latency (ms):                    10284.57  
---------------Time to First Token----------------
Mean TTFT (ms):                          198.80    
Median TTFT (ms):                        162.42    
P99 TTFT (ms):                           818.03    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.04     
Median TPOT (ms):                        12.36     
P99 TPOT (ms):                           28.52     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           11.84     
Median ITL (ms):                         8.75      
P95 ITL (ms):                            34.76     
P99 ITL (ms):                            35.04     
Max ITL (ms):                            45.31     
==================================================

after PR:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     256       
Benchmark duration (s):                  671.34    
Total input tokens:                      90070     
Total input text tokens:                 90070     
Total generated tokens:                  50148     
Total generated tokens (retokenized):    50138     
Request throughput (req/s):              0.38      
Input token throughput (tok/s):          134.16    
Output token throughput (tok/s):         74.70     
Peak output token throughput (tok/s):    115.00    
Peak concurrent requests:                4         
Total token throughput (tok/s):          208.86    
Concurrency:                             1.00      
Accept length:                           2.98      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2621.29   
Median E2E Latency (ms):                 1735.10   
P90 E2E Latency (ms):                    6057.46   
P99 E2E Latency (ms):                    10436.05  
---------------Time to First Token----------------
Mean TTFT (ms):                          312.83    
Median TTFT (ms):                        272.58    
P99 TTFT (ms):                           1264.48   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.03     
Median TPOT (ms):                        12.37     
P99 TPOT (ms):                           28.48     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           11.84     
Median ITL (ms):                         8.75      
P95 ITL (ms):                            34.74     
P99 ITL (ms):                            35.04     
Max ITL (ms):                            45.79     
==================================================

@narutolhy (Contributor, Author)

Hi @cs-cat, thanks for testing this. I'll try to reproduce it and take a look. With piecewise CUDA graphs the first few runs tend to be slower and need a thorough warmup, which might explain the gap; I'll capture a profile to check.
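
For anyone reproducing this, a minimal warmup sketch to run before the timed benchmark; it assumes the native /generate endpoint and the port from the server command above:

import requests

# A few untimed requests let PCG finish graph capture/compilation
# before the measured run starts.
for _ in range(8):
    requests.post(
        "http://localhost:30001/generate",
        json={"text": "warmup", "sampling_params": {"max_new_tokens": 32}},
        timeout=120,
    )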

@cs-cat (Contributor) commented Apr 16, 2026

Hi @narutolhy, I tested this on a newer commit (with HiCache disabled due to a known bug) and found that this PR gives a performance improvement.
The previous tests were conducted on commit 0011d2a plus some performance patches/bugfixes ported from main/PR; the earlier degradation was likely caused by other issues or noise.

commit a4cf2ea1:
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     256       
Benchmark duration (s):                  667.91    
Total input tokens:                      90070     
Total input text tokens:                 90070     
Total generated tokens:                  50148     
Total generated tokens (retokenized):    50135     
Request throughput (req/s):              0.38      
Input token throughput (tok/s):          134.85    
Output token throughput (tok/s):         75.08     
Peak output token throughput (tok/s):    115.00    
Peak concurrent requests:                4         
Total token throughput (tok/s):          209.93    
Concurrency:                             1.00      
Accept length:                           2.96      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2608.07   
Median E2E Latency (ms):                 1758.14   
P90 E2E Latency (ms):                    6044.83   
P99 E2E Latency (ms):                    10340.22  
---------------Time to First Token----------------
Mean TTFT (ms):                          292.48    
Median TTFT (ms):                        255.93    
P99 TTFT (ms):                           1202.88   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.11     
Median TPOT (ms):                        12.37     
P99 TPOT (ms):                           32.81     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           11.88     
Median ITL (ms):                         8.73      
P95 ITL (ms):                            34.71     
P99 ITL (ms):                            34.91     
Max ITL (ms):                            53.48     
==================================================

with PR:
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     256       
Benchmark duration (s):                  639.93    
Total input tokens:                      90070     
Total input text tokens:                 90070     
Total generated tokens:                  50148     
Total generated tokens (retokenized):    50135     
Request throughput (req/s):              0.40      
Input token throughput (tok/s):          140.75    
Output token throughput (tok/s):         78.36     
Peak output token throughput (tok/s):    115.00    
Peak concurrent requests:                5         
Total token throughput (tok/s):          219.11    
Concurrency:                             1.00      
Accept length:                           2.96      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2498.69   
Median E2E Latency (ms):                 1616.84   
P90 E2E Latency (ms):                    5964.01   
P99 E2E Latency (ms):                    10239.41  
---------------Time to First Token----------------
Mean TTFT (ms):                          181.40    
Median TTFT (ms):                        151.64    
P99 TTFT (ms):                           797.77    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.13     
Median TPOT (ms):                        12.39     
P99 TPOT (ms):                           32.86     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           11.89     
Median ITL (ms):                         8.73      
P95 ITL (ms):                            34.77     
P99 ITL (ms):                            34.94     
Max ITL (ms):                            38.06     
==================================================

@narutolhy (Contributor, Author)

Hi @cs-cat, thank you again. I will try to merge it soon.

@narutolhy (Contributor, Author)

/rerun-failed-ci

@Oasis-Git (Collaborator) left a comment


A small modification comment; otherwise it looks good to me.

Comment thread: python/sglang/srt/server_args.py (Outdated)
@Oasis-Git (Collaborator)

CI Pass

@ispobock ispobock merged commit 5fa0c6a into sgl-project:main Apr 17, 2026
60 of 87 checks passed
jmamou pushed a commit to jmamou/sglang that referenced this pull request Apr 20, 2026
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request Apr 27, 2026
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Chen-0210 (Contributor) commented May 8, 2026

Hi @narutolhy, did your MTP+PCG benchmark run with --enable-return-hidden-states to enable the full capture mode? I noticed that the default value of capture_hidden_mode in PCG capture is NULL, so does that mean PCG is not actually exercised by default when MTP is enabled?

if forward_batch.capture_hidden_mode != self.capture_hidden_mode:
    return False
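
To illustrate the mismatch being asked about (the enum values follow sglang's CaptureHiddenMode naming, but treat the exact members and defaults as assumptions):

from enum import Enum, auto

class CaptureHiddenMode(Enum):
    NULL = auto()  # PCG's default at capture time: no hidden states kept
    LAST = auto()  # keep only the last token's hidden state
    FULL = auto()  # keep all hidden states, e.g. for a draft model pass

pcg_graph_mode = CaptureHiddenMode.NULL  # baked in when the graph is captured
mtp_batch_mode = CaptureHiddenMode.FULL  # what an MTP verify batch may request

# The quoted can_run() check then fails, so PCG is silently bypassed:
print(mtp_batch_mode != pcg_graph_mode)  # True -> can_run() returns False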

@narutolhy (Contributor, Author)

Hi @Chen-0210
You're right that with default flags, the MTP+PCG run isn't actually exercising PCG on the target prefill — my earlier numbers need to be re-checked under that lens. Sorry for the noise.

I'll work on a follow-up to make this actually work with MTP.


Labels

deepseek · documentation (Improvements or additions to documentation) · run-ci · speculative-decoding

9 participants