Allow piecewise CUDA graph with speculative decoding#22128
Allow piecewise CUDA graph with speculative decoding#22128ispobock merged 12 commits intosgl-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the server arguments to allow the 'NEXTN' speculative decoding algorithm to work with piecewise CUDA graphs, as they do not conflict. The review feedback suggests using a set for checking compatible algorithms to enhance code maintainability and extensibility.
| if ( | ||
| self.speculative_algorithm is not None | ||
| and self.speculative_algorithm != "NEXTN" | ||
| ): | ||
| self.disable_piecewise_cuda_graph = True |
There was a problem hiding this comment.
For better maintainability and to make it easier to add other compatible speculative decoding algorithms in the future, consider using a set for the check. This makes the intent clearer and the code more extensible.
| if ( | |
| self.speculative_algorithm is not None | |
| and self.speculative_algorithm != "NEXTN" | |
| ): | |
| self.disable_piecewise_cuda_graph = True | |
| if ( | |
| self.speculative_algorithm is not None | |
| and self.speculative_algorithm not in {"NEXTN"} | |
| ): | |
| self.disable_piecewise_cuda_graph = True |
|
Closing: found that PCG + speculative decoding causes accuracy degradation (GSM8K 12.5% vs expected >75%). The original restriction was correct. PCG captures with spec_info=None but the model forward path behaves differently with speculative decoding active, causing incorrect outputs. |
|
Reopening: the accuracy issue was caused by missing Re-tested with proper reasoning parser on a fresh machine (2xH100):
All three configs produce identical accuracy. PCG and MTP are fully compatible. Updated the test to include |
e72639b to
ae676ac
Compare
ae676ac to
9112570
Compare
Extended Verification — All Speculative AlgorithmsTested PCG compatibility with every available speculative algorithm:
Conclusion: PCG has zero impact on accuracy across all tested speculative algorithms. The STANDALONE score drop (0.560→0.400) is from the weak draft model (Qwen3-0.6B), equally present with and without PCG. |
9a103bf to
4b89c5e
Compare
4b89c5e to
c1602df
Compare
c36a8f4 to
528f957
Compare
PCG and speculative decoding operate on independent forward paths: - PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None - Speculative: draft/verify uses decode CUDA graphs or eager execution Key safety guard: PCG's can_run() now explicitly rejects TARGET_VERIFY mode, since PCG graphs are captured with EXTEND/spec_info=None and must not be replayed for verify batches that have different spec_info and capture_hidden_mode. Previously all speculative algorithms disabled PCG (sgl-project#16331 added this as a conservative safety measure when PCG became default-enabled). The original PCG implementation (sgl-project#10062) had no speculative restriction. Accuracy verification (GSM8K, 50 questions per config): EAGLE/NEXTN: PCG=0.970, acceptance=3.44 EAGLE3: PCG=1.000, acceptance=4.23 (verified on TP2) STANDALONE: PCG=0.840, acceptance=3.24 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
528f957 to
2015039
Compare
|
/rerun-failed-ci |
8 similar comments
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
1 similar comment
|
/rerun-failed-ci |
|
Hi @narutolhy , I tested this PR on Qwen3.5-27B. However, there appears to be some performance degradation. benchmark command: before PR: after PR: |
|
hi @cs-cat |
|
Hi @narutolhy , I tested this on a newer commit (with HiCache disabled due to a known bug) and found that this PR resulted in a performance improvement. |
|
Hi @cs-cat Thank you again. I will try to merge it soon. |
|
/rerun-failed-ci |
Oasis-Git
left a comment
There was a problem hiding this comment.
A small modification comment. Otherwise good to me
|
CI Pass |
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Hi, @narutolhy. Did your MTP+pcg benchmark run with |
|
Hi @Chen-0210 I'll work on a follow-up to make this actually work with MTP. |
Summary
--enable-piecewise-cuda-graphto coexist with all speculative decoding algorithms (EAGLE/EAGLE3/NEXTN/STANDALONE/NGRAM)Motivation
PCG and speculative decoding operate on independent forward paths:
ForwardMode.EXTEND) withspec_info=NoneForwardMode.TARGET_VERIFY)The restriction was added in #16331 as a conservative safety measure when PCG became default-enabled. The original PCG implementation (#10062) had no speculative restriction.
Accuracy Verification (GSM8K, 50 questions per config)
STANDALONE's lower score (0.56→0.40) is from the weak draft model (Qwen3-0.6B), equally present with and without PCG.
Benchmark (Qwen3.5-35B-A3B FP8, TP2, H100)
--mamba-scheduler-strategy extra_bufferfor all configs.PCG adds prefill acceleration on top of MTP's decode speedup:
Modification
server_args.py: Removed the blanket disable of PCG for speculative decoding. Added comments explaining why PCG and speculative decoding are compatible (independent forward paths).Test plan
test/registered/piecewise_cuda_graph/test_pcg_with_mtp.py🤖 Generated with Claude Code