Piecewise CUDA Graph Support & Torch Compile Backend #10062
ispobock merged 27 commits into sgl-project:main from
Conversation
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
| "enable_auto_functionalized_v2": False, | ||
| } | ||
|
|
||
| def configure_post_pass(self): |
You could drop the pass manager changes and the fix-functionalization pass (not required with auto_functionalized_v2) from this MR, and the piecewise backend would still work.
This would enable us to work in parallel. As far as the pass manager is concerned, it only needs to hook into inductor_config["post_grad_custom_post_pass"].
Ideally I would love to support the pass manager without CUDA graphs, for non-CUDA backends, so it would be good to keep these two things as isolated as possible.
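To illustrate the integration point the comment describes, here is a minimal pure-Python sketch of a pass manager that composes graph-rewrite passes into one callable and hooks it into an inductor-style config dict. All names here (`PassManager`, the mock "graph" as a list of op names) are illustrative, not SGLang's or vLLM's actual classes; a real implementation would rewrite a `torch.fx` graph.

```python
class PassManager:
    """Compose graph passes into a single callable (illustrative sketch)."""

    def __init__(self):
        self.passes = []

    def add(self, p):
        self.passes.append(p)
        return self

    def __call__(self, graph):
        # Apply each registered pass in order to the (mock) graph.
        for p in self.passes:
            graph = p(graph)
        return graph


# Mock "graph": a list of op names; real passes would rewrite an FX graph.
inductor_config = {}
pm = PassManager()
pm.add(lambda g: [op for op in g if op != "noop"])                         # dead-op elimination
pm.add(lambda g: ["fused_add_relu" if op == "add_relu" else op for op in g])  # fusion rewrite

# The only hook the pass manager needs, as the comment notes:
inductor_config["post_grad_custom_post_pass"] = pm

result = inductor_config["post_grad_custom_post_pass"](["matmul", "noop", "add_relu"])
```

Keeping the pass manager behind this single config key is what makes it separable from the CUDA graph machinery, so it can also serve non-CUDA backends.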
Yes. This part can be updated based on your PR.
# can reuse the memory pool allocated for the large shapes.
with freeze_gc():
    # Only rank 0 should print progress bar during capture
    self.cudagraph_batch_sizes = [512, 256, 128, 64, 32, 16, 8, 4, 2, 1]
Are these sizes for prefill only? AFAIK it doesn't take two graphs to mix decode and prefill; --enable-mixed-chunk does that in one forward.
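As a hedged sketch of why the capture sizes above are listed largest-first (so smaller graphs can reuse the memory pool allocated for large shapes), here is a pure-Python mock of how a runner might pad an incoming batch to the smallest captured size that fits, capture once per size, and replay afterwards. `MockGraphRunner` and its methods are illustrative names, not SGLang's real API.

```python
CAPTURE_SIZES = [512, 256, 128, 64, 32, 16, 8, 4, 2, 1]


class MockGraphRunner:
    """Illustrative capture/replay bookkeeping keyed by padded batch size."""

    def __init__(self, sizes):
        self.sizes = sorted(sizes)  # ascending, for smallest-fit lookup
        self.captured = set()       # padded sizes already "captured"
        self.capture_count = 0

    def padded_size(self, num_tokens):
        # Smallest captured size that can hold the batch; None -> run eager.
        for s in self.sizes:
            if s >= num_tokens:
                return s
        return None

    def run(self, num_tokens):
        size = self.padded_size(num_tokens)
        if size is None:
            return ("eager", num_tokens)    # real code: fall back to eager
        if size not in self.captured:
            self.capture_count += 1         # real code: CUDA graph capture
            self.captured.add(size)
        return ("replay", size)             # real code: graph replay


runner = MockGraphRunner(CAPTURE_SIZES)
runner.run(3)    # pads to 4 and captures a size-4 graph
runner.run(4)    # replays the already-captured size-4 graph
runner.run(600)  # exceeds the largest captured size -> eager fallback
```

The real runner replays a recorded CUDA graph rather than re-calling a function; the point of the sketch is only the pad-to-nearest-captured-size policy.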
@@ -326,10 +327,18 @@ def __init__(
        self.qr_comm: Optional[QuickAllReduce] = None
        if use_custom_allreduce and self.world_size > 1:
            # Initialize a custom fast all-reduce implementation.
This change can cause an allreduce kernel performance drop. For example:
python3 -m sglang.launch_server --model meta-llama/Llama-3.3-70B-Instruct --tp 8 --port 30000 --enable-piecewise-cuda-graph --piecewise-cuda-graph-max-tokens 8192
python3 -m sglang.launch_server --model meta-llama/Llama-3.3-70B-Instruct --tp 8
python3 -m sglang.bench_serving --backend sglang-oai --dataset-name random --random-input-len 4096 --random-output-len 20 --random-range-ratio 1 --num-prompts 10 --max-concurrency 1 --warmup-requests 3 --profile
The allreduce kernel time increases from 370us to 461us in prefill.
Yes, we hope to set this logic in
I had the same error. After adding the following code, the issue was resolved for me.
@yansiyu550 Actually, this modification is not correct. I tested it on the dense Qwen3-4B model and it works fine, but for the MoE model the results differ significantly from those using FlashAttention. I suspect the issue is related to the input format of the FlashInfer operators. I'm still working on the fix.
Thanks for your reply! My benchmark is actually prefill-heavy, with input_len=2048 and output_len=128. In addition, when I tried TP=2 (two GPUs), the server failed to start properly.
I ran into the same situation — this change works fine for the dense model, but it causes problems when running the MoE model.
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
I noticed that P nodes in PD disaggregation automatically disable CUDA graph. But shouldn't piecewise CUDA graph actually help with prefill performance on P nodes? Is this supported yet, or are there technical blockers?
@Oasis-Git for future reference: https://peps.python.org/pep-0440
>>> "2.10" < "2.6"
True
Fixed in #15682
PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify (ForwardMode.TARGET_VERIFY) via decode CUDA graphs

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this as a conservative safety measure when PCG became default-enabled). The original PCG implementation (sgl-project#10062) had no speculative restriction.

Benchmark (Qwen3.5-35B-A3B FP8, TP2, H100, extra_buffer + PCG + NEXTN):
- ITL: 22ms → 2.89ms (-87%) from MTP + overlap schedule
- TTFT (rate=5): 253ms → 147ms (-42%) from PCG prefill acceleration
- MTP acceptance: 3.29 tokens/step

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify (ForwardMode.TARGET_VERIFY) via decode CUDA graphs

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this as a conservative safety measure when PCG became default-enabled). The original PCG implementation (sgl-project#10062) had no speculative restriction.

Accuracy (Qwen3.5-35B-A3B FP8, TP2, H100, GSM8K 50q, --reasoning-parser qwen3):
- Baseline (extra_buffer): 0.980
- MTP alone: 0.980, acceptance=3.44
- PCG + MTP: 0.980, acceptance=3.46

Benchmark (extra_buffer + PCG + NEXTN):
- ITL: 22ms → 2.89ms (-87%)
- TTFT (rate=5): 253ms → 147ms (-42%) from PCG prefill acceleration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify (ForwardMode.TARGET_VERIFY) via decode CUDA graphs

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this as a conservative safety measure when PCG became default-enabled). The original PCG implementation (sgl-project#10062) had no speculative restriction.

Accuracy verification (GSM8K, 50 questions per config):
- EAGLE/NEXTN: PCG=0.980 vs no-PCG=0.980 (identical)
- EAGLE3: PCG=1.000, acceptance=4.23
- STANDALONE: PCG=0.380 vs no-PCG=0.400 (within noise, 1q/50)

Benchmark (Qwen3.5-35B-A3B FP8, TP2, H100, extra_buffer + PCG + NEXTN):
- ITL: 22ms → 2.89ms (-87%) from MTP + overlap schedule
- TTFT (rate=5): 253ms → 147ms (-42%) from PCG prefill acceleration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify uses decode CUDA graphs or eager execution

Key safety guard: PCG's can_run() now explicitly rejects TARGET_VERIFY mode, since PCG graphs are captured with EXTEND/spec_info=None and must not be replayed for verify batches that have different spec_info and capture_hidden_mode.

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this as a conservative safety measure when PCG became default-enabled). The original PCG implementation (sgl-project#10062) had no speculative restriction.

Accuracy verification (GSM8K, 50 questions per config):
- EAGLE/NEXTN: PCG=0.980 vs no-PCG=0.980 (identical)
- EAGLE3: PCG=1.000, acceptance=4.23
- STANDALONE: PCG=0.380 vs no-PCG=0.400 (within noise, 1q/50)

Benchmark (Qwen3.5-35B-A3B FP8, TP2, H100, extra_buffer + PCG + NEXTN):
- ITL: 22ms → 2.89ms (-87%) from MTP + overlap schedule
- TTFT (rate=5): 253ms → 147ms (-42%) from PCG prefill acceleration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify uses decode CUDA graphs or eager execution

Key safety guard: PCG's can_run() now explicitly rejects TARGET_VERIFY mode, since PCG graphs are captured with EXTEND/spec_info=None and must not be replayed for verify batches that have different spec_info and capture_hidden_mode.

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this as a conservative safety measure when PCG became default-enabled). The original PCG implementation (sgl-project#10062) had no speculative restriction.

Accuracy verification (GSM8K, 50 questions per config):
- EAGLE/NEXTN: PCG=0.980 vs no-PCG=0.980 (identical)
- EAGLE3: PCG=1.000, acceptance=4.23
- STANDALONE: PCG=0.380 vs no-PCG=0.400 (within noise, 1q/50)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify uses decode CUDA graphs or eager execution

Key safety guard: PCG's can_run() now explicitly rejects TARGET_VERIFY mode, since PCG graphs are captured with EXTEND/spec_info=None and must not be replayed for verify batches that have different spec_info and capture_hidden_mode.

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this as a conservative safety measure when PCG became default-enabled). The original PCG implementation (sgl-project#10062) had no speculative restriction.

Accuracy verification (GSM8K, 50 questions per config):
- EAGLE/NEXTN: PCG=0.970, acceptance=3.44
- EAGLE3: PCG=1.000, acceptance=4.23 (verified on TP2)
- STANDALONE: PCG=0.840, acceptance=3.24

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Motivation
This PR introduces support for piecewise CUDA graphs and a torch.compile backend in SGLang.
This feature is inspired by the implementation in vLLM v1, which partitions the full inference graph into smaller sub-graphs at attention operations. This approach offers several potential advantages over full-graph capture:
Improved Prefill Performance: SGLang's current native prefill can incur kernel launch overhead. Capturing operations in a piecewise graph reduces this overhead and improves performance, especially for short inputs.
Enables Advanced Attention Operations: This implementation allows complex or external operations within the attention mechanism, such as layerwise KV cache loading and storing during decoding, which are not possible with a full CUDA graph.
Prefill-Decode Mixture: With piecewise CUDA graphs, prefill and decode operations can be mixed within the same running batch.
Based on the discussion above, we introduce piecewise CUDA graph as optional support in SGLang. Our implementation is primarily adapted from the vLLM open-source project, and we extend our credit to their team for the foundational work.
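The partitioning idea described above can be sketched in pure Python: split a linear sequence of ops at attention boundaries, so that everything except attention is grouped into capturable pieces while attention itself runs outside the captured graphs. This is a conceptual mock with illustrative op names; the real backend partitions a torch.fx graph, not a list of strings.

```python
ATTENTION_OPS = {"radix_attention"}  # illustrative; the real check matches custom attention ops


def split_at_attention(ops):
    """Partition an op sequence into capturable pieces and eager attention pieces."""
    pieces, current = [], []
    for op in ops:
        if op in ATTENTION_OPS:
            if current:
                pieces.append(("capturable", current))
            pieces.append(("eager", [op]))  # attention stays outside captured graphs
            current = []
        else:
            current.append(op)
    if current:
        pieces.append(("capturable", current))
    return pieces


# One mock transformer layer: ops before and after attention become
# separately capturable sub-graphs.
layer = ["layernorm", "qkv_proj", "radix_attention", "o_proj", "mlp"]
pieces = split_at_attention(layer)
```

Each "capturable" piece can then be captured and replayed as its own CUDA graph, while the "eager" attention piece is free to use dynamic operations such as layerwise KV cache loading.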
Modifications
The main modifications include:

- Custom `torch.compile` Backend: Introduced `SGLangBackend` to manage FX graph compilation and CUDA graph execution, located in `sglang/python/srt/model_executor/compilation`.
- Piecewise CUDA Graph Runner: Added `PiecewiseCudagraphRunner` to handle the segmented capture and replay of CUDA graphs for model execution.
- Radix Attention Integration: Wrapped the Radix Attention kernel and registered it as a custom `torch.ops` operator to make it compatible with `torch.compile`.

Accuracy Tests
GSM8K test results for the Qwen3-8B model on H100
Benchmarking and Profiling
Profiling
Here we provide the profiling results of the piecewise CUDA graph implementation:
Prefill-Only Benchmark
Here we provide benchmark results for different model sizes in a prefill-only scenario on H100.
The script:
E2E Benchmark
GSM8K test results for the Qwen3-8B model on H100
Checklist