
Piecewise CUDA Graph Support & Torch Compile Backend#10062

Merged
ispobock merged 27 commits into sgl-project:main from Oasis-Git:compile
Oct 12, 2025

Conversation

@Oasis-Git
Collaborator

@Oasis-Git Oasis-Git commented Sep 5, 2025

Motivation

This PR introduces support for piecewise CUDA graphs and a torch.compile backend in SGLang.

This feature is inspired by the implementation in vLLM v1, which partitions the full inference graph into smaller sub-graphs split at the attention operations. This approach offers several potential advantages over full-graph capture:

  • Improved Prefill Performance: SGLang's current native prefill can incur kernel launch overhead. By capturing operations in a piecewise graph, we can reduce this overhead and improve performance, especially for short input lengths.

  • Enables Advanced Attention Operations: This implementation allows complex or external operations within the attention mechanism, such as layerwise KV cache loading and storing during decoding, which are not possible with a full CUDA graph.

  • Prefill-Decode Mixture: With a piecewise CUDA graph, prefill and decode operations can be mixed within the same running batch and executed as a single forward pass.

Based on the discussion above, we introduce the piecewise CUDA graph as optional support for SGLang. Our implementation is primarily adapted from the vLLM open-source project, and we credit their team for the foundational work.
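
To make the partitioning concrete, here is a minimal, self-contained sketch of splitting an FX graph at attention call sites, so that each surrounding piece can be captured as a CUDA graph while attention itself runs outside the captured graphs. TinyModel and the partition callback are illustrative assumptions, not SGLang's actual code:

# Sketch: split a traced graph into pieces at attention call sites.
import torch
import torch.fx as fx
from torch.fx.passes.split_module import split_module

class TinyModel(torch.nn.Module):
    def forward(self, q, k, v):
        q = q * 2.0  # partition 0: capturable piece
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)  # partition 1: split point
        return out + 1.0  # partition 2: next capturable piece

model = TinyModel()
gm = fx.symbolic_trace(model)

state = {"part": 0}

def assign_partition(node):
    # Give each attention call its own partition; everything between two
    # attention calls stays together in one capturable piece.
    if node.op == "call_function" and "attention" in str(node.target):
        state["part"] += 1
        attn_part = state["part"]
        state["part"] += 1
        return attn_part
    return state["part"]

pieces = split_module(gm, model, assign_partition)
print(pieces.graph)  # submod_0 -> submod_1 (attention) -> submod_2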

Modifications

The main modification includes:

  • Custom torch.compile Backend: Introduced SGLangBackend to manage FX graph compilation and CUDA graph execution, located in sglang/python/srt/model_executor/compilation.

  • Piecewise CUDA Graph Runner: Added PiecewiseCudagraphRunner to handle the segmented capture and replay of CUDA graphs for model execution.

  • Radix Attention Integration: Wrapped the Radix Attention kernel and registered it as a custom torch.ops operator to make it compatible with torch.compile (see the sketch below).
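
As a hedged illustration of this last point, here is a minimal sketch using the PyTorch 2.4+ torch.library API; the sglang_demo namespace and the SDPA stand-in are assumptions for illustration, not the real Radix Attention kernel. A custom op registered this way stays opaque to torch.compile instead of being traced into:

# Sketch: wrap an attention call as a custom torch.ops operator.
# NOTE: illustrative stand-in, not SGLang's actual registration code.
import torch

@torch.library.custom_op("sglang_demo::attention", mutates_args=())
def demo_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Stand-in body for the real Radix Attention kernel.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)

@demo_attention.register_fake
def _(q, k, v):
    # Shape/dtype inference used during tracing; no real computation.
    return torch.empty_like(q)

@torch.compile
def layer(q, k, v):
    # torch.compile keeps this call as one opaque node, which is what
    # allows the graph to be split around attention for piecewise capture.
    return torch.ops.sglang_demo.attention(q, k, v)

q = k = v = torch.randn(1, 2, 4, 8)
out = layer(q, k, v)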

Accuracy Tests

GSM8K test outcome for the Qwen3-8B model on H100

# With Piecewise Cuda Graph
$ bash gsm8k.sh 
100%|████████████████████████| 200/200 [00:07<00:00, 26.07it/s]
Accuracy: 0.950
Invalid: 0.000
Latency: 7.752 s
Output throughput: 3076.427 token/s

# Without Piecewise Cuda Graph
$ bash gsm8k.sh 
100%|████████████████████████| 200/200 [00:11<00:00, 17.67it/s]
Accuracy: 0.950
Invalid: 0.000
Latency: 11.384 s
Output throughput: 2091.467 token/s

Benchmarking and Profiling

Profiling

Here we provide the profiling outcome of the piecewise CUDA graph implementation:

[Profiler trace: PiecewiseCudaGraph]

Prefill-Only Benchmark

Here we provide the benchmark outcome with different model sizes in a prefill-only scenario on H100.

The script:

python -m sglang.bench_one_batch --model-path Qwen/Qwen3-14B \
    --batch 1 --input-len 16 32 64 128 256 512 --output-len 1 \
    --enable-piecewise-cuda-graph

[Benchmark charts: 1B / 8B / 14B]

E2E Benchmark

GSM8K results for the Qwen3-8B model on H100; identical to the run shown under Accuracy Tests above.

Checklist

Comment thread python/sglang/srt/layers/radix_attention.py
"enable_auto_functionalized_v2": False,
}

def configure_post_pass(self):
Contributor

You could drop the pass manager changes and the fix-functionalization pass (not required with auto_functionalized_v2) from this MR, and the piecewise backend would still work.

This would enable us to work in parallel. As far as the pass manager is concerned, it only needs to hook into inductor_config["post_grad_custom_post_pass"].

Ideally I would love to support the pass manager without CUDA graphs, for non-CUDA backends, so it would be good to keep these two things as isolated as possible.
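
For context, a minimal sketch of hooking into that config key follows; the pass body is a placeholder, not the actual pass manager, and on newer PyTorch versions a cache-safe pass should subclass torch._inductor.custom_graph_pass.CustomGraphPass:

# Sketch: attach a custom post-grad pass through Inductor's config hook.
import torch

def my_post_pass(graph: torch.fx.Graph) -> None:
    # Receives the post-grad FX graph and may rewrite it in place,
    # e.g. pattern-matching and replacing custom ops.
    for node in graph.nodes:
        pass  # placeholder: inspect or transform nodes here

compiled = torch.compile(
    torch.nn.Linear(8, 8),
    options={"post_grad_custom_post_pass": my_post_pass},
)
out = compiled(torch.randn(2, 8))  # first call compiles and runs the pass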

Collaborator Author


Yes. This part can be updated based on your PR.

Oasis-Git and others added 3 commits September 11, 2025 08:28
# can reuse the memory pool allocated for the large shapes.
with freeze_gc():
# Only rank 0 should print progress bar during capture
self.cudagraph_batch_sizes = [512, 256, 128, 64, 32, 16, 8, 4, 2, 1]
Contributor


Are these sizes for prefill only? AFAIK it doesn't take two graphs to mix decode and prefill; --enable-mixed-chunk does that in one forward.

@Oasis-Git Oasis-Git changed the title [WIP] Piecewise CUDA Graph Support Piecewise CUDA Graph Support & Torch Compile Backend Oct 3, 2025
@Oasis-Git Oasis-Git marked this pull request as ready for review October 3, 2025 00:56
@@ -326,10 +327,18 @@ def __init__(
self.qr_comm: Optional[QuickAllReduce] = None
if use_custom_allreduce and self.world_size > 1:
# Initialize a custom fast all-reduce implementation.
Collaborator


This change can cause an allreduce kernel performance drop, for example:

python3 -m sglang.launch_server --model meta-llama/Llama-3.3-70B-Instruct --tp 8 --port 30000 --enable-piecewise-cuda-graph --piecewise-cuda-graph-max-tokens 8192
python3 -m sglang.launch_server --model meta-llama/Llama-3.3-70B-Instruct --tp 8 

python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 4096 --random-output-len 20 --random-range-ratio 1 --num-prompts 10 --max-concurrency 1 --warmup-requests 3 --profile

370us → 461us in prefill.

@Oasis-Git
Collaborator Author

Oasis-Git commented Oct 17, 2025

@Bruce-x-1997 Basically it should not. However, if you encounter any issues, please let me know, since piecewise CUDA graph support is still in an experimental period.

Thanks. I am trying to make piecewise CUDA graph support DeepSeek V3, but I am running into some graph-compilation bugs. I see you use fullgraph=True, but that hits a lot of graph breaks. When I set fullgraph=False instead, I get some data-return errors, such as the MLA k_rope being null. Do you have any idea how to address this? torch.compile with the normal CUDA graph works well, and it uses the default fullgraph=False. So I think I only need to change part of the piecewise CUDA graph procedure, but I am not familiar with the dynamic config in the piecewise CUDA graph.

Yes, DeepSeek V3 is the main model we are trying to support. We hope to support this model with the piecewise CUDA graph as soon as possible; the related code and benchmarks will be in a separate PR.
Here are some known issues we are working on to support the DS-V3 model: MLA support, EP kernel support, and some related RoPE issues. Supporting this model still needs much effort, and we hope to finish it within this week.

Could you share the related code as a patch or PR? I just use fused MoE, so MLA and MoE working is enough for me. And by the way, I think the CUDA graph might not need kernel changes. I see the CUDA graph works well at the decode stage, and I even supported MLA in the PR of 2d13b54; it could work at the prefill stage (it just could not support the radix cache).

It is not related to the CUDA graph only. Since piecewise CUDA graph support is mainly based on torch.compile, most of the issues actually fall on the torch.compile backend side. If you are interested, the PR will be released soon.

You mean you want to solve all graph breaks and still set fullgraph=True, right?

Yes, we hope to put this logic in fix_functionalization.py later.

@yansiyu550

@Zhiy-Zhang Hi, thanks for your help! Yes, the FlashInfer backend is not well tested, and this error has been reported previously. Basically, it can be fixed by providing the proper metadata in piecewise_cuda_graph. If you would like to contribute, please list the error in issue #11490 and assign yourself. We also welcome you to join the piecewise_cuda_graph channel on Slack for discussion!

I had the same error. After adding the following code, the issue was resolved for me.
[code screenshot]

@Zhiy-Zhang
Contributor

Zhiy-Zhang commented Oct 17, 2025

@yansiyu550 Actually, this modification is not correct. I tested it on the dense Qwen3-4B model and it works fine, but for the MoE model the results differ significantly from those using FlashAttention. I suspect the issue is related to the input format of the FlashInfer operators. I'm still working on the fix.

@yansiyu550

@yansiyu550 Hi, thanks for your benchmark. I think it is correct and acceptable for the following reasons:

  1. Most importantly, during torch.compile all custom_ops fall back to their native forward versions, which show an obvious performance drop compared with the specific kernel designs of the CUDA forward implementations, especially under heavy load.
  2. Your benchmark is relatively decode-heavy, so the prefill optimization contributes little.
  3. Our benchmarks show that for most models there is no improvement once the token count is >= 4096.

We are still working on better support for the piecewise CUDA graph, such as sgl_kernel support, and hopefully this feature can achieve better performance on your benchmark later.

Thanks for your reply!
Just to clarify — I did not enable --enable-torch-compile, but there was still no performance improvement when using --enable-piecewise-cuda-graph.

My benchmark is actually Prefill-heavy, with input_len=2048 and output_len=128.

In addition, when I tried TP=2 (two GPUs), the server failed to start properly.
Does this feature currently not support multi-GPU / tensor parallel configurations?

@yansiyu550

@yansiyu550 Actually, this modification is not correct. I tested it on the dense Qwen3-4B model and it works fine, but for the MoE model the results differ significantly from those using FlashAttention. I suspect the issue is related to the input format of the FlashInfer operators. I'm still working on the fix.

I ran into the same situation — this change works fine for the dense model, but it causes problems when running the MoE model.

lpc0220 pushed a commit to lpc0220/sglang that referenced this pull request Oct 29, 2025
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
@fjybiocs
Contributor

fjybiocs commented Nov 6, 2025

I noticed that P nodes in PD disaggregation automatically disable CUDA graph. But shouldn't the piecewise CUDA graph actually help with prefill performance on P nodes? Is this supported yet, or are there some technical blockers?

@EduardDurech
Contributor

@Oasis-Git for future reference https://peps.python.org/pep-0440

>>> "2.10"<"2.6"
True
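
A PEP 440-aware comparison, for example via the third-party packaging library, avoids the lexicographic pitfall:

# String comparison is lexicographic, so "2.10" < "2.6" is True.
# packaging.version parses versions per PEP 440 and compares numerically.
from packaging.version import Version

assert Version("2.10") > Version("2.6")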

Fixed in #15682

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Apr 5, 2026
PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify uses decode CUDA graphs or eager execution

Key safety guard: PCG's can_run() now explicitly rejects TARGET_VERIFY
mode, since PCG graphs are captured with EXTEND/spec_info=None and
must not be replayed for verify batches that have different spec_info
and capture_hidden_mode.

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this
as a conservative safety measure when PCG became default-enabled).
The original PCG implementation (sgl-project#10062) had no speculative restriction.

Accuracy verification (GSM8K, 50 questions per config):
  EAGLE/NEXTN: PCG=0.970, acceptance=3.44
  EAGLE3:      PCG=1.000, acceptance=4.23 (verified on TP2)
  STANDALONE:  PCG=0.840, acceptance=3.24

Benchmark (Qwen3.5-35B-A3B FP8, TP2, H100, extra_buffer + PCG + NEXTN):
- ITL: 22ms → 2.89ms (-87%) from MTP + overlap schedule
- TTFT (rate=5): 253ms → 147ms (-42%) from PCG prefill acceleration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>