
Enable Pipeline Parallelism support for Piecewise CUDA Graph #14515 (#14547)

Open
baonudesifeizhai wants to merge 28 commits into sgl-project:main from baonudesifeizhai:feature/piecewise-cuda-graph-pp-support

Conversation

@baonudesifeizhai
Contributor

@baonudesifeizhai baonudesifeizhai commented Dec 6, 2025

Motivation

Previously, piecewise CUDA graph was explicitly disabled when pipeline parallelism (PP) was enabled. This PR enables PP support for piecewise CUDA graph, allowing users to benefit from both optimizations simultaneously.

#14515

Modifications

  • Removed the PP size check that disabled piecewise CUDA graph in ModelRunner.can_run_piecewise_cuda_graph()
  • Added PP proxy tensors buffer initialization in PiecewiseCudaGraphRunner.__init__()
  • Added PP proxy tensors handling in PiecewiseCudaGraphRunner.capture_one_batch_size() for graph capture
  • Added PP proxy tensors handling in PiecewiseCudaGraphRunner.replay() to properly pass tensors from pre-allocated buffers during replay
  • Cached model signature check to avoid expensive inspect.signature() calls in the hot path
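The buffer-based proxy-tensor handling described above can be sketched roughly as follows. This is a dependency-free illustration, not the actual sglang code: tensors are modeled as plain lists, and the class and method names (`PPProxyBuffers`, `copy_in`, `view`) are hypothetical.

```python
# Illustrative sketch: PP proxy tensors must live in static, pre-allocated
# buffers so that CUDA graph replay always sees the same memory addresses.
# Plain lists stand in for torch tensors to keep the example self-contained.

class PPProxyBuffers:
    """Pre-allocated buffers for the hidden_states/residual tensors that
    pipeline-parallel stages pass between each other."""

    def __init__(self, max_num_tokens, hidden_dim):
        # Allocated once at init; graph capture records these addresses.
        self.buffers = {
            "hidden_states": [[0.0] * hidden_dim for _ in range(max_num_tokens)],
            "residual": [[0.0] * hidden_dim for _ in range(max_num_tokens)],
        }

    def copy_in(self, proxy_tensors):
        # On replay, copy the incoming stage inputs into the static buffers
        # in place (analogous to torch.Tensor.copy_); never rebind them.
        for name, value in proxy_tensors.items():
            buf = self.buffers[name]
            for i, row in enumerate(value):
                buf[i][:] = row

    def view(self, num_tokens):
        # The graph was captured over the full buffer; callers get a
        # sliced view covering only the tokens in the current batch.
        return {name: buf[:num_tokens] for name, buf in self.buffers.items()}


bufs = PPProxyBuffers(max_num_tokens=4, hidden_dim=2)
bufs.copy_in({"hidden_states": [[1.0, 2.0]], "residual": [[0.5, 0.5]]})
stage_input = bufs.view(1)
```

The key design point is that replay copies data into the captured buffers rather than passing fresh tensors, since a CUDA graph replays against fixed device pointers.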

Accuracy Tests

https://paste.ubuntu.com/p/KWKwszGRvd/
Results: (screenshot)

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @baonudesifeizhai, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates Pipeline Parallelism (PP) support into the existing piecewise CUDA graph functionality. Previously, these two performance optimizations were mutually exclusive. The changes involve removing the explicit disabling of piecewise CUDA graphs when PP is active, and implementing the necessary mechanisms within the PiecewiseCudaGraphRunner to correctly manage and pass proxy tensors required for PP during both CUDA graph capture and replay, thereby allowing users to leverage both optimizations simultaneously for improved performance.

Highlights

  • Enabled Pipeline Parallelism (PP) for Piecewise CUDA Graphs: The core change allows these two optimization techniques to work together, which was previously disabled.
  • Removed PP Size Check: The explicit check in ModelRunner.can_run_piecewise_cuda_graph() that prevented piecewise CUDA graphs from running with pp_size > 1 has been removed.
  • Initialized PP Proxy Tensors Buffer: A buffer for pp_proxy_tensors (hidden states and residual) is now initialized in PiecewiseCudaGraphRunner.__init__() when PP is enabled.
  • Handled PP Proxy Tensors during Graph Capture: Logic was added in PiecewiseCudaGraphRunner.capture_one_batch_size() to create and pass pp_proxy_tensors during the CUDA graph capture phase.
  • Managed PP Proxy Tensors during Replay: The PiecewiseCudaGraphRunner.replay() method now properly handles and passes pp_proxy_tensors from pre-allocated buffers to the model's forward pass.
  • Cached Model Signature Check: An expensive inspect.signature() call to check for pp_proxy_tensors parameter in model.forward is now cached to improve performance in hot paths.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request enables pipeline parallelism support for piecewise CUDA graphs, which is a significant enhancement. The changes correctly handle pp_proxy_tensors during graph capture and replay by using a pre-allocated buffer, and include a nice performance optimization by caching the model signature check. However, I've identified a critical issue that will cause a TypeError at runtime, and a minor inconsistency in the handling of mrope_positions that could affect maintainability and resource usage. Please see the detailed comments for suggestions on how to address these points. Overall, great work on tackling this complex feature.

Comment thread: python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py (outdated)
Comment on lines +242 to +244

```python
self.mrope_positions = torch.zeros(
    (3, self.max_num_tokens), dtype=torch.int64
)
```
Contributor


medium

The initialization of self.mrope_positions has been moved out of the if self.is_multimodal: block. While this might be intentional if mrope is used by non-multimodal models, the comment on line 234 now becomes misleading as it states mrope_positions is only for multimodal models. Additionally, other parts of the code still check self.is_multimodal before using mrope_positions (e.g., lines 354, 439). This creates an inconsistency.

To improve clarity and correctness, please either:

  1. Move the initialization back inside the if self.is_multimodal: block if it's only for multimodal models.
  2. If it's used more broadly, update the comment on line 234 and change the checks from self.is_multimodal to a more appropriate condition (e.g., self.model_runner.model_is_mrope).
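Option 2 above could look roughly like this. The attribute name `model_is_mrope` is taken from the comment itself and is an assumption about the runner's API; plain lists stand in for the torch buffer.

```python
# Sketch: gate the mrope_positions buffer on the condition that matches its
# actual use (mrope models), rather than on is_multimodal.

class RunnerInitSketch:
    def __init__(self, model_runner, max_num_tokens):
        self.max_num_tokens = max_num_tokens
        # Stand-in for torch.zeros((3, max_num_tokens), dtype=torch.int64);
        # only allocated for models that actually use mrope positions.
        if getattr(model_runner, "model_is_mrope", False):
            self.mrope_positions = [[0] * max_num_tokens for _ in range(3)]
        else:
            self.mrope_positions = None


class _MRopeRunner:
    model_is_mrope = True  # hypothetical flag for illustration


r = RunnerInitSketch(_MRopeRunner(), 8)
```

Gating on one condition everywhere (allocation and use) avoids the inconsistency where the buffer exists but the `is_multimodal` checks skip it, or vice versa.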

baonudesifeizhai and others added 2 commits December 6, 2025 15:11
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Collaborator

@ispobock ispobock left a comment


Could you add a unit test?

@baonudesifeizhai
Contributor Author

`python3 test_piecewise_cuda_graph.py TestPiecewiseCudaGraphWithPP.test_gsm8k_accuracy` — pass

@ispobock
Collaborator

Could you fix the lint issue?

@baonudesifeizhai
Contributor Author

> Could you fix the lint issue?

Already fixed!

@ispobock
Collaborator

ispobock commented Dec 13, 2025

/tag-and-rerun-ci

@ispobock
Collaborator

@baonudesifeizhai Could you resolve the conflict?

@baonudesifeizhai
Contributor Author

`python -m pytest test/srt/test_piecewise_cuda_graph_2_gpu.py::TestPiecewiseCudaGraphWithPP -v` — passed

Collaborator

@Oasis-Git Oasis-Git left a comment


In general the functionality is good, but before merging we need a few code modifications.

```python
"Disable piecewise CUDA graph because piecewise_cuda_graph does not support PP",
)
return False
# PP support is now enabled for piecewise CUDA graph
```
Collaborator


remove this line

```python
self.mrope_positions = torch.zeros(
    (3, self.max_num_tokens), dtype=torch.int64
)
```
Collaborator


Does PP need mrope? If not, allocate the buffer only when is_multimodal is True.

```python
def replay_prepare(
    self,
    forward_batch: ForwardBatch,
    pp_proxy_tensors: Optional[PPProxyTensors] = None,
```
Collaborator


Since you do not use pp_proxy_tensors here, please remove it.

```python
with enable_piecewise_cuda_graph(), disable_ca_comm(self.model_runner.tp_group):
    self.model_runner.attn_backend.init_forward_metadata(forward_batch)
    static_forward_batch = self.replay_prepare(forward_batch, **kwargs)
    # Extract pp_proxy_tensors from kwargs if present (avoid in-place modification)
```
Collaborator


With the original replay_prepare, there is no need to split pp_proxy_tensors out of kwargs.

```python
# Note: piecewise captures with bs=1, but we need buffer for PP proxy tensors
# The buffer size is 1 since we capture with batch_size=1
self.pp_proxy_tensors_buffer = {
    "hidden_states": torch.zeros(
```
Collaborator


Why do you need self.pp_proxy_tensors_buffer, and why is the shape [1, hidden_dim] instead of [num_tokens, hidden_dim] here?

Contributor Author


changed
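The shape concern discussed in this thread can be sketched as follows. The helper name and sizes are illustrative, and plain lists stand in for torch tensors.

```python
# Sketch: piecewise capture may run with a small token count, but replay can
# see up to max_num_tokens tokens, so the PP proxy buffer must be sized for
# the worst case, not for the capture batch.

def make_pp_proxy_buffer(max_num_tokens, hidden_dim):
    # Stand-in for torch.zeros((max_num_tokens, hidden_dim), ...).
    return [[0.0] * hidden_dim for _ in range(max_num_tokens)]


buf = make_pp_proxy_buffer(max_num_tokens=8, hidden_dim=4)
# buf has shape [max_num_tokens, hidden_dim], so any replay batch fits.
```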

@baonudesifeizhai
Contributor Author

baonudesifeizhai commented Dec 20, 2025

```shell
python -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-7B-Instruct \
    --enable-piecewise-cuda-graph \
    --pipeline-parallel-size 2 \
    --tp 4 \
    --trust-remote-code \
    --port 30000
```

(screenshot)

`python test/srt/test_piecewise_cuda_graph_2_gpu.py TestPiecewiseCudaGraphWithPP` — pass

@Oasis-Git
Collaborator

/tag-and-rerun-ci

@baonudesifeizhai
Contributor Author

Works normally after fixing the conflicts.


4 participants