[PP] Refactor Async Pipeline Parallelism #19582
yuan-luo wants to merge 1 commit into sgl-project:main from
Conversation
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refactors the asynchronous pipeline parallelism mechanism to introduce true asynchronous communication, replacing a previous "fake" async implementation, by introducing non-blocking send and receive operations (isend_tensor_dict and irecv_tensor_dict).
Activity
/tag-and-rerun-ci
Code Review
This pull request refactors the pipeline parallelism mechanism to be truly asynchronous, aiming to improve performance by overlapping communication and computation. The changes introduce isend_tensor_dict and irecv_tensor_dict for non-blocking communication, and the scheduler logic is updated accordingly. The new asynchronous API is well-designed and clear. However, I've identified two critical bugs in scheduler_pp_mixin.py where the return value of the new asynchronous receive function is not handled correctly. This would likely break the pipeline parallelism for disaggregated prefill and decode modes. I have provided specific comments with code suggestions to address these issues. After fixing these bugs, this refactoring should deliver the intended performance improvements.
Force-pushed 4a2b2ae to 2e10e94
/rerun-failed-ci
2 similar comments
Force-pushed 2e10e94 to 703205b
/rerun-failed-ci
3 similar comments
/rerun-failed-ci
1 similar comment
Force-pushed 703205b to 148cf53
I have seen many PP bug reports after the Spring Festival. I think these bugs might be related to the async communication after the CP refactor, some modifications to the scheduler, and some race conditions. I have merged a fix PR #20341 that might address some of them. Let's wait and see whether any more bugs pop up; if not, we can optimize the performance further. Let's make it stable first. Also, I think we don't need to change the send part; just focus on parallelizing the recv part.
We can come back to this PR later if no further bugs have been reported.
/rerun-failed-ci
@ShangmingCai Sure, let's come back when async PP is more stable. |
Force-pushed 148cf53 to a33bc95
/tag-and-rerun-ci
Motivation
Currently, the async-mode pipeline parallelism mechanism is a sort of "fake" async: in recv_tensor_dict, each tensor is immediately waited on after calling irecv, resulting in serial reception and preventing overlap of communication and computation. This PR refactors it into an authentic async PP mechanism with the following design:
- Introduce isend_tensor_dict: issue all isend calls and return a List[P2PWork]. The key addition is tensor.record_stream(), which prevents CUDA tensors from being released prematurely (this exists in vLLM but was missing from the original sglang).
- Rewrite send_tensor_dict: turn it into a simple wrapper around isend_tensor_dict + wait, maintaining API compatibility.
- Introduce irecv_tensor_dict: initiate all irecv calls at once, returning a tuple of (tensor_dict, handles, postprocess). The postprocess callback is used to reconstruct the result after all_gather.
- Rewrite recv_tensor_dict: turn it into a wrapper that combines irecv_tensor_dict + wait + postprocess.
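The wrapper design above can be sketched in plain Python. This is a hypothetical shape sketch, not the actual sglang implementation: `FakeWork` and `Channel` are invented stand-ins for the `P2PWork` handles and the point-to-point link that torch.distributed's `isend`/`irecv` would provide, and the real code operates on CUDA tensors (with `record_stream()`) rather than plain values.

```python
from typing import Any, Callable, Dict, List, Tuple

class FakeWork:
    """Stand-in for a P2PWork handle returned by a non-blocking isend/irecv."""
    def __init__(self, on_wait: Callable[[], None]):
        self._on_wait = on_wait
        self.done = False

    def wait(self):
        self._on_wait()
        self.done = True

class Channel:
    """Toy in-process channel standing in for a pipeline-stage P2P link."""
    def __init__(self):
        self._buf: Dict[str, Any] = {}

    def isend_tensor_dict(self, tensor_dict: Dict[str, Any]) -> List[FakeWork]:
        # Issue one non-blocking send per tensor and return all handles.
        # Real CUDA code would also call tensor.record_stream() here so the
        # memory stays alive until the NCCL send completes.
        return [FakeWork(lambda k=k, v=v: self._buf.__setitem__(k, v))
                for k, v in tensor_dict.items()]

    def send_tensor_dict(self, tensor_dict: Dict[str, Any]) -> None:
        # Blocking wrapper: isend_tensor_dict + wait on every handle.
        for work in self.isend_tensor_dict(tensor_dict):
            work.wait()

    def irecv_tensor_dict(self, keys) -> Tuple[Dict[str, Any], List[FakeWork], Callable]:
        # Initiate all receives at once; postprocess reconstructs the dict
        # once every handle has completed.
        out: Dict[str, Any] = {}
        works = [FakeWork(lambda k=k: out.__setitem__(k, self._buf[k])) for k in keys]
        return out, works, lambda: dict(out)

    def recv_tensor_dict(self, keys) -> Dict[str, Any]:
        # Blocking wrapper: irecv_tensor_dict + wait + postprocess.
        _, works, postprocess = self.irecv_tensor_dict(keys)
        for work in works:
            work.wait()
        return postprocess()

ch = Channel()
ch.send_tensor_dict({"hidden": [1.0, 2.0], "residual": [3.0]})
print(ch.recv_tensor_dict(["hidden", "residual"]))
# {'hidden': [1.0, 2.0], 'residual': [3.0]}
```

The point of the split API is that a caller can hold the `works` list from `irecv_tensor_dict`, run compute for the current microbatch, and only call `wait()` when the received tensors are actually needed, which is what lets communication overlap computation.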
Performance improved by 5-7%. A more comprehensive performance test will follow.
Modifications
Accuracy Tests
No accuracy drop:
Server:
Client:
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci