
[PP] Refactor Async Pipeline Parallelism#19582

Open
yuan-luo wants to merge 1 commit into sgl-project:main from antgroup:refactor_async_pp

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Feb 28, 2026

Motivation

Currently, the async-mode pipeline parallelism mechanism is only nominally async: in recv_tensor_dict, each tensor is waited on immediately after calling irecv, resulting in serial reception and preventing any overlap of communication and computation.

This PR refactors it into a genuinely asynchronous PP mechanism with the following design:

  • Introduce isend_tensor_dict: Issue all isend calls and return a List[P2PWork]. The key addition is tensor.record_stream(), which prevents CUDA tensors from being released prematurely (this safeguard exists in vllm but was missing from the original sglang).

  • Rewrite send_tensor_dict: Transform it into a simple wrapper around isend_tensor_dict + wait, maintaining API compatibility.

  • Introduce irecv_tensor_dict: Initiate all irecv calls at once, returning a tuple of (tensor_dict, handles, postprocess). The postprocess step reconstructs the tensors after all_gather.

  • Rewrite recv_tensor_dict: Transform it into a wrapper that combines irecv_tensor_dict + wait + postprocess.
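The send-side half of this pattern can be sketched roughly as follows. This is a minimal, CPU-only illustration of the isend-then-wait wrapper structure; `P2PWork` and the function bodies here are simplified stand-ins, not sglang's actual `parallel_state.py` implementation (which issues a `torch.distributed.isend` per tensor and calls `tensor.record_stream()` on the current CUDA stream).

```python
# Illustrative sketch only: P2PWork and isend_tensor_dict are simplified
# stand-ins for sglang's real distributed primitives.
from typing import Callable, Dict, List


class P2PWork:
    """Stand-in for an async work handle returned by dist.isend/irecv."""

    def __init__(self, on_wait: Callable[[], None]):
        self._on_wait = on_wait
        self.done = False

    def wait(self) -> None:
        self._on_wait()
        self.done = True


def isend_tensor_dict(tensor_dict: Dict[str, object]) -> List[P2PWork]:
    # Real version: for each tensor, call dist.isend(...) and
    # tensor.record_stream(torch.cuda.current_stream()) so the CUDA
    # caching allocator does not reuse the buffer before NCCL has
    # finished sending from it.
    return [P2PWork(lambda: None) for _ in tensor_dict]


def send_tensor_dict(tensor_dict: Dict[str, object]) -> List[P2PWork]:
    # Synchronous wrapper: issue every isend, then block on all handles,
    # preserving the old blocking-send API.
    works = isend_tensor_dict(tensor_dict)
    for work in works:
        work.wait()
    return works
```

The point of the split is that callers who can tolerate latency keep using `send_tensor_dict`, while the pipeline scheduler calls `isend_tensor_dict` directly and defers the waits.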

Performance improved by 5-7%. More comprehensive performance tests will follow.

Modifications

Accuracy Tests

Accuracy shows no regression:
Server:

➜  python git:(refactor_async_pp) ✗ python -m sglang.launch_server --model-path Qwen/Qwen3-VL-8B-Instruct --tp-size 2 --pp-size 2

Client

➜  sglang_dev git:(refactor_async_pp) ✗ OPENAI_API_BASE=http://0.0.0.0:30000/v1 OPENAI_API_KEY="" python3 -m lmms_eval     --model openai_compatible     --model_args model_version=Qwen/Qwen3-VL-8B-Instruct  --tasks mmmu_val     --batch_size 16
2026-02-28 14:22:18 | INFO     | __main__:cli_evaluate:476 - Verbosity set to INFO
2026-02-28 14:22:21 | INFO     | __main__:cli_evaluate_single:565 - Evaluation tracker args: {}
2026-02-28 14:22:21 | INFO     | __main__:cli_evaluate_single:649 - Selected Tasks: ['mmmu_val']
2026-02-28 14:22:21 | INFO     | lmms_eval.evaluator:simple_evaluate:170 - Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2026-02-28 14:22:22 | INFO     | lmms_eval.evaluator:evaluate:515 - Running on rank 0 (local rank 0)
2026-02-28 14:22:22 | INFO     | lmms_eval.api.task:build_all_requests:428 - Building contexts for mmmu_val on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 13380.05it/s]
2026-02-28 14:22:22 | INFO     | lmms_eval.evaluator:evaluate:609 - Running generate_until requests
Model Responding: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 899/900 [02:08<00:00,  9.95it/s]2026-02-28 14:24:30 | INFO     | lmms_eval.models.model_utils.gen_metrics:log_metrics:136 - Metric summary - Total elapsed time: 2450.723s, Total gen tokens: 25464, Avg speed: 10.4 tokens/s
Model Responding: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [02:08<00:00,  7.01it/s]
Postprocessing: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 9727.30it/s]
{'Overall-Art and Design': {'num': 120, 'acc': 0.7}, 'Art': {'num': 30, 'acc': 0.73333}, 'Art_Theory': {'num': 30, 'acc': 0.9}, 'Design': {'num': 30, 'acc': 0.73333}, 'Music': {'num': 30, 'acc': 0.43333}, 'Overall-Business': {'num': 150, 'acc': 0.44}, 'Accounting': {'num': 30, 'acc': 0.4}, 'Economics': {'num': 30, 'acc': 0.56667}, 'Finance': {'num': 30, 'acc': 0.23333}, 'Manage': {'num': 30, 'acc': 0.43333}, 'Marketing': {'num': 30, 'acc': 0.56667}, 'Overall-Science': {'num': 150, 'acc': 0.44}, 'Biology': {'num': 30, 'acc': 0.53333}, 'Chemistry': {'num': 30, 'acc': 0.36667}, 'Geography': {'num': 30, 'acc': 0.53333}, 'Math': {'num': 30, 'acc': 0.26667}, 'Physics': {'num': 30, 'acc': 0.5}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.53333}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.66667}, 'Clinical_Medicine': {'num': 30, 'acc': 0.6}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.36667}, 'Pharmacy': {'num': 30, 'acc': 0.5}, 'Public_Health': {'num': 30, 'acc': 0.53333}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.675}, 'History': {'num': 30, 'acc': 0.63333}, 'Literature': {'num': 30, 'acc': 0.83333}, 'Sociology': {'num': 30, 'acc': 0.6}, 'Psychology': {'num': 30, 'acc': 0.63333}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.38571}, 'Agriculture': {'num': 30, 'acc': 0.53333}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.2}, 'Computer_Science': {'num': 30, 'acc': 0.53333}, 'Electronics': {'num': 30, 'acc': 0.3}, 'Energy_and_Power': {'num': 30, 'acc': 0.4}, 'Materials': {'num': 30, 'acc': 0.33333}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.4}, 'Overall': {'num': 900, 'acc': 0.50889}}
2026-02-28 14:24:31 | INFO     | lmms_eval.loggers.evaluation_tracker:save_results_aggregated:238 - Output path not provided, skipping saving results aggregated
openai_compatible (model_version=Qwen/Qwen3-VL-8B-Instruct), gen_kwargs: (), limit: None, offset: 0, num_fewshot: None, batch_size: 16
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.5089|±  |N/A   |


Throughput Summary

|      Metric      |  Value   |  Unit  |
|------------------|---------:|--------|
|total_gen_tokens  |25464.0000|tokens  |
|total_elapsed_time| 2450.7231|seconds |
|avg_speed         |   10.3904|tokens/s|

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the asynchronous pipeline parallelism mechanism to introduce true asynchronous communication, replacing a previous "fake" async implementation. By introducing non-blocking send and receive operations (isend_tensor_dict and irecv_tensor_dict) and wrapping existing synchronous calls around these new primitives, the system can now effectively overlap communication and computation. This fundamental change aims to improve the efficiency of distributed tensor operations, leading to notable performance gains in the pipeline.

Highlights

  • Asynchronous Send Mechanism: Introduced isend_tensor_dict for non-blocking tensor dictionary sends, incorporating tensor.record_stream() to prevent premature CUDA tensor release.
  • Synchronous Send Wrapper: Refactored send_tensor_dict to act as a synchronous wrapper around the new isend_tensor_dict, maintaining API compatibility.
  • Asynchronous Receive Mechanism: Implemented irecv_tensor_dict for non-blocking tensor dictionary receives, returning tensor data, work handles, and post-processing functions for true async operation.
  • Synchronous Receive Wrapper: Transformed recv_tensor_dict into a synchronous wrapper that utilizes irecv_tensor_dict, handles, and post-processing.
  • Performance Improvement: Achieved a 5-7% performance improvement by enabling overlap of communication and computation in pipeline parallelism.


Changelog
  • python/sglang/srt/distributed/parallel_state.py
    • Renamed send_tensor_dict to isend_tensor_dict and modified it to perform non-blocking sends, returning P2PWork handles.
    • Added tensor.record_stream() to isend_tensor_dict for CUDA tensor stream synchronization.
    • Introduced a new send_tensor_dict function as a synchronous wrapper around isend_tensor_dict.
    • Created irecv_tensor_dict for non-blocking tensor dictionary receives, returning the tensor dictionary, work handles, and post-processing functions.
    • Refactored recv_tensor_dict to be a synchronous wrapper that calls irecv_tensor_dict and waits for completion.
  • python/sglang/srt/managers/scheduler_pp_mixin.py
    • Replaced direct calls to _pp_recv_proxy_tensors with _pp_irecv_proxy_tensors to initiate asynchronous receives.
    • Introduced calls to _pp_wait_proxy_tensors to explicitly wait for the completion of asynchronous receives.
    • Added new helper methods _pp_irecv_proxy_tensors, _pp_wait_proxy_tensors, _pp_irecv_dict_from_prev_stage, and _pp_wait_dict_from_prev_stage to manage the new asynchronous communication flow.
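The receive-side split described in the changelog (initiate irecv early, wait only when the data is needed) can be sketched as below. The function names mirror the PR's, but the bodies are hypothetical placeholders, not sglang's real code.

```python
# Hypothetical sketch of the irecv/wait split; not sglang's implementation.
def irecv_tensor_dict():
    # Real version pre-allocates receive buffers, issues a dist.irecv for
    # each, and builds a postprocess closure (e.g. to reassemble tensors
    # that were split for all_gather).
    tensor_dict = {"hidden_states": [0.0] * 4}  # pre-allocated buffers
    handles = []                                # async work handles
    postprocess = lambda d: dict(d)             # reconstruction step
    return tensor_dict, handles, postprocess


def recv_tensor_dict():
    # Synchronous wrapper: irecv + wait + postprocess, keeping the old
    # blocking-receive API intact.
    tensor_dict, handles, postprocess = irecv_tensor_dict()
    for handle in handles:
        handle.wait()
    return postprocess(tensor_dict)


# Overlap pattern used by the scheduler mixin (pseudocode):
#   td, handles, post = irecv_tensor_dict()   # start receives early
#   run_compute_for_other_microbatch()        # overlaps with the transfer
#   for h in handles: h.wait()                # block only when data is needed
#   proxy_tensors = post(td)
```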
Activity
  • No human activity (comments, reviews, etc.) has occurred on this pull request yet.

@yuan-luo yuan-luo requested review from Fridge003 and XucSh February 28, 2026 14:05
@yuan-luo
Collaborator Author

/tag-and-rerun-ci

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the pipeline parallelism mechanism to be truly asynchronous, aiming to improve performance by overlapping communication and computation. The changes introduce isend_tensor_dict and irecv_tensor_dict for non-blocking communication, and the scheduler logic is updated accordingly. The new asynchronous API is well-designed and clear. However, I've identified two critical bugs in scheduler_pp_mixin.py where the return value of the new asynchronous receive function is not handled correctly. This would likely break the pipeline parallelism for disaggregated prefill and decode modes. I have provided specific comments with code suggestions to address these issues. After fixing these bugs, this refactoring should deliver the intended performance improvements.

Comment thread python/sglang/srt/managers/scheduler_pp_mixin.py Outdated
Comment thread python/sglang/srt/managers/scheduler_pp_mixin.py Outdated
@yuan-luo yuan-luo force-pushed the refactor_async_pp branch 2 times, most recently from 4a2b2ae to 2e10e94 Compare February 28, 2026 14:41
@yuan-luo
Collaborator Author

yuan-luo commented Mar 1, 2026

/rerun-failed-ci

2 similar comments
@yuan-luo
Collaborator Author

yuan-luo commented Mar 1, 2026

/rerun-failed-ci

@yuan-luo
Collaborator Author

yuan-luo commented Mar 1, 2026

/rerun-failed-ci

@yuan-luo yuan-luo force-pushed the refactor_async_pp branch from 2e10e94 to 703205b Compare March 1, 2026 06:19
@yuan-luo
Collaborator Author

yuan-luo commented Mar 1, 2026

/rerun-failed-ci

3 similar comments
@yuan-luo
Collaborator Author

yuan-luo commented Mar 1, 2026

/rerun-failed-ci

@yuan-luo
Collaborator Author

yuan-luo commented Mar 1, 2026

/rerun-failed-ci

@yuan-luo
Collaborator Author

yuan-luo commented Mar 1, 2026

/rerun-failed-ci

@yuan-luo
Collaborator Author

yuan-luo commented Mar 3, 2026

/rerun-failed-ci

1 similar comment
@yuan-luo
Collaborator Author

/rerun-failed-ci

@yuan-luo yuan-luo force-pushed the refactor_async_pp branch from 703205b to 148cf53 Compare March 13, 2026 02:40
@ShangmingCai
Collaborator

I have seen many PP bug reports after the Spring Festival. I think these bugs might be related to the async communication after the CP refactor, some modifications to the scheduler, and some race conditions. I have merged a fix PR #20341 that might fix some of them. Let's wait and see whether any bugs pop up; if not, we can optimize the performance further. Let's make it stable first.

Also, I think we don't need to change the send part; let's just focus on parallelizing the recv part.

@ShangmingCai
Collaborator

We can come back to this PR later if no further bug has been reported.

@yuan-luo
Collaborator Author

/rerun-failed-ci

@yuan-luo
Collaborator Author

We can come back to this PR later if no further bug has been reported.

@ShangmingCai Sure, let's come back when async PP is more stable.

@yuan-luo yuan-luo force-pushed the refactor_async_pp branch from 148cf53 to a33bc95 Compare March 24, 2026 13:45
@yuan-luo
Collaborator Author

/tag-and-rerun-ci
