[PP] Refactor Async Pipeline Parallelism #19582
yuan-luo wants to merge 1 commit into sgl-project:main from
Conversation
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refactors the asynchronous pipeline parallelism mechanism to introduce true asynchronous communication, replacing a previous "fake" async implementation, by introducing non-blocking send and receive operations (isend_tensor_dict and irecv_tensor_dict).
Activity
/tag-and-rerun-ci
Code Review
This pull request refactors the pipeline parallelism mechanism to be truly asynchronous, aiming to improve performance by overlapping communication and computation. The changes introduce isend_tensor_dict and irecv_tensor_dict for non-blocking communication, and the scheduler logic is updated accordingly. The new asynchronous API is well-designed and clear. However, I've identified two critical bugs in scheduler_pp_mixin.py where the return value of the new asynchronous receive function is not handled correctly. This would likely break the pipeline parallelism for disaggregated prefill and decode modes. I have provided specific comments with code suggestions to address these issues. After fixing these bugs, this refactoring should deliver the intended performance improvements.
Force-pushed 4a2b2ae to 2e10e94
/rerun-failed-ci
2 similar comments
Force-pushed 2e10e94 to 703205b
/rerun-failed-ci
3 similar comments
/rerun-failed-ci
1 similar comment
Force-pushed 703205b to 148cf53
I have seen many PP bug reports after the Spring Festival. I think these bugs might be related to the async communication after the CP refactor, some modifications to the scheduler, and some race conditions. I have merged a fix PR #20341 that might address some of them. Let's wait and see whether any more bugs pop up; if not, we can optimize the performance further. Let's make it stable first. Also, I think we don't need to change the send part; just focus on parallelizing the recv part.
We can come back to this PR later if no further bugs have been reported.
/rerun-failed-ci
@ShangmingCai Sure, let's come back when async PP is more stable. |
Force-pushed 148cf53 to a33bc95
/tag-and-rerun-ci
Motivation
Currently, the async-mode pipeline parallelism mechanism is a sort of "fake" async: in recv_tensor_dict, each tensor is immediately waited on after calling irecv, resulting in serial reception and preventing overlap of communication and computation. This PR refactors it into an authentic async PP mechanism with the following design:
- Introduce isend_tensor_dict: issue all isend calls and return a List[P2PWork]. The key addition is tensor.record_stream(), which prevents CUDA tensors from being released prematurely (this exists in vLLM but was missing from the original sglang).
- Rewrite send_tensor_dict: turn it into a simple wrapper around isend_tensor_dict + wait, maintaining API compatibility.
- Introduce irecv_tensor_dict: initiate all irecv calls at once, returning a tuple of (tensor_dict, handles, postprocess). The postprocess callback is used to reconstruct the result after all_gather.
- Rewrite recv_tensor_dict: turn it into a wrapper that combines irecv_tensor_dict + wait + postprocess.
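The wrapper design above can be sketched in plain Python. This is a hypothetical shape sketch, not the actual sglang implementation: `FakeWork` and `Channel` are invented stand-ins for the `P2PWork` handles and the point-to-point link that torch.distributed's `isend`/`irecv` would provide, and the real code operates on CUDA tensors (with `record_stream()`) rather than plain values.

```python
from typing import Any, Callable, Dict, List, Tuple

class FakeWork:
    """Stand-in for a P2PWork handle returned by a non-blocking isend/irecv."""
    def __init__(self, on_wait: Callable[[], None]):
        self._on_wait = on_wait
        self.done = False

    def wait(self):
        self._on_wait()
        self.done = True

class Channel:
    """Toy in-process channel standing in for a pipeline-stage P2P link."""
    def __init__(self):
        self._buf: Dict[str, Any] = {}

    def isend_tensor_dict(self, tensor_dict: Dict[str, Any]) -> List[FakeWork]:
        # Issue one non-blocking send per tensor and return all handles.
        # Real CUDA code would also call tensor.record_stream() here so the
        # memory stays alive until the NCCL send completes.
        return [FakeWork(lambda k=k, v=v: self._buf.__setitem__(k, v))
                for k, v in tensor_dict.items()]

    def send_tensor_dict(self, tensor_dict: Dict[str, Any]) -> None:
        # Blocking wrapper: isend_tensor_dict + wait on every handle.
        for work in self.isend_tensor_dict(tensor_dict):
            work.wait()

    def irecv_tensor_dict(self, keys) -> Tuple[Dict[str, Any], List[FakeWork], Callable]:
        # Initiate all receives at once; postprocess reconstructs the dict
        # once every handle has completed.
        out: Dict[str, Any] = {}
        works = [FakeWork(lambda k=k: out.__setitem__(k, self._buf[k])) for k in keys]
        return out, works, lambda: dict(out)

    def recv_tensor_dict(self, keys) -> Dict[str, Any]:
        # Blocking wrapper: irecv_tensor_dict + wait + postprocess.
        _, works, postprocess = self.irecv_tensor_dict(keys)
        for work in works:
            work.wait()
        return postprocess()

ch = Channel()
ch.send_tensor_dict({"hidden": [1.0, 2.0], "residual": [3.0]})
print(ch.recv_tensor_dict(["hidden", "residual"]))
# {'hidden': [1.0, 2.0], 'residual': [3.0]}
```

The point of the split API is that a caller can hold the `works` list from `irecv_tensor_dict`, run compute for the current microbatch, and only call `wait()` when the received tensors are actually needed, which is what lets communication overlap computation.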
Performance improved by 5-7%. A more comprehensive performance test will follow.
Modifications
Accuracy Tests
No accuracy drop:
Server:
Client:
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci