[5/N] (Elastic EP) Use GPU P2P to exchange expert weights during EPLB as much as possible#12068

Merged
ShangmingCai merged 28 commits into sgl-project:main from HanHan009527:elastic-eplb-p2p
Mar 16, 2026

Conversation

@UNIDY2002 (Contributor)

Motivation

In the Elastic EP core structure introduced in #10606, rebalancing required a full reload of expert weights from disk whenever Elastic EP was enabled.

This PR optimizes that step by applying GPU P2P communication wherever possible, reusing the ExpertLocationUpdater implemented by @fzyzcjy. Only the weights that lived on dead peers have to be reloaded from disk.

Unit test ep.test_mooncake_ep_small.TestMooncakeWithEPLB shows a decrease in rebalancing time from ~1.7s to ~0.2s.

Modifications

The major changes are in expert_location_updater.py. Logical experts that cannot be fetched from any live peer are recorded in all_missing_logical_experts, and the ModelRunner then loads those weights from disk.
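The flow described above can be sketched roughly as follows. This is a minimal illustration only, not the actual sglang code: the function names `update_expert_locations` and `reload_from_disk` are hypothetical stand-ins for the ExpertLocationUpdater / ModelRunner interaction, and the dict-based "weights" stand in for real GPU tensors.

```python
# Minimal sketch of the rebalancing flow: try to source each expert from a
# live peer via P2P; fall back to disk only for experts whose sole copy
# lived on a dead rank. Names here are hypothetical, not the sglang API.

def update_expert_locations(new_placement, current_weights, active_ranks):
    """Try to source each newly placed logical expert from a live peer.

    Returns a dict mapping layer_id -> list of logical expert ids that
    no active peer holds, so they must be reloaded from disk.
    """
    missing_by_layer = {}
    for layer_id, placements in new_placement.items():
        for logical_expert_id, src_rank in placements:
            if active_ranks[src_rank]:
                # In the real implementation this would enqueue a GPU P2P
                # recv from src_rank instead of a dict assignment.
                current_weights.setdefault(layer_id, {})[logical_expert_id] = (
                    "p2p", src_rank)
            else:
                # The only copy lives on a dead peer: mark for disk reload.
                missing_by_layer.setdefault(layer_id, []).append(
                    logical_expert_id)
    return missing_by_layer


def reload_from_disk(missing_by_layer, current_weights):
    # Stand-in for the ModelRunner loading only the missing weights.
    for layer_id, experts in missing_by_layer.items():
        for logical_expert_id in experts:
            current_weights.setdefault(layer_id, {})[logical_expert_id] = (
                "disk", None)
```

The key property is that the disk path touches only the experts in the returned map, rather than every expert, which is where the reported drop in rebalancing time comes from.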

Accuracy Tests

All existing unit tests should pass.

Benchmarking and Profiling

Unit test ep.test_mooncake_ep_small.TestMooncakeWithEPLB shows a decrease in rebalancing time from ~1.7s to ~0.2s.

All existing performance tests should pass.

Checklist

@gemini-code-assist (Contributor)

Summary of Changes

Hello @UNIDY2002, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial optimization to the Elastic EP (Expert Parallelism) rebalancing mechanism. By intelligently utilizing GPU Peer-to-Peer (P2P) communication, the system can now exchange expert weights between active GPUs directly. This change drastically reduces the need for full weight reloads from disk during rebalancing, leading to a substantial improvement in performance, as evidenced by the reported 88% reduction in rebalancing time. The ExpertLocationUpdater has been enhanced to manage this process, ensuring that only truly missing experts are reloaded from storage.

Highlights

  • Optimized Elastic EP Rebalancing: Implements GPU P2P communication for expert weight exchange, significantly reducing rebalancing time by avoiding full disk reloads.
  • Reduced Disk I/O: Only weights from 'dead' or inactive peers are reloaded from disk, drastically cutting down on I/O operations during rebalancing.
  • Enhanced ExpertLocationUpdater: The updater now intelligently identifies and manages 'all_missing_logical_experts', triggering targeted disk reloads only for these specific weights.
  • Performance Improvement: Unit tests demonstrate a substantial reduction in rebalancing time, decreasing from approximately 1.7 seconds to 0.2 seconds.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a significant optimization for Elastic Expert Parallelism by using GPU P2P communication to exchange expert weights during rebalancing, falling back to disk loading only for weights on failed peers. This drastically reduces rebalancing time as shown by the unit test results. The implementation is well-structured, correctly identifying and handling missing experts. I've added one suggestion to further optimize the logic for filtering missing expert weights, which will improve performance and readability.

Comment thread python/sglang/srt/eplb/expert_location_updater.py Outdated
@b8zhong b8zhong added the run-ci label Oct 24, 2025
@UNIDY2002 UNIDY2002 changed the title [4/N] (Elastic EP) Use GPU P2P to exchange expert weights during EPLB as much as possible [5/N] (Elastic EP) Use GPU P2P to exchange expert weights during EPLB as much as possible Oct 25, 2025
Copy link
Copy Markdown
Collaborator

@fzyzcjy fzyzcjy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits

Comment thread python/sglang/srt/eplb/expert_location_updater.py Outdated
@UNIDY2002 UNIDY2002 requested a review from fzyzcjy November 17, 2025 03:33
Comment thread python/sglang/srt/elastic_ep/expert_backup_client.py Outdated
Comment on lines +220 to +495
@@ -325,6 +345,10 @@ def _create_p2p_recv_and_buffer2weight_copy(
     src_rank: int,
     dst_expert_location: int,
 ):
+    if not active_ranks[src_rank]:
+        # The logical expert cannot be loaded from peers
+        missing_logical_experts.append(logical_expert_id)
+        return
     p2p_op_infos.append(
         (
             logical_expert_id,
@@ -385,6 +409,7 @@ def _create_isend_ops_of_logical_expert_id(
             peer=dst_rank,
         )
         for dst_rank in all_dst_ranks
+        if active_ranks[dst_rank]
         for i in range(num_tensors)
     ],
 )
@@ -467,7 +492,7 @@ def _get_local_expert_location(expert_location: int) -> int:

     _entrypoint()

-    return output_logs
+    return output_logs, missing_logical_experts
Collaborator

Is it possible to replace these changes with a clearer implementation? I find that I need to view the whole file to understand what this block is doing, LOL.

Contributor, Author

Yes. I moved that logic into a function _filter_p2p_ops. In addition, I changed the data structure of missing_logical_experts_by_layers to a map, which is more intuitive.

@ShangmingCai (Collaborator) left a comment
Overall, LGTM for the logic, only some nits.

@UNIDY2002 (Contributor, Author)

Most of the CI checks have passed. Do we need to rerun the remaining tests?

Comment thread python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py
Comment thread python/sglang/srt/eplb/expert_location_updater.py Outdated
Comment thread python/sglang/srt/eplb/expert_location_updater.py Outdated
@UNIDY2002 UNIDY2002 requested a review from ShangmingCai March 13, 2026 11:21
@ShangmingCai (Collaborator) left a comment

LGTM

@ShangmingCai (Collaborator)

/rerun-failed-ci

@UNIDY2002 (Contributor, Author)

The failed tests seem to be unrelated. Could someone double-check?
(screenshot: 2026-03-16 18:03:01)

@ShangmingCai ShangmingCai merged commit 549fbcc into sgl-project:main Mar 16, 2026
344 of 394 checks passed
@UNIDY2002 UNIDY2002 deleted the elastic-eplb-p2p branch March 16, 2026 11:02
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
… as much as possible (sgl-project#12068)

Co-authored-by: Hank Han <hanhan.hank@bytedance.com>
Co-authored-by: Hank Han <hanhan7630@outlook.com>
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
… as much as possible (sgl-project#12068)

Co-authored-by: Hank Han <hanhan.hank@bytedance.com>
Co-authored-by: Hank Han <hanhan7630@outlook.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
… as much as possible (sgl-project#12068)

Co-authored-by: Hank Han <hanhan.hank@bytedance.com>
Co-authored-by: Hank Han <hanhan7630@outlook.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
… as much as possible (sgl-project#12068)

Co-authored-by: Hank Han <hanhan.hank@bytedance.com>
Co-authored-by: Hank Han <hanhan7630@outlook.com>