[5/N] (Elastic EP) Use GPU P2P to exchange expert weights during EPLB as much as possible#12068
Conversation
Summary of Changes (Gemini Code Assist)
This pull request introduces an optimization to the Elastic EP (Expert Parallelism) rebalancing mechanism. By utilizing GPU Peer-to-Peer (P2P) communication, the system can now exchange expert weights between active GPUs directly. This change drastically reduces the need for full weight reloads from disk during rebalancing, leading to a substantial improvement in performance, as evidenced by the reported 88% reduction in rebalancing time. The `ExpertLocationUpdater` has been enhanced to manage this process, ensuring that only truly missing experts are reloaded from storage.
Code Review
This pull request introduces a significant optimization for Elastic Expert Parallelism by using GPU P2P communication to exchange expert weights during rebalancing, falling back to disk loading only for weights on failed peers. This drastically reduces rebalancing time as shown by the unit test results. The implementation is well-structured, correctly identifying and handling missing experts. I've added one suggestion to further optimize the logic for filtering missing expert weights, which will improve performance and readability.
…ght_name_filter` method
# Conflicts: # python/sglang/srt/model_executor/model_runner.py
```diff
@@ -325,6 +345,10 @@ def _create_p2p_recv_and_buffer2weight_copy(
     src_rank: int,
     dst_expert_location: int,
 ):
+    if not active_ranks[src_rank]:
+        # The logical expert cannot be loaded from peers
+        missing_logical_experts.append(logical_expert_id)
+        return
     p2p_op_infos.append(
         (
             logical_expert_id,
@@ -385,6 +409,7 @@ def _create_isend_ops_of_logical_expert_id(
             peer=dst_rank,
         )
         for dst_rank in all_dst_ranks
+        if active_ranks[dst_rank]
         for i in range(num_tensors)
     ],
 )
@@ -467,7 +492,7 @@ def _get_local_expert_location(expert_location: int) -> int:
     _entrypoint()
-    return output_logs
+    return output_logs, missing_logical_experts
```
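Taken in isolation, the gating pattern from these hunks can be sketched as a standalone helper. Only `active_ranks` and `missing_logical_experts` come from the diff; the `plan_recv_ops` function, its arguments, and the op payloads are hypothetical placeholders for the real P2P op construction:

```python
def plan_recv_ops(expert_src_rank, active_ranks):
    """Build P2P receive plans, collecting experts whose source rank is dead.

    expert_src_rank: {logical_expert_id: src_rank}
    active_ranks:    list[bool], indexed by rank
    """
    p2p_op_infos = []
    missing_logical_experts = []
    for logical_expert_id, src_rank in expert_src_rank.items():
        if not active_ranks[src_rank]:
            # The logical expert cannot be loaded from peers;
            # fall back to reloading it from disk.
            missing_logical_experts.append(logical_expert_id)
            continue
        p2p_op_infos.append((logical_expert_id, src_rank))
    return p2p_op_infos, missing_logical_experts
```

With `active_ranks=[True, False]`, any expert sourced from rank 1 lands in the missing list instead of the P2P plan, which mirrors how the diff routes dead-peer experts to the disk-reload path.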
Is it possible to replace these changes with a clearer implementation? I find that I need to view the whole file to understand what this block is doing, LOL.
Yes. I moved that logic into a function `_filter_p2p_ops`. In addition, I changed the data structure of `missing_logical_experts_by_layers` to a map, which is more intuitive.
ShangmingCai left a comment:
Overall, LGTM for the logic, only some nits.
Most of the CI checks have passed. Do we need to rerun the remaining tests?
/rerun-failed-ci
… as much as possible (sgl-project#12068) Co-authored-by: Hank Han <hanhan.hank@bytedance.com> Co-authored-by: Hank Han <hanhan7630@outlook.com>

Motivation
In the Elastic EP core structure introduced in #10606, a full weight reload from disk was required during rebalancing when Elastic EP is enabled.
This PR optimizes that step by applying GPU P2P communication as much as possible, reusing the `ExpertLocationUpdater` implemented by @fzyzcjy. Only the weights on dead peers have to be reloaded from disk.
The unit test `ep.test_mooncake_ep_small.TestMooncakeWithEPLB` shows a decrease in rebalancing time from ~1.7s to ~0.2s.
Modifications
The major changes are in `expert_location_updater.py`. The logical experts that could not be loaded from peers are recorded in `all_missing_logical_experts`; the `ModelRunner` then loads these weights from disk.
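One way the disk-reload path could restrict loading to only the missing experts is a weight-name predicate built from the missing-expert map. This is a hypothetical sketch: the helper name, the map shape, and the `model.layers.{layer}.mlp.experts.{expert}.` naming pattern are all assumptions for illustration, not the actual sglang weight layout:

```python
def make_missing_weight_filter(missing_by_layer):
    """Return a predicate matching weight names of missing experts only.

    missing_by_layer: {layer_id: [logical_expert_id, ...]}
    Assumes weight names of the (hypothetical) form
    'model.layers.{layer}.mlp.experts.{expert}.<param>'.
    """
    prefixes = tuple(
        f"model.layers.{layer}.mlp.experts.{expert}."
        for layer, experts in missing_by_layer.items()
        for expert in experts
    )
    # str.startswith accepts a tuple of prefixes; an empty tuple matches nothing.
    return lambda name: name.startswith(prefixes)
```

A model runner could pass such a predicate to its weight loader so that, during rebalancing, only the tensors for dead-peer experts are read from storage rather than the full checkpoint.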
All existing unit tests should pass.
Benchmarking and Profiling
Unit test
ep.test_mooncake_ep_small.TestMooncakeWithEPLBshows a decrease in rebalancing time from ~1.7s to ~0.2s.All existing performance tests should pass.
Checklist