[5/N] (Elastic EP) Use GPU P2P to exchange expert weights during EPLB as much as possible#12068
Conversation
Summary of Changes (Gemini Code Assist)
This pull request introduces an optimization to the Elastic EP (Expert Parallelism) rebalancing mechanism. By utilizing GPU Peer-to-Peer (P2P) communication, the system can now exchange expert weights between active GPUs directly. This change drastically reduces the need for full weight reloads from disk during rebalancing, leading to a substantial improvement in performance, as evidenced by the reported 88% reduction in rebalancing time. The `ExpertLocationUpdater` has been enhanced to manage this process, ensuring that only truly missing experts are reloaded from storage.
Code Review
This pull request introduces a significant optimization for Elastic Expert Parallelism by using GPU P2P communication to exchange expert weights during rebalancing, falling back to disk loading only for weights on failed peers. This drastically reduces rebalancing time as shown by the unit test results. The implementation is well-structured, correctly identifying and handling missing experts. I've added one suggestion to further optimize the logic for filtering missing expert weights, which will improve performance and readability.
…ght_name_filter` method
# Conflicts: # python/sglang/srt/model_executor/model_runner.py
```diff
@@ -325,6 +345,10 @@ def _create_p2p_recv_and_buffer2weight_copy(
     src_rank: int,
     dst_expert_location: int,
 ):
+    if not active_ranks[src_rank]:
+        # The logical expert cannot be loaded from peers
+        missing_logical_experts.append(logical_expert_id)
+        return
     p2p_op_infos.append(
         (
             logical_expert_id,
@@ -385,6 +409,7 @@ def _create_isend_ops_of_logical_expert_id(
             peer=dst_rank,
         )
         for dst_rank in all_dst_ranks
+        if active_ranks[dst_rank]
         for i in range(num_tensors)
     ],
 )
@@ -467,7 +492,7 @@ def _get_local_expert_location(expert_location: int) -> int:
     _entrypoint()
-    return output_logs
+    return output_logs, missing_logical_experts
```
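Taken in isolation, the gating pattern from these hunks can be sketched as a standalone helper. Only `active_ranks` and `missing_logical_experts` come from the diff; the `plan_recv_ops` function, its arguments, and the op payloads are hypothetical placeholders for the real P2P op construction:

```python
def plan_recv_ops(expert_src_rank, active_ranks):
    """Build P2P receive plans, collecting experts whose source rank is dead.

    expert_src_rank: {logical_expert_id: src_rank}
    active_ranks:    list[bool], indexed by rank
    """
    p2p_op_infos = []
    missing_logical_experts = []
    for logical_expert_id, src_rank in expert_src_rank.items():
        if not active_ranks[src_rank]:
            # The logical expert cannot be loaded from peers;
            # fall back to reloading it from disk.
            missing_logical_experts.append(logical_expert_id)
            continue
        p2p_op_infos.append((logical_expert_id, src_rank))
    return p2p_op_infos, missing_logical_experts
```

With `active_ranks=[True, False]`, any expert sourced from rank 1 lands in the missing list instead of the P2P plan, which mirrors how the diff routes dead-peer experts to the disk-reload path.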
Is it possible to replace these changes with a clearer implementation? I find that I need to view the whole file to understand what this block is doing, LOL.
Yes. I moved that logic into a function `_filter_p2p_ops`. In addition, I changed the data structure of `missing_logical_experts_by_layers` to a map, which is more intuitive.
ShangmingCai left a comment:
Overall, LGTM for the logic, only some nits.
Most of the CI checks have passed. Do we need to rerun the remaining tests?
/rerun-failed-ci
… as much as possible (sgl-project#12068) Co-authored-by: Hank Han <hanhan.hank@bytedance.com> Co-authored-by: Hank Han <hanhan7630@outlook.com>

Motivation
In the Elastic EP core structure introduced in #10606, a full weight reload from disk was required during rebalancing when Elastic EP is enabled.
This PR optimizes that step by applying GPU P2P communication as much as possible, reusing the `ExpertLocationUpdater` implemented by @fzyzcjy. Only the weights on dead peers have to be reloaded from disk.
The unit test `ep.test_mooncake_ep_small.TestMooncakeWithEPLB` shows a decrease in rebalancing time from ~1.7s to ~0.2s.
Modifications
The major changes are in `expert_location_updater.py`. The logical experts that could not be loaded from peers are recorded in `all_missing_logical_experts`; the `ModelRunner` then loads these weights from disk.
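One way the disk-reload path could restrict loading to only the missing experts is a weight-name predicate built from the missing-expert map. This is a hypothetical sketch: the helper name, the map shape, and the `model.layers.{layer}.mlp.experts.{expert}.` naming pattern are all assumptions for illustration, not the actual sglang weight layout:

```python
def make_missing_weight_filter(missing_by_layer):
    """Return a predicate matching weight names of missing experts only.

    missing_by_layer: {layer_id: [logical_expert_id, ...]}
    Assumes weight names of the (hypothetical) form
    'model.layers.{layer}.mlp.experts.{expert}.<param>'.
    """
    prefixes = tuple(
        f"model.layers.{layer}.mlp.experts.{expert}."
        for layer, experts in missing_by_layer.items()
        for expert in experts
    )
    # str.startswith accepts a tuple of prefixes; an empty tuple matches nothing.
    return lambda name: name.startswith(prefixes)
```

A model runner could pass such a predicate to its weight loader so that, during rebalancing, only the tensors for dead-peer experts are read from storage rather than the full checkpoint.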
All existing unit tests should pass.
Benchmarking and Profiling
Unit test
ep.test_mooncake_ep_small.TestMooncakeWithEPLBshows a decrease in rebalancing time from ~1.7s to ~0.2s.All existing performance tests should pass.
Checklist