[6/N] (Elastic EP) Recover failed ranks #15771
Conversation
Force-pushed from 8ce7735 to f620cab
Force-pushed from f620cab to 50203d8
Skip the EPLB rebalance after rank recovery and sync expert weights directly to recovered ranks. Using EPLB here causes asymmetric P2P operations because old_expert_location_metadata is incorrect for recovered ranks. The root cause is that ExpertLocationUpdater relies on both old_expert_location_metadata and new_expert_location_metadata; for a recovered rank, the old metadata is stale, so the wrong P2P operations are calculated (e.g., rank 6 expects to send to rank 7, but rank 7 does not post a corresponding irecv op). Instead of modifying expert_location_updater.py, the correct approach is to skip EPLB after "EPLB due to rank faults" and let the expert weights be synced through normal Mooncake EP operation. See: sgl-project#15771
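As a toy illustration (not sglang code; all names here are made up), the sketch below shows why a stale old placement on a recovered rank yields an isend with no matching irecv: each rank derives its P2P ops from its own view of the (old, new) expert placements, and the plans only match if all views agree.

```python
# Toy model: placement[i] = rank that owns expert i.
def plan_ops(rank, old_placement, new_placement):
    """Return the P2P ops this rank would schedule for each expert move."""
    ops = []
    for expert, (src, dst) in enumerate(zip(old_placement, new_placement)):
        if src == dst:
            continue  # expert does not move, no transfer needed
        if rank == src:
            ops.append(("isend", expert, dst))
        if rank == dst:
            ops.append(("irecv", expert, src))
    return ops

# Ground truth: expert 0 moves from rank 6 to rank 7.
true_old, new = [6], [7]

# Rank 6 (healthy) sees the true old placement and schedules an isend.
print(plan_ops(6, true_old, new))   # [('isend', 0, 7)]

# Rank 7 (recovered) has stale old metadata claiming it already owns
# expert 0, so it schedules no irecv -- rank 6's isend has no receiver.
stale_old = [7]
print(plan_ops(7, stale_old, new))  # []
```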
Force-pushed from 50203d8 to 3cd8a7f
Force-pushed from 3cd8a7f to c1e0869
Force-pushed from 4ad0286 to 4fdd9fe
Force-pushed from 8477c94 to 081c992
Force-pushed from 081c992 to 6b2f08e
/gemini review |
Code Review
This pull request introduces support for elastic Expert Parallel (EP) rank recovery and rejoining, primarily for the Mooncake backend. It adds a new --elastic-ep-rejoin CLI argument and implements the necessary logic to synchronize process groups, expert location metadata, and message queues when a rank is relaunched. The ModelRunner is updated to check for failed ranks and trigger recovery during the forward pass. Review feedback highlights critical issues regarding random seed synchronization: rejoining ranks must participate in the seed broadcast to avoid distributed hangs, and the broadcast source should be a verified healthy rank rather than a hardcoded rank 0, which might itself be undergoing recovery.
```python
if (
    self.server_args.elastic_ep_backend is not None
    and self.server_args.elastic_ep_rejoin
):
    join_process_groups()
    broadcast_global_expert_location_metadata(
        src_rank=self._get_healthy_expert_location_src_rank(
            invoked_in_elastic_ep_rejoin_path=True
        )
    )
    ElasticEPStateManager.instance().reset()
```
The rejoining rank must participate in the random seed broadcast to stay in sync with the healthy ranks. Currently, healthy ranks perform this broadcast in maybe_recover_ep_ranks after a successful recovery, but the rejoining rank does not call it in its initialization path, which will lead to a distributed hang. Additionally, the source rank for the broadcast should be a healthy rank identified via _get_healthy_expert_location_src_rank.
```python
if (
    self.server_args.elastic_ep_backend is not None
    and self.server_args.elastic_ep_rejoin
):
    join_process_groups()
    src_rank = self._get_healthy_expert_location_src_rank(
        invoked_in_elastic_ep_rejoin_path=True
    )
    broadcast_global_expert_location_metadata(src_rank=src_rank)
    # Synchronize random seed with healthy ranks
    self.server_args.random_seed = broadcast_pyobj(
        self.server_args.random_seed,
        get_world_group().rank,
        get_world_group().cpu_group,
        src=get_world_group().ranks.index(src_rank),
    )
    ElasticEPStateManager.instance().reset()
```
False positive. The random-seed broadcast happens after ModelRunner initialization, so no need to execute it here.
```python
broadcast_global_expert_location_metadata(
    src_rank=self._get_healthy_expert_location_src_rank(
        invoked_in_elastic_ep_rejoin_path=False
    )
)
ElasticEPStateManager.instance().reset()

broadcast_pyobj(
    [self.server_args.random_seed],
    get_world_group().rank,
    get_world_group().cpu_group,
    src=get_world_group().ranks[0],
)
```
The source rank for the random seed broadcast is hardcoded to get_world_group().ranks[0]. If rank 0 is the one being recovered, it will broadcast its uninitialized or potentially inconsistent state to the healthy ranks. It should instead use the src_rank (and its corresponding index in the group) identified as healthy. Also, broadcast_pyobj expects the object itself, and its return value should be used to update the local state to ensure consistency across all ranks.
```python
src_rank = self._get_healthy_expert_location_src_rank(
    invoked_in_elastic_ep_rejoin_path=False
)
broadcast_global_expert_location_metadata(src_rank=src_rank)
ElasticEPStateManager.instance().reset()
self.server_args.random_seed = broadcast_pyobj(
    self.server_args.random_seed,
    get_world_group().rank,
    get_world_group().cpu_group,
    src=get_world_group().ranks.index(src_rank),
)
```
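For intuition, here is a hypothetical sketch of the selection policy such a helper could implement (the real _get_healthy_expert_location_src_rank may differ): choose the lowest-numbered rank that is not currently failed or recovering, instead of hardcoding rank 0.

```python
# Hypothetical helper, for illustration only.
from typing import Iterable, Set

def pick_healthy_src_rank(world_ranks: Iterable[int], failed: Set[int]) -> int:
    """Return the smallest rank that is not failed/recovering."""
    for rank in sorted(world_ranks):
        if rank not in failed:
            return rank
    raise RuntimeError("no healthy rank available to act as broadcast source")

# If rank 0 is the one being recovered, the broadcast source falls back to rank 1.
print(pick_healthy_src_rank(range(8), failed={0}))  # 1
```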
This PR mainly brings incremental changes, and the logic is guarded with conditions so that non-elastic-EP paths are not affected. A minor performance issue in elastic EP recovery remains and will be addressed in follow-up PRs.
ShangmingCai left a comment:
Looks good. cc: @yizhang2077 @ch-wan
/tag-and-rerun-ci

/rerun-failed-ci

Wait, we have a new conflict.
Current CI errors arise from irrelevant upstream regressions. |
…y [skip ci]
# Conflicts:
#   python/sglang/srt/model_executor/model_runner.py

Motivation
As a follow-up to #11657, this PR enables SGLang to dynamically add back previously failed processes, restoring optimal throughput.
The core idea is as follows. When a node fails, the system admin can relaunch the node with an additional flag --elastic-ep-rejoin. Meanwhile, the remaining healthy processes continue serving ongoing inference requests and periodically poll the status of the relaunched process. Once the new process becomes ready, it seamlessly rejoins the existing process group. With this design, disruption to ongoing inference is minimized.

```mermaid
flowchart TD
    X[Healthy processes] --> Y[Failed process found]
    Y --> A
    A[Normal Inference Iteration] --> B{Is new process ready?}
    B -- No --> C[Run inference normally]
    C --> A
    B -- Yes --> D[Join new process into process group]
    D --> C
    U[New process] --> E
    E[Process Relaunched] --> F[Setup Python modules, CUDA, etc.]
    F --> H[Load weight & capture CUDA graph]
    H --> D
```
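As a rough sketch of the healthy-rank loop in the flowchart above (hypothetical names, not the actual sglang internals):

```python
# Each iteration serves inference normally; when the relaunched process
# reports ready, it is folded back into the process group.
def serve(steps, poll_new_process_ready, join_new_process, run_inference_step):
    for _ in range(steps):
        if poll_new_process_ready():  # "Is new process ready?"
            join_new_process()        # "Join new process into process group"
        run_inference_step()          # "Run inference normally"

# Tiny demo: the new process becomes ready on the third poll.
ready_at = iter([False, False, True, False, False])
serve(
    steps=5,
    poll_new_process_ready=lambda: next(ready_at),
    join_new_process=lambda: print("joined new process into group"),
    run_inference_step=lambda: print("inference step"),
)
```

Modifications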
- Add a new CLI argument --elastic-ep-rejoin to mark a relaunched elastic EP rank that should rejoin an existing Mooncake process group.
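A minimal sketch of how such a boolean flag is typically wired with argparse (illustrative only; the actual flag lives in SGLang's server-args plumbing):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--elastic-ep-rejoin",
    action="store_true",
    help="Mark a relaunched elastic EP rank that should rejoin an "
    "existing Mooncake process group.",
)

# A relaunched node passes the flag; a normal launch omits it.
args = parser.parse_args(["--elastic-ep-rejoin"])
assert args.elastic_ep_rejoin is True
```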
Accuracy Tests

Manual multi-node recovery test:
Node rank 0:
Node rank 1:
Confirm the service is healthy and serving requests.
Terminate the process on the larger-index node (--node-rank 1).
Relaunch that same node with exactly the same command line, only adding --elastic-ep-rejoin (see the illustrative command below).
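For illustration only (the real command is whatever originally launched the node, unchanged except for the extra flag), the relaunch might look like `python -m sglang.launch_server ... --nnodes 2 --node-rank 1 --elastic-ep-rejoin`, where the elided arguments are identical to the original launch.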
Expected result: the logs show `recover ranks [...] done.`

Benchmarking and Profiling
Checklist