
[6/N] (Elastic EP) Recover failed ranks#15771

Merged
ch-wan merged 17 commits into sgl-project:main from HanHan009527:mooncake-pr-recovery on Apr 28, 2026

Conversation

@UNIDY2002 (Contributor) commented Dec 24, 2025

Motivation

As a follow-up to #11657, this PR enables SGLang to dynamically add back previously failed processes, recovering the optimal throughput.

The core idea is as follows. When a node fails, the system admin can relaunch the node with an additional flag --elastic-ep-rejoin. Meanwhile, the remaining healthy processes continue serving ongoing inference requests and periodically poll the status of the relaunched process. Once the new process becomes ready, it seamlessly rejoins the existing process group. With this design, disruption to ongoing inference is minimized.

flowchart TD
    X[Healthy processes] --> Y[Failed process found]
    Y --> A
    A[Normal Inference Iteration] --> B{Is new process ready?}
    B -- No --> C[Run inference normally]
    C --> A
    B -- Yes --> D[Join new process into process group]
    D --> C

    U[New process] --> E
    E[Process Relaunched] --> F[Setup Python modules, CUDA, etc.]
    F --> H[Load weight & capture CUDA graph]
    H --> D
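For readers who want the control flow in code form, below is a minimal, runnable Python sketch of the healthy-rank loop from the diagram. The helpers peer_is_ready, rejoin_peer, and run_inference_step are hypothetical stand-ins for the PR's internal recovery helpers, not SGLang's actual API; the readiness probe is simulated.

import random

def peer_is_ready(rank: int) -> bool:
    # Hypothetical readiness probe; the real PR polls the relaunched
    # process through the Mooncake backend between inference iterations.
    return random.random() < 0.2  # simulate the peer eventually coming up

def rejoin_peer(rank: int) -> None:
    # Hypothetical stand-in for re-forming the process groups and
    # re-syncing expert-location metadata with the recovered rank.
    print(f"rank {rank} rejoined the process group")

def run_inference_step() -> None:
    pass  # placeholder for one normal inference iteration

def serve_until_recovered(failed_rank: int) -> None:
    # Healthy ranks keep serving; between iterations they poll the
    # relaunched process and fold it back in once it is ready.
    while True:
        run_inference_step()
        if peer_is_ready(failed_rank):
            rejoin_peer(failed_rank)
            break

serve_until_recovered(failed_rank=8)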

Modifications

  • Add a new server flag --elastic-ep-rejoin to mark a relaunched elastic EP rank that should rejoin an existing Mooncake process group (sketched after this list).
  • Spread the recovered-rank state throughout distributed initialization so the process groups are created in recovery mode when needed.
  • Add elastic EP recovery helpers that poll peer readiness and recover all process groups.
  • Resynchronize expert placement metadata after recovery so recovered ranks observe the same global expert-location state as healthy ranks.
  • Reset elastic EP active-rank state and the EPLB generator after rank recovery so scheduling resumes from a consistent post-rejoin state.
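
A hypothetical sketch of how such a boolean flag is typically plumbed through an argparse-based server CLI. The real flag lives in SGLang's ServerArgs; the parser here is illustrative only.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--elastic-ep-rejoin",
    action="store_true",
    help="Mark this process as a relaunched elastic EP rank that should "
    "rejoin an existing Mooncake process group.",
)

# Simulate a relaunched rank's command line.
args = parser.parse_args(["--elastic-ep-rejoin"])
assert args.elastic_ep_rejoin  # downstream init switches to recovery mode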

Accuracy Tests

Manual multi-node recovery test:

  1. Start serving on 2 nodes with elastic EP enabled and enough redundant experts to tolerate one node failure.

Node rank 0:

sglang serve --model-path <model-path> --trust-remote-code --tp 16 --elastic-ep-backend mooncake --mooncake-ib-device <ib-device-list> --moe-a2a-backend mooncake --deepep-mode low_latency --moe-dense-tp-size 1 --enable-dp-lm-head --enable-two-batch-overlap --disable-custom-all-reduce --enable-eplb --ep-num-redundant-experts <num-redundant-experts> --chunked-prefill-size 512 --cuda-graph-max-bs 16 --max-running-requests 512 --mem-fraction-static 0.5 --enable-dp-attention --dp 16 --device cuda --host 127.0.0.1 --dist-init-addr <ip:port> --port 21000 --nnodes 2 --node-rank 0

Node rank 1:

sglang serve --model-path <model-path> --trust-remote-code --tp 16 --elastic-ep-backend mooncake --mooncake-ib-device <ib-device-list> --moe-a2a-backend mooncake --deepep-mode low_latency --moe-dense-tp-size 1 --enable-dp-lm-head --enable-two-batch-overlap --disable-custom-all-reduce --enable-eplb --ep-num-redundant-experts <num-redundant-experts> --chunked-prefill-size 512 --cuda-graph-max-bs 16 --max-running-requests 512 --mem-fraction-static 0.5 --enable-dp-attention --dp 16 --device cuda --host 127.0.0.1 --dist-init-addr <ip:port> --port 21000 --nnodes 2 --node-rank 1
  2. Confirm the service is healthy and serving requests.

  3. Terminate the process on the larger-index node (--node-rank 1).

  4. Relaunch that same node with exactly the same command line, only adding --elastic-ep-rejoin:

sglang serve --model-path <model-path> --trust-remote-code --tp 16 --elastic-ep-backend mooncake --mooncake-ib-device <ib-device-list> --moe-a2a-backend mooncake --deepep-mode low_latency --moe-dense-tp-size 1 --enable-dp-lm-head --enable-two-batch-overlap --disable-custom-all-reduce --enable-eplb --ep-num-redundant-experts <num-redundant-experts> --chunked-prefill-size 512 --cuda-graph-max-bs 16 --max-running-requests 512 --mem-fraction-static 0.5 --enable-dp-attention --dp 16 --device cuda --host 127.0.0.1 --dist-init-addr <ip:port> --port 21000 --nnodes 2 --node-rank 1 --elastic-ep-rejoin
  5. Verify the relaunched rank rejoins successfully and the cluster returns to the healthy state.

Expected result:

  • The healthy node continues serving while the replacement rank initializes.
  • After the relaunched rank becomes ready, the process groups recover and the node rejoins successfully.
  • Logs on healthy ranks should show recovery completing, e.g. recover ranks [...] done.
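
A small script to watch the recovery from the outside, assuming the server exposes a /health liveness route on the serving port used above (127.0.0.1:21000); adjust the URL for your deployment.

import time
import requests

# Confirm the healthy node keeps answering while the replacement rank
# initializes; /health is assumed to be the server's liveness route.
URL = "http://127.0.0.1:21000/health"

for _ in range(60):  # watch for ~10 minutes
    try:
        status = requests.get(URL, timeout=5).status_code
    except requests.RequestException:
        status = None
    print(time.strftime("%H:%M:%S"), "health:", status)
    time.sleep(10)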

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor) commented:
Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-recovery branch 2 times, most recently from 8ce7735 to f620cab on January 25, 2026 14:44
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-recovery branch from f620cab to 50203d8 on March 5, 2026 06:37
UNIDY2002 added a commit to HanHan009527/sglang that referenced this pull request Mar 9, 2026
Skip EPLB rebalance after rank recovery and directly sync expert weights
to recovered ranks instead of using EPLB, which causes asymmetric P2P
operations due to incorrect old_expert_location_metadata on recovered ranks.

The root cause is that ExpertLocationUpdater relies on both
old_expert_location_metadata and new_expert_location_metadata. For a
recovered rank, the old metadata is stale/incorrect, leading to wrong
P2P operations being calculated (e.g., rank 6 expects to send to rank 7,
but rank 7 doesn't produce a corresponding irecv op).

Instead of modifying expert_location_updater.py, the correct approach
is to skip EPLB after 'EPLB due to rank faults' and let the expert
weights be synced through normal Mooncake EP operation.

See: sgl-project#15771
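
The asymmetry described in this commit message comes down to unpaired point-to-point ops. A toy, runnable check of plan symmetry (the ranks and plans are invented for illustration; this is not the PR's expert-movement code):

# Every planned send (src, dst) needs a matching recv on the peer.
# A recovered rank planning from stale metadata breaks this pairing.
plans = {
    6: [("send", 6, 7)],  # rank 6, using stale metadata, plans a send
    7: [],                # rank 7, using fresh metadata, plans nothing
}

sends = {(s, d) for ops in plans.values() for kind, s, d in ops if kind == "send"}
recvs = {(s, d) for ops in plans.values() for kind, d, s in ops if kind == "recv"}
print("unmatched P2P ops:", sends - recvs)  # non-empty -> distributed hang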
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-recovery branch from 50203d8 to 3cd8a7f on March 10, 2026 03:17
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-recovery branch from 3cd8a7f to c1e0869 on March 17, 2026 06:18
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-recovery branch 2 times, most recently from 4ad0286 to 4fdd9fe on March 28, 2026 08:21
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-recovery branch 3 times, most recently from 8477c94 to 081c992 on April 7, 2026 05:44
@github-actions github-actions Bot added the documentation (Improvements or additions to documentation) label on Apr 7, 2026
@UNIDY2002 UNIDY2002 marked this pull request as ready for review April 7, 2026 06:21

@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-recovery branch from 081c992 to 6b2f08e on April 7, 2026 07:49

@UNIDY2002 (Contributor, Author) commented:
/gemini review

@gemini-code-assist (Bot) left a comment:

Code Review

This pull request introduces support for elastic Expert Parallel (EP) rank recovery and rejoining, primarily for the Mooncake backend. It adds a new --elastic-ep-rejoin CLI argument and implements the necessary logic to synchronize process groups, expert location metadata, and message queues when a rank is relaunched. The ModelRunner is updated to check for failed ranks and trigger recovery during the forward pass. Review feedback highlights critical issues regarding random seed synchronization: rejoining ranks must participate in the seed broadcast to avoid distributed hangs, and the broadcast source should be a verified healthy rank rather than a hardcoded rank 0, which might itself be undergoing recovery.

Comment on lines +489 to +499
        if (
            self.server_args.elastic_ep_backend is not None
            and self.server_args.elastic_ep_rejoin
        ):
            join_process_groups()
            broadcast_global_expert_location_metadata(
                src_rank=self._get_healthy_expert_location_src_rank(
                    invoked_in_elastic_ep_rejoin_path=True
                )
            )
            ElasticEPStateManager.instance().reset()

gemini-code-assist (Bot), severity: critical

The rejoining rank must participate in the random seed broadcast to stay in sync with the healthy ranks. Currently, healthy ranks perform this broadcast in maybe_recover_ep_ranks after a successful recovery, but the rejoining rank does not call it in its initialization path, which will lead to a distributed hang. Additionally, the source rank for the broadcast should be a healthy rank identified via _get_healthy_expert_location_src_rank.

        if (
            self.server_args.elastic_ep_backend is not None
            and self.server_args.elastic_ep_rejoin
        ):
            join_process_groups()
            src_rank = self._get_healthy_expert_location_src_rank(
                invoked_in_elastic_ep_rejoin_path=True
            )
            broadcast_global_expert_location_metadata(src_rank=src_rank)
            # Synchronize random seed with healthy ranks
            self.server_args.random_seed = broadcast_pyobj(
                self.server_args.random_seed,
                get_world_group().rank,
                get_world_group().cpu_group,
                src=get_world_group().ranks.index(src_rank),
            )
            ElasticEPStateManager.instance().reset()

@UNIDY2002 (Author) replied:

False positive. The random-seed broadcast happens after ModelRunner initialization, so no need to execute it here.
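
The hang the bot warns about is a general property of collectives: every rank in a group must enter the broadcast, or the ranks that did enter block forever. A minimal torch.distributed illustration (a single-process gloo group for brevity; this is not SGLang's broadcast_pyobj):

import torch.distributed as dist

# All ranks in the group must call the collective; a rank that skips it
# leaves the others blocked -- the class of hang flagged above.
dist.init_process_group(
    "gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
)

obj = [{"random_seed": 1234}]
dist.broadcast_object_list(obj, src=0)  # src supplies the payload
print(obj[0])

dist.destroy_process_group()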

Comment on lines +1487 to +1499
            broadcast_global_expert_location_metadata(
                src_rank=self._get_healthy_expert_location_src_rank(
                    invoked_in_elastic_ep_rejoin_path=False
                )
            )
            ElasticEPStateManager.instance().reset()

            broadcast_pyobj(
                [self.server_args.random_seed],
                get_world_group().rank,
                get_world_group().cpu_group,
                src=get_world_group().ranks[0],
            )

gemini-code-assist (Bot), severity: high

The source rank for the random seed broadcast is hardcoded to get_world_group().ranks[0]. If rank 0 is the one being recovered, it will broadcast its uninitialized or potentially inconsistent state to the healthy ranks. It should instead use the src_rank (and its corresponding index in the group) identified as healthy. Also, broadcast_pyobj expects the object itself, and its return value should be used to update the local state to ensure consistency across all ranks.

            src_rank = self._get_healthy_expert_location_src_rank(
                invoked_in_elastic_ep_rejoin_path=False
            )
            broadcast_global_expert_location_metadata(src_rank=src_rank)
            ElasticEPStateManager.instance().reset()

            self.server_args.random_seed = broadcast_pyobj(
                self.server_args.random_seed,
                get_world_group().rank,
                get_world_group().cpu_group,
                src=get_world_group().ranks.index(src_rank),
            )

@UNIDY2002 (Author) replied:

False positive.
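
On the bot's indexing point: group-scoped collectives take the source's index within the group's rank list, not its global rank. A tiny pure-Python illustration of the distinction (the ranks are invented for the example):

# Global ranks that make up a hypothetical sub-group.
group_ranks = [4, 5, 6, 7]
healthy_src_rank = 6  # global rank chosen as the broadcast source

group_src = group_ranks.index(healthy_src_rank)
print(group_src)  # 2 -- the value a group-scoped collective expects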

UNIDY2002 commented Apr 17, 2026

This PR mainly brings incremental changes, and the logic is guarded by conditions so that non-elastic-EP paths are not affected. A minor performance issue with elastic EP recovery remains and will be addressed in follow-up PRs.

cc @ShangmingCai

Additional comment threads: python/sglang/srt/model_executor/model_runner.py, python/sglang/srt/distributed/parallel_state.py
@ShangmingCai (Collaborator) left a comment:

Looks good. cc: @yizhang2077 @ch-wan

@ShangmingCai (Collaborator) commented:
/tag-and-rerun-ci

@ShangmingCai (Collaborator) commented:
/rerun-failed-ci

@UNIDY2002 (Contributor, Author) commented:
(Screenshot: CI results, 2026-04-23 09:14:28)

@UNIDY2002 (Contributor, Author) commented:
Wait, we have a new conflict.

@UNIDY2002 (Contributor, Author) commented:
The current CI errors arise from unrelated upstream regressions.

UNIDY2002 and others added 2 commits April 23, 2026 12:45
@ch-wan ch-wan merged commit 9a53ab3 into sgl-project:main Apr 28, 2026
1 check passed
@UNIDY2002 UNIDY2002 deleted the mooncake-pr-recovery branch April 28, 2026 07:46
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026