[6/N] (Elastic EP) Recover failed ranks by UNIDY2002 · Pull Request #15771 · sgl-project/sglang

UNIDY2002 · 2025-12-24T14:54:42Z

Motivation

As a follow-up to #11657, this PR enables SGLang to dynamically add back previously failed processes, recovering the optimal throughput.

The core idea is as follows. When a node fails, the system admin can relaunch the node with an additional flag --elastic-ep-rejoin. Meanwhile, the remaining healthy processes continue serving ongoing inference requests and periodically poll the status of the relaunched process. Once the new processes becomes ready, it is able to seamlessly rejoin the existing process group. With this design, disruption to ongoing inference is minimized.

flowchart TD
    X[Healthy processes] --> Y[Failed process found]
    Y --> A
    A[Normal Inference Iteration] --> B{Is new process ready?}
    B -- No --> C[Run inference normally]
    C --> A
    B -- Yes --> D[Join new process into process group]
    D --> C

    U[New process] -->E
    E[Process Relaunched] --> F[Setup Python modules, CUDA, etc.]
    F --> H[Load weight & capture CUDA graph]
    H --> D

Modifications

Add a new server flag --elastic-ep-rejoin to mark a relaunched elastic EP rank that should rejoin an existing Mooncake process group.
Spread the recovered-rank state throughout distributed initialization so the process groups are created in recovery mode when needed.
Add elastic EP recovery helpers that poll peer readiness and recover all process groups.
Resynchronize expert placement metadata after recovery so recovered ranks observe the same global expert-location state as healthy ranks.
Reset elastic EP active-rank state and the EPLB generator after rank recovery so scheduling resumes from a consistent post-rejoin state.

Accuracy Tests

Manual multi-node recovery test:

Start serving on 2 nodes with elastic EP enabled and enough redundant experts to tolerate one node failure.

Node rank 0:

sglang serve --model-path <model-path> --trust-remote-code --tp 16 --elastic-ep-backend mooncake --mooncake-ib-device <ib-device-list> --moe-a2a-backend mooncake --deepep-mode low_latency --moe-dense-tp-size 1 --enable-dp-lm-head --enable-two-batch-overlap --disable-custom-all-reduce --enable-eplb --ep-num-redundant-experts <num-redundant-experts> --chunked-prefill-size 512 --cuda-graph-max-bs 16 --max-running-requests 512 --mem-fraction-static 0.5 --enable-dp-attention --dp 16 --device cuda --host 127.0.0.1 --dist-init-addr <ip:port> --port 21000 --nnodes 2 --node-rank 0

Node rank 1:

sglang serve --model-path <model-path> --trust-remote-code --tp 16 --elastic-ep-backend mooncake --mooncake-ib-device <ib-device-list> --moe-a2a-backend mooncake --deepep-mode low_latency --moe-dense-tp-size 1 --enable-dp-lm-head --enable-two-batch-overlap --disable-custom-all-reduce --enable-eplb --ep-num-redundant-experts <num-redundant-experts> --chunked-prefill-size 512 --cuda-graph-max-bs 16 --max-running-requests 512 --mem-fraction-static 0.5 --enable-dp-attention --dp 16 --device cuda --host 127.0.0.1 --dist-init-addr <ip:port> --port 21000 --nnodes 2 --node-rank 1

Confirm the service is healthy and serving requests.
Terminate the process on the larger-index node (--node-rank 1).
Relaunch that same node with exactly the same command line, only adding --elastic-ep-rejoin:

sglang serve --model-path <model-path> --trust-remote-code --tp 16 --elastic-ep-backend mooncake --mooncake-ib-device <ib-device-list> --moe-a2a-backend mooncake --deepep-mode low_latency --moe-dense-tp-size 1 --enable-dp-lm-head --enable-two-batch-overlap --disable-custom-all-reduce --enable-eplb --ep-num-redundant-experts <num-redundant-experts> --chunked-prefill-size 512 --cuda-graph-max-bs 16 --max-running-requests 512 --mem-fraction-static 0.5 --enable-dp-attention --dp 16 --device cuda --host 127.0.0.1 --dist-init-addr <ip:port> --port 21000 --nnodes 2 --node-rank 1 --elastic-ep-rejoin

Verify the relaunched rank rejoins successfully and the cluster returns to the healthy state.

Expected result:

The healthy node continues serving while the replacement rank initializes.
After the relaunched rank becomes ready, the process groups recover and the node rejoins successfully.
Logs on healthy ranks should show recovery completing, e.g. recover ranks [...] done.

Benchmarking and Profiling