For both single-node and multi-node deployments, with sglang built from a git-clone source checkout, 0.4.5.post2 got stuck after cloning shared experts to all GPUs and eventually failed with an error. See #5514 (comment).
On multi-node, it failed with the following error:
[2025-04-21 04:08:43 TP5] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.76 GB, mem usage=40.46 GB.
[2025-04-21 04:08:43 TP2] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.76 GB, mem usage=40.46 GB.
[2025-04-21 04:08:43 TP3] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.76 GB, mem usage=40.46 GB.
[2025-04-21 04:08:43 TP0] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.80 GB, mem usage=40.46 GB.
[2025-04-21 04:08:43 TP7] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.77 GB, mem usage=40.46 GB.
[2025-04-21 04:08:43 TP4] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.76 GB, mem usage=40.46 GB.
[2025-04-21 04:08:43 TP1] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.76 GB, mem usage=40.46 GB.
[2025-04-21 04:08:43 TP6] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.76 GB, mem usage=40.46 GB.
[rank1]:[E421 04:13:43.407134717 ProcessGroupGloo.cpp:143] Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[2025-04-21 04:13:43 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/sglang2/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 2056, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/home/ubuntu/miniconda3/envs/sglang2/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 260, in __init__
self.tp_worker = TpWorkerClass(
File "/home/ubuntu/miniconda3/envs/sglang2/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/home/ubuntu/miniconda3/envs/sglang2/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 75, in __init__
self.model_runner = ModelRunner(
File "/home/ubuntu/miniconda3/envs/sglang2/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 182, in __init__
self.initialize(min_per_gpu_memory)
File "/home/ubuntu/miniconda3/envs/sglang2/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 192, in initialize
self.load_model()
File "/home/ubuntu/miniconda3/envs/sglang2/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 491, in load_model
raise ValueError(
ValueError: TP rank 1 could finish the model loading, but there are other ranks that didn't finish loading. It is likely due to unexpected failures (e.g., OOM) or a slow node.
[2025-04-21 04:13:43] Received sigquit from a child process. It usually means the child failed.
(The same ProcessGroupGloo monitoredBarrier error, identical traceback, and "ValueError: TP rank N could finish the model loading, but there are other ranks that didn't finish loading" message were repeated for ranks 2 through 7 (TP2–TP7), each followed by the same "Received sigquit from a child process" line.)
Checklist
Describe the bug
For both single-node and multi-node deployments, with sglang built from a git-clone source checkout, 0.4.5.post2 got stuck after cloning shared experts to all GPUs and eventually failed with an error. See #5514 (comment).
On 0.4.5.post1, the single-node deployment succeeds when SGL_ENABLE_JIT_DEEPGEMM=1 is not set. On multi-node, however, it failed with the error shown above.
Reproduction
Invoking command for single node
Invoking command for multi-nodes
Master node
worker node
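The actual invocation commands are not shown above. As an illustrative sketch only, a typical sglang two-node TP=16 launch for DeepSeek-V3 looks like the following; the model path, dist-init address, and ports are placeholders, not the values actually used in this report:

```shell
# Master node (node rank 0); 10.0.0.1:50000 is a placeholder rendezvous address
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 16 --nnodes 2 --node-rank 0 \
  --dist-init-addr 10.0.0.1:50000 \
  --trust-remote-code --host 0.0.0.0 --port 30000

# Worker node (node rank 1): same command, same --dist-init-addr, only the rank differs
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 16 --nnodes 2 --node-rank 1 \
  --dist-init-addr 10.0.0.1:50000 \
  --trust-remote-code
```

Both nodes must see the same `--dist-init-addr` and `--tp` so that the 16 TP ranks can rendezvous; a mismatch or an unreachable master address produces exactly the kind of barrier timeout shown in the log.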
Environment
Multi-node: two nodes of 8×H100.
Single-node: one node of 8×H200.