For both single-node and multi-node deployments, with sglang built from a git-clone source checkout, 0.4.5.post2 got stuck after cloning shared experts to all GPUs and eventually failed with an error. See #5514 (comment).
On multi-node, it failed with the following error:
[2025-04-21 04:08:43 TP5] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.76 GB, mem usage=40.46 GB.
[2025-04-21 04:08:43 TP2] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.76 GB, mem usage=40.46 GB.
[2025-04-21 04:08:43 TP3] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.76 GB, mem usage=40.46 GB.
[2025-04-21 04:08:43 TP0] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.80 GB, mem usage=40.46 GB.
[2025-04-21 04:08:43 TP7] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.77 GB, mem usage=40.46 GB.
[2025-04-21 04:08:43 TP4] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.76 GB, mem usage=40.46 GB.
[2025-04-21 04:08:43 TP1] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.76 GB, mem usage=40.46 GB.
[2025-04-21 04:08:43 TP6] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=37.76 GB, mem usage=40.46 GB.
[rank1]:[E421 04:13:43.407134717 ProcessGroupGloo.cpp:143] Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[2025-04-21 04:13:43 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/sglang2/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 2056, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/home/ubuntu/miniconda3/envs/sglang2/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 260, in __init__
self.tp_worker = TpWorkerClass(
File "/home/ubuntu/miniconda3/envs/sglang2/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/home/ubuntu/miniconda3/envs/sglang2/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 75, in __init__
self.model_runner = ModelRunner(
File "/home/ubuntu/miniconda3/envs/sglang2/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 182, in __init__
self.initialize(min_per_gpu_memory)
File "/home/ubuntu/miniconda3/envs/sglang2/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 192, in initialize
self.load_model()
File "/home/ubuntu/miniconda3/envs/sglang2/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 491, in load_model
raise ValueError(
ValueError: TP rank 1 could finish the model loading, but there are other ranks that didn't finish loading. It is likely due to unexpected failures (e.g., OOM) or a slow node.
[2025-04-21 04:13:43] Received sigquit from a child process. It usually means the child failed.
(The same ProcessGroupGloo monitoredBarrier error, identical traceback, and "ValueError: TP rank N could finish the model loading, but there are other ranks that didn't finish loading" message were repeated for ranks 2 through 7 (TP2–TP7), each followed by the same "Received sigquit from a child process" line.)
Checklist
Describe the bug
For both single-node and multi-node deployments, with sglang built from a git-clone source checkout, 0.4.5.post2 got stuck after cloning shared experts to all GPUs and eventually failed with an error. See #5514 (comment).
On 0.4.5.post1, the single-node deployment succeeds when SGL_ENABLE_JIT_DEEPGEMM=1 is not set. On multi-node, however, it failed with the error shown above.
Reproduction
Invoking command for single node
Invoking command for multi-nodes
Master node
worker node
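The actual invocation commands are not shown above. As an illustrative sketch only, a typical sglang two-node TP=16 launch for DeepSeek-V3 looks like the following; the model path, dist-init address, and ports are placeholders, not the values actually used in this report:

```shell
# Master node (node rank 0); 10.0.0.1:50000 is a placeholder rendezvous address
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 16 --nnodes 2 --node-rank 0 \
  --dist-init-addr 10.0.0.1:50000 \
  --trust-remote-code --host 0.0.0.0 --port 30000

# Worker node (node rank 1): same command, same --dist-init-addr, only the rank differs
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 16 --nnodes 2 --node-rank 1 \
  --dist-init-addr 10.0.0.1:50000 \
  --trust-remote-code
```

Both nodes must see the same `--dist-init-addr` and `--tp` so that the 16 TP ranks can rendezvous; a mismatch or an unreachable master address produces exactly the kind of barrier timeout shown in the log.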
Environment
Multi-node: two nodes of 8×H100.
Single-node: one node of 8×H200.