
RuntimeError: Detected mismatch between collectives on ranks when launching multi-node TP=16 deployment #5520

@yansiyu550

Description

I'm trying to launch DeepSeek-Coder-V2-Lite-Instruct with sglang.launch_server using tensor parallelism (TP=16) across two nodes. When I run TP=8 on a single node, everything works fine, but when I switch to multi-node TP=16 I hit a collective mismatch error during broadcast().

master node:
GLOO_SOCKET_IFNAME=bond0 NCCL_NET_PLUGIN=nope NCCL_NET_GDR_READ=1 NCCL_DEBUG=TRACE NCCL_SOCKET_IFNAME=bond0 NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1 python3 -m sglang.launch_server --model-path /workspace/data/DeepSeek-Coder-V2-Lite-Instruct --tp 16 --dist-init-addr 10.14.4.99:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --disable-cuda-graph --max-running-requests 256 --mem-fraction-static 0.95
worker node:
GLOO_SOCKET_IFNAME=bond0 NCCL_NET_PLUGIN=nope NCCL_NET_GDR_READ=1 NCCL_DEBUG=TRACE NCCL_SOCKET_IFNAME=bond0 NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1 python3 -m sglang.launch_server --model-path /workspace/data/DeepSeek-Coder-V2-Lite-Instruct --tp 16 --dist-init-addr 10.14.4.99:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --disable-cuda-graph --max-running-requests 256 --mem-fraction-static 0.95

It was running normally at first, but then it just hung (see the attached screenshots).

Then I enabled debug mode with TORCH_DISTRIBUTED_DEBUG=DETAIL and got the following error log:

[2025-04-18 07:15:29 TP4] Scheduler hit an exception: Traceback (most recent call last):
File "/workspace/data/sglang/python/sglang/srt/managers/scheduler.py", line 2057, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/workspace/data/sglang/python/sglang/srt/managers/scheduler.py", line 260, in init
self.tp_worker = TpWorkerClass(
File "/workspace/data/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in init
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/workspace/data/sglang/python/sglang/srt/managers/tp_worker.py", line 129, in init
self.random_seed = broadcast_pyobj(
File "/workspace/data/sglang/python/sglang/srt/utils.py", line 871, in broadcast_pyobj
dist.broadcast(tensor_size, src=src, group=dist_group)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2421, in broadcast
work = group.broadcast([tensor], opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 4 is running collective: CollectiveFingerPrint(SequenceNumber=13, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=41, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).Collectives differ in the following aspects: Sequence number: 13vs 41
[2025-04-18 07:15:29 TP3] Scheduler hit an exception: Traceback (most recent call last):
File "/workspace/data/sglang/python/sglang/srt/managers/scheduler.py", line 2057, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/workspace/data/sglang/python/sglang/srt/managers/scheduler.py", line 260, in init
self.tp_worker = TpWorkerClass(
File "/workspace/data/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in init
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/workspace/data/sglang/python/sglang/srt/managers/tp_worker.py", line 129, in init
self.random_seed = broadcast_pyobj(
File "/workspace/data/sglang/python/sglang/srt/utils.py", line 871, in broadcast_pyobj
dist.broadcast(tensor_size, src=src, group=dist_group)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2421, in broadcast
work = group.broadcast([tensor], opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 3 is running collective: CollectiveFingerPrint(SequenceNumber=13, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=41, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).Collectives differ in the following aspects: Sequence number: 13vs 41
[2025-04-18 07:15:29] Received sigquit from a child process. It usually means the child failed.
Killed
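
For reference, the failing frame is sglang's broadcast_pyobj. Judging only from the traceback (this is not the exact sglang source), it broadcasts a length tensor and then the pickled payload over the CPU/Gloo group, so the mismatch means the two nodes are already out of step on that group before the very first broadcast of the random seed. A minimal sketch of that pattern, with broadcast_pyobj_sketch as a hypothetical stand-in:

```python
# Rough sketch of the broadcast_pyobj pattern seen in the traceback above
# (illustrative only, not the exact sglang implementation): the source rank
# pickles the object, broadcasts its length, then broadcasts the raw bytes;
# the other ranks perform the matching receives.
import pickle

import torch
import torch.distributed as dist


def broadcast_pyobj_sketch(obj, rank, dist_group, src=0):
    if rank == src:
        data = pickle.dumps(obj)
        tensor_size = torch.tensor([len(data)], dtype=torch.long)
        tensor_data = torch.tensor(list(data), dtype=torch.uint8)
        # The error above fires on this first broadcast: rank 0 is already at
        # sequence number 41 on this group while ranks 3/4 are only at 13,
        # i.e. the nodes have issued different numbers of collectives on what
        # they believe is the same process group.
        dist.broadcast(tensor_size, src=src, group=dist_group)
        dist.broadcast(tensor_data, src=src, group=dist_group)
        return obj
    else:
        tensor_size = torch.tensor([0], dtype=torch.long)
        dist.broadcast(tensor_size, src=src, group=dist_group)
        tensor_data = torch.empty(tensor_size.item(), dtype=torch.uint8)
        dist.broadcast(tensor_data, src=src, group=dist_group)
        return pickle.loads(tensor_data.numpy().tobytes())
```

With TORCH_DISTRIBUTED_DEBUG=DETAIL, torch compares these per-group sequence numbers before each collective, which is why the silent hang turns into the explicit "Detected mismatch between collectives" error.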
