I'm trying to launch DeepSeek-Coder-V2-Lite-Instruct with sglang.launch_server using tensor parallelism (TP=16) across two nodes. Running TP=8 on a single node works fine, but when I switch to multi-node TP=16, I get a collective mismatch error during broadcast().
master node:
GLOO_SOCKET_IFNAME=bond0 NCCL_NET_PLUGIN=nope NCCL_NET_GDR_READ=1 NCCL_DEBUG=TRACE NCCL_SOCKET_IFNAME=bond0 NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1 python3 -m sglang.launch_server --model-path /workspace/data/DeepSeek-Coder-V2-Lite-Instruct --tp 16 --dist-init-addr 10.14.4.99:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --disable-cuda-graph --max-running-requests 256 --mem-fraction-static 0.95
worker node:
GLOO_SOCKET_IFNAME=bond0 NCCL_NET_PLUGIN=nope NCCL_NET_GDR_READ=1 NCCL_DEBUG=TRACE NCCL_SOCKET_IFNAME=bond0 NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1 python3 -m sglang.launch_server --model-path /workspace/data/DeepSeek-Coder-V2-Lite-Instruct --tp 16 --dist-init-addr 10.14.4.99:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --disable-cuda-graph --max-running-requests 256 --mem-fraction-static 0.95
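As a sanity baseline, plain inter-node NCCL connectivity can be checked outside of sglang with a minimal broadcast script along these lines (a sketch only; the port 20001, the script name check_dist.py, and the torchrun invocation are placeholders, not part of the sglang launch):

# check_dist.py -- minimal cross-node NCCL broadcast check (my own sketch, not part of sglang).
# Launch one process per GPU on each node, with the same NCCL_*/GLOO_* env vars as above, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0|1> \
#            --master_addr=10.14.4.99 --master_port=20001 check_dist.py
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# torchrun supplies RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT via the environment.
dist.init_process_group(backend="nccl")

# Mirror the failing call: broadcast a single int64 value from rank 0 on CUDA.
t = torch.tensor([dist.get_rank()], dtype=torch.long, device="cuda")
dist.broadcast(t, src=0)
print(f"rank {dist.get_rank()}: broadcast ok, value={t.item()}")

dist.destroy_process_group()

If that script also hangs or errors, the problem is in the network/env setup rather than in sglang itself.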
The sglang launch itself proceeded normally at first, but then it just hung:


Then, I enabled debug mode with TORCH_DISTRIBUTED_DEBUG=DETAIL:
Error Log
[2025-04-18 07:15:29 TP4] Scheduler hit an exception: Traceback (most recent call last):
  File "/workspace/data/sglang/python/sglang/srt/managers/scheduler.py", line 2057, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/workspace/data/sglang/python/sglang/srt/managers/scheduler.py", line 260, in __init__
    self.tp_worker = TpWorkerClass(
  File "/workspace/data/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/workspace/data/sglang/python/sglang/srt/managers/tp_worker.py", line 129, in __init__
    self.random_seed = broadcast_pyobj(
  File "/workspace/data/sglang/python/sglang/srt/utils.py", line 871, in broadcast_pyobj
    dist.broadcast(tensor_size, src=src, group=dist_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2421, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 4 is running collective: CollectiveFingerPrint(SequenceNumber=13, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=41, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).Collectives differ in the following aspects: Sequence number: 13vs 41
[2025-04-18 07:15:29 TP3] Scheduler hit an exception: Traceback (most recent call last):
  File "/workspace/data/sglang/python/sglang/srt/managers/scheduler.py", line 2057, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/workspace/data/sglang/python/sglang/srt/managers/scheduler.py", line 260, in __init__
    self.tp_worker = TpWorkerClass(
  File "/workspace/data/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/workspace/data/sglang/python/sglang/srt/managers/tp_worker.py", line 129, in __init__
    self.random_seed = broadcast_pyobj(
  File "/workspace/data/sglang/python/sglang/srt/utils.py", line 871, in broadcast_pyobj
    dist.broadcast(tensor_size, src=src, group=dist_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2421, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 3 is running collective: CollectiveFingerPrint(SequenceNumber=13, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=41, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).Collectives differ in the following aspects: Sequence number: 13vs 41
[2025-04-18 07:15:29] Received sigquit from a child process. It usually means the child failed.
Killed
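For context on what the error means: it's the TORCH_DISTRIBUTED_DEBUG=DETAIL fingerprint check that fires, not the broadcast itself. A toy example like the one below (my own sketch, independent of sglang; the script name is a placeholder) raises the same RuntimeError by making two ranks disagree about which collective they are running:

# toy_mismatch.py -- toy reproduction of a "Detected mismatch between collectives" error
# (my own sketch, independent of sglang). With TORCH_DISTRIBUTED_DEBUG=DETAIL, torch wraps
# each process group and cross-checks a fingerprint (op type, sequence number, shapes,
# dtypes) before every collective, instead of letting the mismatched ranks hang silently.
# Run with:
#   TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nproc_per_node=2 toy_mismatch.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
t = torch.zeros(1, dtype=torch.long)

if dist.get_rank() == 0:
    dist.all_reduce(t)        # rank 0 issues an ALLREDUCE ...
else:
    dist.broadcast(t, src=0)  # ... while rank 1 issues a BROADCAST -> fingerprint mismatch

dist.destroy_process_group()

In my log the op types match but the sequence numbers differ (13 vs 41), which, if I read it correctly, means rank 0 had already issued many more collectives on that group than ranks 3/4 by the time they reached broadcast_pyobj in the multi-node run.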