I'm trying to launch DeepSeek-Coder-V2-Lite-Instruct with sglang.launch_server using tensor parallelism (TP=16) across two nodes. Running TP=8 on a single node works fine, but when I switch to multi-node TP=16, I get a collective mismatch error during broadcast().
master node:
GLOO_SOCKET_IFNAME=bond0 NCCL_NET_PLUGIN=nope NCCL_NET_GDR_READ=1 NCCL_DEBUG=TRACE NCCL_SOCKET_IFNAME=bond0 NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1 python3 -m sglang.launch_server --model-path /workspace/data/DeepSeek-Coder-V2-Lite-Instruct --tp 16 --dist-init-addr 10.14.4.99:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --disable-cuda-graph --max-running-requests 256 --mem-fraction-static 0.95
worker node:
GLOO_SOCKET_IFNAME=bond0 NCCL_NET_PLUGIN=nope NCCL_NET_GDR_READ=1 NCCL_DEBUG=TRACE NCCL_SOCKET_IFNAME=bond0 NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1 python3 -m sglang.launch_server --model-path /workspace/data/DeepSeek-Coder-V2-Lite-Instruct --tp 16 --dist-init-addr 10.14.4.99:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --disable-cuda-graph --max-running-requests 256 --mem-fraction-static 0.95
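As a sanity baseline, plain inter-node NCCL connectivity can be checked outside of sglang with a minimal broadcast script along these lines (a sketch only; the port 20001, the script name check_dist.py, and the torchrun invocation are placeholders, not part of the sglang launch):

# check_dist.py -- minimal cross-node NCCL broadcast check (my own sketch, not part of sglang).
# Launch one process per GPU on each node, with the same NCCL_*/GLOO_* env vars as above, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0|1> \
#            --master_addr=10.14.4.99 --master_port=20001 check_dist.py
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# torchrun supplies RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT via the environment.
dist.init_process_group(backend="nccl")

# Mirror the failing call: broadcast a single int64 value from rank 0 on CUDA.
t = torch.tensor([dist.get_rank()], dtype=torch.long, device="cuda")
dist.broadcast(t, src=0)
print(f"rank {dist.get_rank()}: broadcast ok, value={t.item()}")

dist.destroy_process_group()

If that script also hangs or errors, the problem is in the network/env setup rather than in sglang itself.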
The sglang launch itself proceeded normally at first, but then it just hung:


Then, I enabled debug mode with TORCH_DISTRIBUTED_DEBUG=DETAIL:
Error Log
[2025-04-18 07:15:29 TP4] Scheduler hit an exception: Traceback (most recent call last):
  File "/workspace/data/sglang/python/sglang/srt/managers/scheduler.py", line 2057, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/workspace/data/sglang/python/sglang/srt/managers/scheduler.py", line 260, in __init__
    self.tp_worker = TpWorkerClass(
  File "/workspace/data/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/workspace/data/sglang/python/sglang/srt/managers/tp_worker.py", line 129, in __init__
    self.random_seed = broadcast_pyobj(
  File "/workspace/data/sglang/python/sglang/srt/utils.py", line 871, in broadcast_pyobj
    dist.broadcast(tensor_size, src=src, group=dist_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2421, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 4 is running collective: CollectiveFingerPrint(SequenceNumber=13, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=41, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).Collectives differ in the following aspects: Sequence number: 13vs 41
[2025-04-18 07:15:29 TP3] Scheduler hit an exception: Traceback (most recent call last):
  File "/workspace/data/sglang/python/sglang/srt/managers/scheduler.py", line 2057, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/workspace/data/sglang/python/sglang/srt/managers/scheduler.py", line 260, in __init__
    self.tp_worker = TpWorkerClass(
  File "/workspace/data/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/workspace/data/sglang/python/sglang/srt/managers/tp_worker.py", line 129, in __init__
    self.random_seed = broadcast_pyobj(
  File "/workspace/data/sglang/python/sglang/srt/utils.py", line 871, in broadcast_pyobj
    dist.broadcast(tensor_size, src=src, group=dist_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2421, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 3 is running collective: CollectiveFingerPrint(SequenceNumber=13, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=41, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).Collectives differ in the following aspects: Sequence number: 13vs 41
[2025-04-18 07:15:29] Received sigquit from a child process. It usually means the child failed.
Killed
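For context on what the error means: it's the TORCH_DISTRIBUTED_DEBUG=DETAIL fingerprint check that fires, not the broadcast itself. A toy example like the one below (my own sketch, independent of sglang; the script name is a placeholder) raises the same RuntimeError by making two ranks disagree about which collective they are running:

# toy_mismatch.py -- toy reproduction of a "Detected mismatch between collectives" error
# (my own sketch, independent of sglang). With TORCH_DISTRIBUTED_DEBUG=DETAIL, torch wraps
# each process group and cross-checks a fingerprint (op type, sequence number, shapes,
# dtypes) before every collective, instead of letting the mismatched ranks hang silently.
# Run with:
#   TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nproc_per_node=2 toy_mismatch.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
t = torch.zeros(1, dtype=torch.long)

if dist.get_rank() == 0:
    dist.all_reduce(t)        # rank 0 issues an ALLREDUCE ...
else:
    dist.broadcast(t, src=0)  # ... while rank 1 issues a BROADCAST -> fingerprint mismatch

dist.destroy_process_group()

In my log the op types match but the sequence numbers differ (13 vs 41), which, if I read it correctly, means rank 0 had already issued many more collectives on that group than ranks 3/4 by the time they reached broadcast_pyobj in the multi-node run.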