Checklist
Describe the bug
If I set --dist-init-addr [00:11:22:33:44:55]:1234, the nodes never finish connecting. It seems that 1 process completes this line, but the other 3 don't. I don't know whether this is an issue in sglang, vllm, or pytorch. I tried both gloo and nccl. I also tried Llama 8B on 2 nodes with 2 H100s each and DeepSeek V3 on 2 nodes with 8 H100s each. In each case I waited ten minutes for a connection.
I ran the same code with aphrodite v0.6.5 with llama and it worked. I'm not sure what the key difference is.
Perhaps this is related? pytorch/pytorch#52040
Reproduction
You can see the full exact thing I ran here. https://gist.github.com/qpwo/f2e5d2a99e775f7ac54e05c6254191ae
Modal only offers IPv6 connections for clusters.
This is the main part:
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B \
--tp-size {tp} --nnodes {num_nodes} --node-rank {node_rank} \
--port 2242 --dist-init-addr [{main_addr}]:1234 \
--disable-cuda-graph --trust-remote-code
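One thing worth ruling out with a bracketed address like the one above is how the host:port string gets split. I have not checked how sglang, vllm, or pytorch parse it, so this is only a sketch of the general pitfall: a naive split on ":" breaks an IPv6 literal apart, while a bracket-aware split does not. The helper name split_host_port and the address fd00::1 are made up for illustration.

```python
import ipaddress


def split_host_port(addr: str) -> tuple[str, int]:
    """Split 'host:port' or '[ipv6]:port' into (host, port).

    Hypothetical helper for illustration; not taken from sglang.
    """
    if addr.startswith("["):
        # Bracketed IPv6 literal: everything up to "]:" is the host.
        host, sep, port = addr[1:].partition("]:")
        if not sep:
            raise ValueError(f"expected '[ipv6]:port', got {addr!r}")
        ipaddress.IPv6Address(host)  # raises ValueError if not valid IPv6
        return host, int(port)
    # Hostname or IPv4: the port follows the last colon.
    host, sep, port = addr.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected 'host:port', got {addr!r}")
    return host, int(port)


# A naive split on ":" tears the IPv6 literal into pieces.
naive_parts = "[fd00::1]:1234".split(":")
```

If the init address goes through something like the naive split anywhere along the way, the non-zero ranks could end up trying to reach a wrong host or port, which would look like a hang.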
Environment
Output of python3 -m sglang.check_env:
Python: 3.12.6 (main, Sep 27 2024, 06:10:12) [GCC 12.2.0]
CUDA available: True
GPU 0,1: NVIDIA H100 80GB HBM3
GPU 0,1 Compute Capability: 9.0
CUDA_HOME: None
PyTorch: 2.5.1+cu124
sglang: 0.4.1.post5
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.0
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.10.8
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.1
orjson: 3.10.14
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.59.7
anthropic: 0.43.0
decord: 0.6.0
Hypervisor vendor: KVM
ulimit soft: 65536