[Bug] ipv6 dist_init_addr doesn't connect when running multi-node inference #2892

@qpwo

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

If I set --dist-init-addr [00:11:22:33:44:55]:1234 (an IPv6 address in brackets), the nodes never finish connecting. It seems that 1 process completes this line, but the other 3 don't. I don't know whether the issue is in sglang, vllm, or pytorch. I tried both the gloo and nccl backends, and both Llama 8B on 2x2 H100s (2 nodes with 2 GPUs each) and DeepSeek V3 on 2x8 H100s. In each case I waited ten minutes for a connection.

I ran the same code with aphrodite v0.6.5 with llama and it worked. I'm not sure what the key difference is.

Perhaps this is related? pytorch/pytorch#52040
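
One possible failure mode, not confirmed for sglang or pytorch, is that some code path splits the `host:port` string on every colon, which cannot work for an IPv6 literal. The sketch below contrasts that with a bracket-aware split. The function names `split_naive` and `split_host_port` are hypothetical and exist only for illustration, and `2001:db8::1` is a documentation-range placeholder address.

```python
def split_naive(addr: str) -> list[str]:
    # Splitting on every ":" shreds an IPv6 literal, since it contains colons itself.
    return addr.split(":")


def split_host_port(addr: str) -> tuple[str, int]:
    """Split 'host:port' or '[ipv6]:port' into (host, port)."""
    if addr.startswith("["):
        # Bracketed form: the host is everything between "[" and "]".
        end = addr.find("]")
        if end == -1 or addr[end + 1 : end + 2] != ":":
            raise ValueError(f"malformed bracketed address: {addr!r}")
        return addr[1:end], int(addr[end + 2 :])
    # Unbracketed form: the port follows the last colon.
    host, sep, port = addr.rpartition(":")
    if not sep:
        raise ValueError(f"missing port in address: {addr!r}")
    return host, int(port)


if __name__ == "__main__":
    print(split_naive("[2001:db8::1]:1234"))
    print(split_host_port("[2001:db8::1]:1234"))
```

If the naive form is in use anywhere between the command line and the rendezvous, the host and port handed to the backend would be wrong on every rank, which might look like a hang rather than an error.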

Reproduction

The full script I ran is here: https://gist.github.com/qpwo/f2e5d2a99e775f7ac54e05c6254191ae
Modal only offers IPv6 connections for clusters.

This is the main part:

        python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B \
            --tp-size {tp}  --nnodes {num_nodes} --node-rank {node_rank} \
            --port 2242 --dist-init-addr [{main_addr}]:1234 \
            --disable-cuda-graph --trust-remote-code
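
To separate plain network reachability from a framework problem, it may help to check from a worker node whether the head node's port accepts a TCP connection at all while rank 0 is listening. This is a minimal sketch using only the Python standard library. The helper name `can_connect` is hypothetical, and the check does not reproduce the rendezvous that sglang or pytorch actually performs.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host with getaddrinfo, so it accepts
        # IPv4 and IPv6 literals alike (pass the IPv6 address without brackets).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False


if __name__ == "__main__":
    import sys

    # Usage: python3 check_port.py <host> <port>
    print(can_connect(sys.argv[1], int(sys.argv[2])))
```

If this returns True from every worker but the launch still hangs, the address is reachable and the problem is more likely in how the address string is handled further up the stack.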

Environment

Output of python3 -m sglang.check_env:

Python: 3.12.6 (main, Sep 27 2024, 06:10:12) [GCC 12.2.0]
CUDA available: True
GPU 0,1: NVIDIA H100 80GB HBM3
GPU 0,1 Compute Capability: 9.0
CUDA_HOME: None
PyTorch: 2.5.1+cu124
sglang: 0.4.1.post5
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.0
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.10.8
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.1
orjson: 3.10.14
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.59.7
anthropic: 0.43.0
decord: 0.6.0
Hypervisor vendor: KVM
ulimit soft: 65536
