[Bug] ipv6 dist_init_addr doesn't connect when running multi-node inference

### Checklist

- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [X] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [X] 5. Please use English, otherwise it will be closed.

### Describe the bug

If I set `--dist-init-addr [00:11:22:33:44:55]:1234`, the nodes never finish connecting. It seems that one process gets past [this line](https://github.com/sgl-project/sglang/blob/f005758f2bcf367739a5a71a90b91d18b56aa4cd/python/sglang/srt/model_executor/model_runner.py#L237), but the other three don't. I don't know whether this is an issue in SGLang, vLLM, or PyTorch. I tried both the Gloo and NCCL backends, and two setups: Llama 8B on 2 nodes × 2 H100s and DeepSeek V3 on 2 nodes × 8 H100s. In each case I waited ten minutes for the connection.

I ran the same code with Aphrodite v0.6.5 and Llama, and it worked. I'm not sure what the key difference is.

Perhaps this is related? https://github.com/pytorch/pytorch/issues/52040
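To rule out a malformed address string, here is a minimal sketch of how a bracketed `[ipv6]:port` value can be split into host and port with the standard library. The helper name `split_host_port` is hypothetical, and I have not checked how SGLang or PyTorch parse `--dist-init-addr` internally, so this only shows the expected format and why a naive `split(":")` is not enough. The address `2001:db8::1` is a documentation placeholder.

```python
from urllib.parse import urlsplit


def split_host_port(addr: str) -> tuple[str, int]:
    """Split 'host:port' or '[ipv6]:port' into (host, port).

    Hypothetical helper for illustration; not part of SGLang or PyTorch.
    """
    # urlsplit understands bracketed IPv6 literals once a scheme is present.
    parts = urlsplit(f"tcp://{addr}")
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"expected host:port or [ipv6]:port, got {addr!r}")
    return parts.hostname, parts.port


# A naive split on ':' breaks the IPv6 literal into several pieces.
naive_pieces = "[2001:db8::1]:1234".split(":")

host, port = split_host_port("[2001:db8::1]:1234")
print(host, port)
print(naive_pieces)
```

If any code on the connection path splits the address on `:` and takes the first and last pieces, the host it ends up with would no longer be a valid address. That is only a guess at one possible cause.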

### Reproduction

The full script I ran is here: https://gist.github.com/qpwo/f2e5d2a99e775f7ac54e05c6254191ae

Modal only offers IPv6 connections for clusters.

This is the main part:

```
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B \
    --tp-size {tp} --nnodes {num_nodes} --node-rank {node_rank} \
    --port 2242 --dist-init-addr [{main_addr}]:1234 \
    --disable-cuda-graph --trust-remote-code
```
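To help narrow down whether the hang comes from SGLang or from the PyTorch rendezvous itself, a standalone check along these lines could be run on each node with the same address and port. This is a sketch, not something I have run on the cluster. The function names `build_init_method` and `check_rendezvous` are hypothetical, and it assumes the Gloo backend and a 60-second timeout. Only `torch.distributed.init_process_group`, `barrier`, and `destroy_process_group` are real PyTorch calls.

```python
import ipaddress
from datetime import timedelta


def build_init_method(host: str, port: int) -> str:
    """Return a tcp:// rendezvous URL, bracketing IPv6 literals."""
    try:
        is_v6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        is_v6 = False  # not an IP literal, e.g. a hostname
    return f"tcp://[{host}]:{port}" if is_v6 else f"tcp://{host}:{port}"


def check_rendezvous(host: str, port: int, rank: int, world_size: int) -> None:
    """Join a Gloo process group and run one barrier, with no SGLang involved."""
    import torch.distributed as dist  # imported lazily; requires PyTorch

    dist.init_process_group(
        backend="gloo",
        init_method=build_init_method(host, port),
        rank=rank,
        world_size=world_size,
        timeout=timedelta(seconds=60),
    )
    dist.barrier()  # returns only once every rank has joined
    print(f"rank {rank}/{world_size} connected")
    dist.destroy_process_group()
```

For example, calling `check_rendezvous(main_addr, 1234, rank=node_rank, world_size=num_nodes)` once per node would exercise the same address and port as the command above. If this also hangs, the problem is probably below SGLang. If it succeeds, the difference is more likely in how SGLang or vLLM sets up its process groups.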

### Environment

Output of `python3 -m sglang.check_env`:
```
Python: 3.12.6 (main, Sep 27 2024, 06:10:12) [GCC 12.2.0]
CUDA available: True
GPU 0,1: NVIDIA H100 80GB HBM3
GPU 0,1 Compute Capability: 9.0
CUDA_HOME: None
PyTorch: 2.5.1+cu124
sglang: 0.4.1.post5
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.0
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.10.8
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.1
orjson: 3.10.14
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.59.7
anthropic: 0.43.0
decord: 0.6.0
Hypervisor vendor: KVM
ulimit soft: 65536
```
