🐛 Bug
To Reproduce
Steps to reproduce the behavior:
- Set up 2 machines with dual-stack (IPv4 and IPv6) interfaces
- On the first machine, run the following script:
import torch.distributed as dist
from datetime import timedelta
import socket

port = 12345
# IPv4 case
ip = "127.0.0.1"
# IPv6 case (overrides the assignment above; comment it out to test IPv4)
ip = "::"
try:
    # Arguments: host, port, world_size, is_master, timeout
    server_store = dist.TCPStore(ip, port, 2, True, timedelta(seconds=30))
    print("initialized")
    server_store.set("first_key", "first_value")
    print("stored")
except RuntimeError as e:
    print(e)
- On the second machine, run the following script:
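import torch.distributed as dist
from datetime import timedelta
import socket

port = 12345
# IPv4 case: IPv4 address of the first machine
ip = "172.0.0.1"
# IPv6 case: placeholder for the first machine's IPv6 address
# (overrides the assignment above; comment it out to test IPv4)
ip = "xxxx:yyyy:zzzz::"
# Arguments: host, port, world_size, is_master, timeout
client_store = dist.TCPStore(ip, port, 2, False, timedelta(seconds=30))
print("initialized")
print(client_store.get("first_key"))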
- Observe that this code works for IPv4 (127.0.0.1 on the first machine, the first machine's IPv4 address on the second) but does not work when an IPv6 address is used on the master node. The second (subordinate) node uses IPv6 correctly.
- Proposed fix:
In https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/Utils.cpp#L90
the ::getaddrinfo call passes nullptr as its first argument. Instead, the master address should probably be passed down from the outer call stack. We verified that the code works if we replace nullptr with "::".
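To see what the resolver returns in each case, here is a minimal Python sketch that calls getaddrinfo with and without a host. Passing None as the host corresponds to passing nullptr in C++. The helper name passive_candidates is hypothetical, and the hints used here (AF_UNSPEC, SOCK_STREAM, AI_PASSIVE) are an assumption, since this report does not show the hints c10d actually passes.

```python
import socket


def passive_candidates(host, port):
    """Return (family_name, sockaddr) pairs that a listener could bind to.

    host=None mirrors passing nullptr as the first argument of ::getaddrinfo.
    The hints below are assumptions, not necessarily what c10d uses.
    """
    infos = socket.getaddrinfo(
        host,
        port,
        family=socket.AF_UNSPEC,   # allow both IPv4 and IPv6 results
        type=socket.SOCK_STREAM,   # TCP
        flags=socket.AI_PASSIVE,   # addresses suitable for bind()
    )
    return [
        (socket.AddressFamily(family).name, sockaddr)
        for family, _type, _proto, _canonname, sockaddr in infos
    ]


if __name__ == "__main__":
    # Compare the candidate list for no host, the IPv6 wildcard, and IPv4 loopback.
    for host in (None, "::", "127.0.0.1"):
        try:
            print(repr(host), passive_candidates(host, 12345))
        except socket.gaierror as err:
            print(repr(host), "resolution failed:", err)
```

Running this on the master node shows which address families are returned, and in what order, when no host is given. This may help confirm whether the listener ends up on an IPv4-only socket.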
Expected behavior
TCPStore should work with both IPv4 and IPv6 addresses the same way.
Environment
Network interface with both IPv4 and IPv6 addresses.
- PyTorch Version (e.g., 1.0): 1.7.1
- OS (e.g., Linux): Linux
- How you installed PyTorch (conda, pip, source): conda
- Build command you used (if compiling from source):
- Python version: 3.7.3
- CUDA/cuDNN version: 11.2, 8.0
- GPU models and configuration: V100
- Any other relevant information:
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu