torch.distributed.TCPStore doesn't work with dual IPv4/IPv6 network interface #52040

@andrei-pokrovsky

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. Set up two machines with dual IPv4/IPv6 interfaces.
  2. On the first machine, run the following script:
import torch.distributed as dist
from datetime import timedelta
import socket

port = 12345
ip = "127.0.0.1"  # IPv4 case: works
ip = "::"  # IPv6 case: fails (this assignment overrides the line above)
try:
    server_store = dist.TCPStore(ip, port, 2, True, timedelta(seconds=30))
    print("initialized")
    server_store.set("first_key", "first_value")
    print("stored")
except RuntimeError as e:
    print(e)

  3. On the second machine, run the following script:
import torch.distributed as dist
from datetime import timedelta
import socket

port = 12345
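ip = "172.0.0.1"  # IPv4 case: placeholder for the first machine's IPv4 address
ip = "xxxx:yyyy:zzzz::"  # IPv6 case: placeholder for the first machine's IPv6 address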
client_store = dist.TCPStore(ip, port, 2, False, timedelta(seconds=30))
print("initialized")
print(client_store.get("first_key"))
  4. Observe that this code works for IPv4 (127.0.0.1 on the first machine, the first machine's IPv4 address on the second) but fails when an IPv6 address is used on the master node. The second (subordinate) node handles IPv6 correctly.
  5. Proposed fix:
    In https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/Utils.cpp#L90
    the ::getaddrinfo call receives nullptr as its first argument; masteraddr should probably be passed down from the outer call stack instead. We verified that replacing nullptr with "::" makes the code work.
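The effect of the host argument can be inspected without PyTorch. The sketch below uses Python's socket.getaddrinfo, which wraps the same C call, to compare the candidates offered for a null host with those offered for "::". It assumes the listening code requests passive (wildcard) addresses for any address family, which this report has not verified beyond the linked line; the helper name passive_candidates is hypothetical.

```python
import socket


def passive_candidates(host, port):
    """Return (family, address) pairs getaddrinfo offers for a listening socket.

    host=None corresponds to passing nullptr as the first argument in C.
    """
    infos = socket.getaddrinfo(
        host,
        port,
        family=socket.AF_UNSPEC,   # let the resolver choose IPv4 and/or IPv6
        type=socket.SOCK_STREAM,   # TCP only
        flags=socket.AI_PASSIVE,   # addresses suitable for bind()
    )
    return [(family, sockaddr[0]) for family, _, _, _, sockaddr in infos]


if __name__ == "__main__":
    # Null host: the resolver picks the wildcard address(es) and their order.
    print(passive_candidates(None, 12345))
    # Explicit "::": only the IPv6 wildcard is offered.
    print(passive_candidates("::", 12345))
```

The order and contents of the null-host result depend on the platform. If the IPv4 wildcard comes first and the server binds to the first usable candidate, the listener would be IPv4-only, which would be consistent with the behavior described above.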

Expected behavior

TCPStore should work with both IPv4 and IPv6 addresses the same way.
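One way to check this expectation from outside PyTorch is to probe the server port over each address family while the server store from the first script is still waiting for its client. The following is a minimal sketch using only the standard library; the helper name reachable is hypothetical, and the port 12345 and loopback addresses are taken from or assumed for the reproduction above.

```python
import socket


def reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


if __name__ == "__main__":
    port = 12345  # port used by the scripts in this report
    # Run on the machine hosting the server store: probe both loopbacks.
    for host in ("127.0.0.1", "::1"):
        print(host, reachable(host, port))
```

If the IPv4 probe succeeds while the IPv6 probe fails, the server socket is likely bound to an IPv4-only address.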

Environment

Network interface with both an IPv4 and an IPv6 address.

  • PyTorch Version (e.g., 1.0): 1.7.1
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): conda
  • Build command you used (if compiling from source):
  • Python version: 3.7.3
  • CUDA/cuDNN version: 11.2, 8.0
  • GPU models and configuration: V100
  • Any other relevant information:

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu

Labels

oncall: distributed, triaged
