Description
🐛 Describe the bug
The torch 2.8 nightly raises an error when attempting to initialize distributed training:
import sys
import os
import torch.distributed as dist
from random import randint
import torch
os.environ["USE_LIBUV"] = "0" if sys.platform == "win32" else "1"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = str(randint(20000, 55555))
device = torch.device("cuda")
n_gpus = 1
rank = 0
dist.init_process_group(
    backend="gloo" if sys.platform == "win32" or device.type != "cuda" else "nccl",
    init_method="env://",
    world_size=n_gpus if device.type == "cuda" else 1,
    rank=rank if device.type == "cuda" else 0,
)
print("done")

Traceback (most recent call last):
  File "T:\test.py", line 16, in <module>
    dist.init_process_group(
  File "X:\torch\venv\Lib\site-packages\torch\distributed\c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "X:\torch\venv\Lib\site-packages\torch\distributed\c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "X:\torch\venv\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1724, in init_process_group
    default_pg, _ = _new_process_group_helper(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "X:\torch\venv\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1949, in _new_process_group_helper
    backend_class = ProcessGroupGloo(
                    ^^^^^^^^^^^^^^^^^
RuntimeError: makeDeviceForHostname(): unsupported gloo device
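Since the traceback shows the failure inside the `ProcessGroupGloo(...)` constructor, one way to narrow it down is to bypass the `env://` TCPStore rendezvous entirely and hand `init_process_group` a `FileStore` instead. This is only a diagnostic sketch, not a fix: if the same `makeDeviceForHostname(): unsupported gloo device` error still appears here, the problem is in gloo device creation itself rather than in the master-address/port setup.

```python
import os
import tempfile
import torch.distributed as dist

# Diagnostic sketch: single-rank gloo init via a FileStore instead of the
# env:// rendezvous.  The store file path below is arbitrary.
store_path = os.path.join(tempfile.mkdtemp(), "pg_store")
store = dist.FileStore(store_path, 1)  # second argument is world_size

# If this raises the same RuntimeError, TCPStore/env setup is not the cause.
dist.init_process_group(backend="gloo", store=store, world_size=1, rank=0)
print("gloo init via FileStore succeeded")
dist.destroy_process_group()
```

On a working build this prints the success message; on the affected nightly it should reproduce (or rule out) the gloo device error without any `MASTER_ADDR`/`MASTER_PORT` involvement.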
Versions
Installed torch versions:
torch-2.8.0.dev20250327+cu128 torchaudio-2.6.0.dev20250331+cu128 torchvision-0.22.0.dev20250331+cu128
cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @kwen2501 @c-p-i-o