
Enable GLOO backend for torch.distributed on Windows #5694

Open
harkgill-amd wants to merge 5 commits into main from users/harkgill/enable-gloo-windows

harkgill-amd commented Jun 8, 2026

Contributor

Motivation

Resolves #3284

Our Windows PyTorch wheels have been shipping without torch.distributed support, causing third-party apps to fail with errors such as:

  File "D:\languages\python\Lib\site-packages\torch\distributed\distributed_c10d.py", line 23, in <module>
    from torch._C._distributed_c10d import (
ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package

These issues are seen on single-GPU systems where no actual distributed work is being done. We've been mitigating them by patching the upstream libraries to fall back when torch.distributed isn't available, but the long-term solution is to enable the GLOO backend.
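
For context, the kind of fallback those downstream patches add could look roughly like the following. This is a minimal sketch and not code from any of the patched libraries; `safe_world_size` is a hypothetical helper name, and only `is_available()`, `is_initialized()` and `get_world_size()` are real torch.distributed calls.

```python
def safe_world_size() -> int:
    """Return the distributed world size, or 1 when distributed is unusable."""
    try:
        import torch.distributed as dist
    except ImportError:
        # torch is missing entirely or was built without this module.
        return 1
    # Short-circuit: is_initialized() is only queried when distributed
    # support is actually compiled into the wheel.
    if not dist.is_available() or not dist.is_initialized():
        return 1
    return dist.get_world_size()


if __name__ == "__main__":
    print("world size:", safe_world_size())
```

With GLOO enabled in the wheels, guards like this should no longer be needed in third-party code on single-GPU Windows systems.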

Technical Details

In build_windows_pytorch_wheels.yml

  • Add vcpkg libuv install step and set libuv_ROOT so PyTorch's CMake finds it

In external-builds/pytorch/build_prod_wheels.py

  • Set USE_GLOO=ON unconditionally (was Linux-only)
  • Add copy_libuv_to_torch_lib() to bundle uv.dll into torch/lib/, following the same pattern as copy_msvc_libomp_to_torch_lib()
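
A rough sketch of what the two build_prod_wheels.py changes amount to is below. The function signature, the `bin/uv.dll` location under the libuv root (assumed from a typical vcpkg layout), and the `gloo_build_env` helper are all assumptions for illustration; the actual implementation in this PR may differ.

```python
import shutil
from pathlib import Path


def copy_libuv_to_torch_lib(libuv_root: Path, torch_lib_dir: Path) -> Path:
    """Copy uv.dll next to torch's own DLLs so it ships inside the wheel."""
    # Assumption: the runtime DLL lives under <libuv_root>/bin.
    src = libuv_root / "bin" / "uv.dll"
    if not src.is_file():
        raise FileNotFoundError(f"uv.dll not found at {src}")
    torch_lib_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, torch_lib_dir / "uv.dll"))


def gloo_build_env(libuv_root: Path) -> dict[str, str]:
    """Environment entries for the build (hypothetical helper)."""
    return {
        "USE_GLOO": "ON",  # previously set on Linux only
        "libuv_ROOT": str(libuv_root),  # lets PyTorch's CMake locate libuv
    }
```

Bundling the DLL into torch/lib/ mirrors what the description says copy_msvc_libomp_to_torch_lib() already does for the OpenMP runtime.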

Test Plan

  • Run Build Windows Pytorch Wheels workflow to generate wheels
  • Test that both the GLOO backend and torch.distributed are available.

Test Result

Using the wheels built from https://github.com/ROCm/TheRock/actions/runs/27149444602 on gfx1151:

python -c "import torch.distributed; print('torch.distributed available:', torch.distributed.is_available()); print('GLOO available:', torch.distributed.is_gloo_available())"
torch.distributed available: True
GLOO available: True
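
The same check can be written as a small script that also behaves sensibly on wheels without distributed support. A sketch only; `distributed_status` is a hypothetical name, and it returns None when torch is not installed at all.

```python
import importlib.util


def distributed_status():
    """Report whether torch.distributed and its GLOO backend are usable."""
    if importlib.util.find_spec("torch") is None:
        return None  # torch is not installed in this environment
    import torch.distributed as dist

    available = dist.is_available()
    return {
        "distributed": available,
        # Short-circuit so is_gloo_available() is only called when
        # distributed support is compiled in.
        "gloo": available and dist.is_gloo_available(),
    }


if __name__ == "__main__":
    print(distributed_status())
```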

astrelsky commented Jun 8, 2026

Contributor

You're going to hit a link error with pytorch nightly that it can't find gloo_hip.lib even though gloo was built successfully and is at build/lib/gloo.lib. I never tracked down the change that caused it and instead took the lazy approach of copying it to build/gloo_hip.lib.

harkgill-amd commented Jun 8, 2026

Contributor Author

> You're going to hit a link error with pytorch nightly that it can't find gloo_hip.lib even though gloo was built successfully and is at build/lib/gloo.lib. I never tracked down the change that caused it and instead took the lazy approach of copying it to build/gloo_hip.lib.

Thanks for the heads up. I've kicked off https://github.com/ROCm/TheRock/actions/runs/27155403450 which'll test this change against the nightly branch - let's see if it hits the same gloo_hip.lib error and go from there.

EDIT: Ya, ended up hitting the error you mentioned. Looks like there was a change in how the GLOO setup is done between release/2.11 and nightly, so we end up hitting the append here: https://github.com/pytorch/pytorch/blob/nightly/cmake/Dependencies.cmake#L1339. A simple guard to skip this for Windows should be sufficient, as we're ok with a CPU-only gloo.lib.

harkgill-amd commented Jun 8, 2026

Contributor Author

pytorch/pytorch#186650 will resolve the build failure with the nightly branch.

Updated Build Windows PyTorch Wheels workflow to allow testing against my pytorch fork with the above change - build is passing now that we're guarding against gloo_hip on Windows. https://github.com/ROCm/TheRock/actions/runs/27163904722/job/80186050020

@astrelsky

Contributor

> pytorch/pytorch#186650 will resolve the build failure with the nightly branch.
>
> Updated Build Windows PyTorch Wheels workflow to allow testing against my pytorch fork with the above change - build is passing now that we're guarding against gloo_hip on Windows. https://github.com/ROCm/TheRock/actions/runs/27163904722/job/80186050020

Huh, that doesn't seem right but on closer inspection it appears there is a gloo and gloo_hip? Any idea what the difference is? I'm wondering why my builds didn't fail due to symbol collisions or other weird stuff if they're supposed to be different.

@harkgill-amd

Contributor Author

Ok so with release/2.11, we were actually getting a working gloo_hip.lib, so even though gloo.lib is enough to resolve the torch.distributed errors, there shouldn't be any reason why we can't get the former working with nightly.

The difference is that older releases (2.11, etc.) rely on FindHIP.cmake, which works correctly with how the gloo submodule is currently set up to locate HIP.

nightly strips away FindHIP.cmake in favour of native CMake HIP language support, but the gloo submodule is lagging behind at 3135b0b (still FindHIP.cmake dependent). bcd1672 updates gloo's CMake to align with what nightly expects, so bumping the upstream submodule to include this commit should be all we need.

Comment thread .github/workflows/build_windows_pytorch_wheels.yml Outdated
rmiao pushed a commit to rmiao/pytorch that referenced this pull request Jun 9, 2026
Needed for ROCm/TheRock#5694.

- PyTorch's LoadHIP.cmake migrated to native CMake HIP language support (enable_language(HIP) + find_package(hip CONFIG)) w/ pytorch@d921fd0

- The gloo submodule is still pinned at 3135b0b which depends on FindHIP.cmake. This leads to `lld-link: error: could not open 'gloo_hip.lib': no such file or directory` errors on Windows after gloo fails to locate HIP.

- Bumping the gloo submodule to [bcd1672 ("ROCm: Migrate to native CMake HIP support")](pytorch/gloo@bcd1672) aligns gloo with the current native CMake HIP direction.

**Testing**

- Current nightly branch (gloo 3135b0b) -> gloo fails to find HIP leading to missing gloo_hip.lib Build Windows PyTorch Wheels (dev, 3.12, gfx1151) · ROCm/TheRock@2c48fe4

```
  Could not find a configuration file for package "HIP" that is compatible
  with requested version "1.0".
  The following configuration files were considered but not accepted:
      hip-config.cmake, version: 7.14.60850
        The version found is not compatible with the version requested.
  lld-link: error: could not open 'gloo_hip.lib': no such file or directory
 ```

- nightly branch + bump (gloo bcd1672) -> gloo finds HIP and gloo_hip.lib is successfully linked Build Windows PyTorch Wheels (dev, 3.12, gfx1151) · ROCm/TheRock@d731f33

```
  -- Found HIP: 7.14.60850
  -- GLOO_ROCM_ARCH: gfx1151
  [2218/2880] Linking HIP static library lib\gloo_hip.lib
  ```

Pull Request resolved: pytorch#186787
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007
Comment thread external-builds/pytorch/build_prod_wheels.py
Comment thread .github/workflows/build_windows_pytorch_wheels.yml
@harkgill-amd

Contributor Author

One last piece before this can go in: release/2.9 and release/2.10 have gloo pinned to the commit just before the one that introduced Windows ROCm support (1b4337a), which is currently causing failures. The possible options are to:

  • Bump the gloo commit on both PyTorch branches to 1b4337a (just one commit up)
  • Disable GLOO for 2.9 and 2.10 with a guard in build_prod_wheels.py

I'm testing out the former to see if builds pass with just the submodule bump and will check in with the PyTorch team.
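
If the second option ends up being needed, the guard in build_prod_wheels.py could be as small as the sketch below. The names `GLOO_UNSUPPORTED_ON_WINDOWS`, `parse_major_minor` and `use_gloo_value` are hypothetical, and the way the script actually obtains the PyTorch version and platform is assumed rather than taken from the repository.

```python
import re

# Assumption: release lines whose pinned gloo commit predates
# Windows ROCm support (1b4337a).
GLOO_UNSUPPORTED_ON_WINDOWS = {(2, 9), (2, 10)}


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings like '2.10.0a0+gitabc123'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized PyTorch version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def use_gloo_value(pytorch_version: str, is_windows: bool) -> str:
    """Return the value to export as USE_GLOO for this build."""
    unsupported = parse_major_minor(pytorch_version) in GLOO_UNSUPPORTED_ON_WINDOWS
    return "OFF" if is_windows and unsupported else "ON"
```

Keying the guard on (major, minor) keeps it independent of patch and local version suffixes, and it can simply be deleted once the release branches carry the submodule bump.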

@ScottTodd

Member

SGTM. I'd prefer to add support uniformly across all supported versions (so yes, cherry-pick what is needed onto our release branches if otherwise low risk). Short of that, we can add the workaround to our build script as needed.

jemitche1 pushed a commit to jemitche1/pytorch that referenced this pull request Jun 13, 2026

Development

Successfully merging this pull request may close these issues.

[Feature]: Add libuv for gloo and torch.distributed on windows
