Enable GLOO backend for torch.distributed on Windows #5694
You're going to hit a link error with pytorch nightly that it can't find |
Thanks for the heads up. I've kicked off https://github.com/ROCm/TheRock/actions/runs/27155403450, which'll test this change against the nightly branch. Let's see if it hits the same error.
EDIT: Ya, ended up hitting the error you mentioned. Looks like there was a change in how the GLOO setup is done between |
|
pytorch/pytorch#186650 will resolve the build failure with the nightly branch. I updated the Build Windows PyTorch Wheels workflow to allow testing against my pytorch fork with the above change, and the build is passing now that we're guarding against gloo_hip on Windows: https://github.com/ROCm/TheRock/actions/runs/27163904722/job/80186050020 |
Huh, that doesn't seem right but on closer inspection it appears there is a gloo and gloo_hip? Any idea what the difference is? I'm wondering why my builds didn't fail due to symbol collisions or other weird stuff if they're supposed to be different. |
|
Ok so with The difference is that older releases (2.11.etc) rely on
|
Needed for ROCm/TheRock#5694.

- PyTorch's LoadHIP.cmake migrated to native CMake HIP language support (enable_language(HIP) + find_package(hip CONFIG)) w/ pytorch@d921fd0
- The gloo submodule is still pinned at 3135b0b which depends on FindHIP.cmake. This leads to `lld-link: error: could not open 'gloo_hip.lib': no such file or directory` errors on Windows after gloo fails to locate HIP.
- Bumping the gloo submodule to [bcd1672 ("ROCm: Migrate to native CMake HIP support")](pytorch/gloo@bcd1672) aligns gloo with the current native CMake HIP direction.

**Testing**

- Current nightly branch (gloo 3135b0b) -> gloo fails to find HIP leading to missing gloo_hip.lib
  Build Windows PyTorch Wheels (dev, 3.12, gfx1151) · ROCm/TheRock@2c48fe4
  ```
  Could not find a configuration file for package "HIP" that is compatible with requested version "1.0".
  The following configuration files were considered but not accepted:
  hip-config.cmake, version: 7.14.60850
  The version found is not compatible with the version requested.
  lld-link: error: could not open 'gloo_hip.lib': no such file or directory
  ```
- nightly branch + bump (gloo bcd1672) -> gloo finds HIP and gloo_hip.lib is successfully linked
  Build Windows PyTorch Wheels (dev, 3.12, gfx1151) · ROCm/TheRock@d731f33
  ```
  -- Found HIP: 7.14.60850
  -- GLOO_ROCM_ARCH: gfx1151
  [2218/2880] Linking HIP static library lib\gloo_hip.lib
  ```

Pull Request resolved: pytorch#186787
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007
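The two log signatures quoted in the commit message above (the `gloo_hip.lib` link error and the `-- Found HIP:` configure line) are easy to tell apart mechanically when triaging a workflow run. The following is only a rough sketch: the function name `classify_gloo_build_log` and the returned labels are hypothetical and not part of TheRock or PyTorch, and it matches only the exact strings shown in the logs above.

```python
import re

# Signature of the failing build, copied from the lld-link error quoted above.
MISSING_LIB = "could not open 'gloo_hip.lib'"

# Signature of the passing build: the CMake line that reports the HIP version.
HIP_FOUND = re.compile(r"^-- Found HIP: (\S+)", re.MULTILINE)


def classify_gloo_build_log(log_text: str) -> str:
    """Return a coarse label for a build log (hypothetical helper)."""
    # The link error is checked first: it is the definitive failure signal.
    if MISSING_LIB in log_text:
        return "gloo_hip_missing"
    match = HIP_FOUND.search(log_text)
    if match:
        return f"hip_found:{match.group(1)}"
    return "unknown"
```

A log that contains neither string is reported as `unknown` rather than guessed at, since other failure modes are outside what the quoted logs show.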
|
One last piece before this can go in,
I'm testing out the former to see if builds pass with just the submodule bump and will check in with the PyTorch team. |
|
SGTM. I'd prefer to add support uniformly across all supported versions (so yes, cherry-pick what is needed onto our release branches if otherwise low risk). Short of that, we can add the workaround to our build script as needed. |
Motivation
Resolves #3284
Our Windows PyTorch wheels have been shipping without torch.distributed support, causing third-party apps to fail with errors such as
Examples
These issues are seen on single-GPU systems where there's no actual distributed work being done. We've been mitigating them by patching the upstream libraries to fall back when torch.distributed isn't available, but the long-term solution is to enable the GLOO backend.
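To check whether a given wheel actually has the backend, something like the probe below could be used. It is a minimal sketch, not the check used in this PR: `distributed_status` is a hypothetical helper name, and it relies only on `torch.distributed.is_available()` and `torch.distributed.is_gloo_available()`. It also assumes that a missing torch install surfaces as an `ImportError`; other load failures (for example a DLL error on Windows) are not handled.

```python
import importlib


def distributed_status() -> dict:
    """Report whether torch.distributed and its Gloo backend are usable."""
    status = {"torch_installed": False, "distributed": False, "gloo": False}
    try:
        # Imported lazily so the probe also works where torch is absent.
        dist = importlib.import_module("torch.distributed")
    except ImportError:
        return status
    status["torch_installed"] = True
    status["distributed"] = bool(dist.is_available())
    # is_gloo_available() is only queried when the distributed package
    # is compiled in, since the c10d helpers are not exported otherwise.
    if status["distributed"]:
        status["gloo"] = bool(dist.is_gloo_available())
    return status


if __name__ == "__main__":
    print(distributed_status())
```

On a wheel built without torch.distributed, the expectation is that `distributed` and `gloo` both come back `False`, which is the situation the fallback patches were working around.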
Technical Details
In build_windows_pytorch_wheels.yml
In external-builds/pytorch/build_prod_wheels.py
Test Plan
Build Windows Pytorch Wheels workflow to generate wheels
Test Result
Using the wheels built from https://github.com/ROCm/TheRock/actions/runs/27149444602 on gfx1151,
Submission Checklist