[BE]: Fix NVSHMEM builds, add missing 12.9 dependency and update to latest for 2.8RC#157453
[BE]: Fix NVSHMEM builds, add missing 12.9 dependency and update to latest for 2.8RC#157453Skylion007 wants to merge 1 commit intopytorch:mainfrom
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157453
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit c2d6d7b with merge base 82eefae ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
8119930 to
99dabab
Compare
c76ed71 to
1b12f69
Compare
|
Great Ubuntu is having an outage |
1b12f69 to
f8899ce
Compare
f8899ce to
c84bbb0
Compare
c84bbb0 to
c2d6d7b
Compare
kwen2501
left a comment
There was a problem hiding this comment.
Thank you so much! LGTM!
| cp -a "libnvshmem/include/"* /usr/local/include/ | ||
| cp -a "libnvshmem/lib/"* /usr/local/lib/ |
There was a problem hiding this comment.
Oh I was previously detecting NVSHMEM lib from python/site-packages/nvidia.
Maybe I would need to change how we set NVSHMEM_HOME based on this change?
There was a problem hiding this comment.
It should be detected either way here? We do use NVSHMEM from the python package anyway at the end for non-static builds.
There was a problem hiding this comment.
This is how we currently detect it for building:
pytorch/tools/setup_helpers/cmake.py
Lines 316 to 320 in d5d14ee
And there is a gate here that stops building torch with NVSHMEM if NVSHMEM_HOME is not set:
Lines 998 to 1002 in 0105cd8
I see two solutions:
(1) We add export NVSHMEM_HOME=/usr/local here, but it looks a bit weird;
(2) We add a findNVSHMEM.cmake as a single place to detect the library and set NVSHMEM_HOME there.
What do you think?
There was a problem hiding this comment.
I'd say a findNVSHMEM.cmake
There was a problem hiding this comment.
The pip version is fine too obviously, but it didn't seem to work for you other build?
There was a problem hiding this comment.
So libnvshmem comes with libnvshmem/lib/cmake/nvshmem/NVSHMEMConfig.cmake, but it doesn't set the NVSHMEM home.
There was a problem hiding this comment.
@kwen2501 This more complicated as our current CMake is broken when we build with NVSHMEM because it tries to build for unsupported architectures.
There was a problem hiding this comment.
@kwen2501 Also libtorch will be missing it now, right?
Correct. So we'd need to find a way to work for both. (pip install or system install)
There was a problem hiding this comment.
@kwen2501 Proper way is just do CMake. It will work pip install thanks the to the rpath changes we made.
There was a problem hiding this comment.
Thanks. I added CMake search here: #157513.
Do you mind taking a look? Thank you!
|
@pytorchmergebot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…ation" Previously we only search for NVSHMEM from pip install location. This PR adds search in system locations deemed default by CMake. Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k Skylion007 [ghstack-poisoned]
Previously we only search for NVSHMEM from pip install location. This PR adds search in system locations deemed default by CMake. Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k Skylion007 [ghstack-poisoned]
Previously we only search for NVSHMEM from pip install location. This PR adds search in system locations deemed default by CMake. Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines. Pull Request resolved: #157513 Approved by: https://github.com/atalman, https://github.com/Skylion007
…ation" Previously we only search for NVSHMEM from pip install location. This PR adds search in system locations deemed default by CMake. Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k Skylion007 [ghstack-poisoned]
Previously we only search for NVSHMEM from pip install location. This PR adds search in system locations deemed default by CMake. Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k Skylion007 [ghstack-poisoned]
Previously we only search for NVSHMEM from pip install location. This PR adds search in system locations deemed default by CMake. Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines. Pull Request resolved: #157513 Approved by: https://github.com/atalman, https://github.com/Skylion007 (cherry picked from commit 99c1a6b)
|
@pytorchbot cherry-pick --onto release/2.8 -c critical |
…atest for 2.8RC (#157453) Fixed our bad builds of nvshmem, (we were not building or testing before) and also updates to the latest version. Newest versions has critical support for things that would actually make it useful, like bfloat16 and float16 support. This is a proper fix for: #157411 Pull Request resolved: #157453 Approved by: https://github.com/kwen2501, https://github.com/atalman (cherry picked from commit a6fab82)
Cherry picking #157453The cherry pick PR is at #157774 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated: Details for Dev Infra teamRaised by workflow job |
Fixed our bad builds of nvshmem, (we were not building or testing before) and also updates to the latest version. Newest versions has critical support for things that would actually make it useful, like bfloat16 and float16 support.
This is a proper fix for: #157411