[SymmMem] Find NVSHMEM from system installation#157513
[SymmMem] Find NVSHMEM from system installation#157513kwen2501 wants to merge 5 commits intogh/kwen2501/189/basefrom
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157513
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 Cancelled Job, 48 PendingAs of commit 1c92bd6 with merge base 7597988 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
|
||
| // NVSHMEM minimum SM arch | ||
| #define _NVSHMEM_MIN_SM_ARCH 700 | ||
|
|
||
| // Some NVSHMEM device APIs do not compile on older SM archs | ||
| #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < _NVSHMEM_MIN_SM_ARCH) | ||
| // Only include host APIs. See nvshmem.h for details. | ||
| #define NVSHMEM_HOSTLIB_ONLY | ||
| #endif // Must be done before nvshmem.h is included | ||
|
|
There was a problem hiding this comment.
@atalman Solving the sm_52 compile issue.
tools/setup_helpers/cmake.py
Outdated
| build_options["NVSHMEM_HOME"] = nvshmem_home | ||
| nvshmem_py_dir = py_lib_path + "/nvidia/nvshmem" | ||
| if os.path.exists(nvshmem_py_dir): | ||
| print("NVSHMEM wheel install found: " + nvshmem_py_dir) |
There was a problem hiding this comment.
This print should be removed before merge
Previously we only search for NVSHMEM from pip install location. This PR adds search in system locations deemed default by CMake. Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k Skylion007 [ghstack-poisoned]
|
@pytorchbot merge -f "CI passed; removing a print statement" |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
I never actually exported NVSHMEM for the builder so this patch is missing one more part: So we still need one more component sadly to properly set NVSHMEM... |
Skylion007
left a comment
There was a problem hiding this comment.
We still need some more NVSHMEM locations set
| find_path(NVSHMEM_LIB_DIR | ||
| # In pip install case, the lib suffix is `.so.3` instead of `.so` | ||
| NAMES libnvshmem_host.so libnvshmem_host.so.3 | ||
| PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib |
There was a problem hiding this comment.
We never specify NVSHMEM_HOME so it's not building with it...
There was a problem hiding this comment.
arg, I thought find_path would search for default paths if the file is not found from the specified paths.
There was a problem hiding this comment.
I have a question: why did the the original impl fail to extrat NVSHMEM from system installation? Thanks
There was a problem hiding this comment.
@vinjn because NVSHMEM_HOME wasn't set to system install location unless user sets it. The PR uses CMake's find_... to search for it.
`maybe_initialize_env_vars` and `initialize_nvshmem_with_store` are only used in `NVSHMEMSymmetricMemory.cu`. Moving them there. Pull Request resolved: #157611 Approved by: https://github.com/Skylion007 ghstack dependencies: #157513
This PR uses `find_library` to replace `find_path`. It also searches for NVSHMEM host lib and device lib separately. Tested against system install location: /usr/local/lib and /usr/local/include. Pull Request resolved: #157695 Approved by: https://github.com/Skylion007 ghstack dependencies: #157513
|
@pytorchbot cherry-pick --onto release/2.8 -c Compilation |
|
❌ 🤖 pytorchbot command failed: Try |
|
@pytorchbot cherry-pick --onto release/2.8 -c critical |
Previously we only search for NVSHMEM from pip install location. This PR adds search in system locations deemed default by CMake. Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines. Pull Request resolved: #157513 Approved by: https://github.com/atalman, https://github.com/Skylion007 (cherry picked from commit 99c1a6b)
Cherry picking #157513The cherry pick PR is at #157755 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated: Details for Dev Infra teamRaised by workflow job |
This PR uses `find_library` to replace `find_path`. It also searches for NVSHMEM host lib and device lib separately. Tested against system install location: /usr/local/lib and /usr/local/include. Pull Request resolved: #157695 Approved by: https://github.com/Skylion007 ghstack dependencies: #157513
This PR uses `find_library` to replace `find_path`. It also searches for NVSHMEM host lib and device lib separately. Tested against system install location: /usr/local/lib and /usr/local/include. Pull Request resolved: #157695 Approved by: https://github.com/Skylion007 ghstack dependencies: #157513 (cherry picked from commit c558907)
Before, if NVSHMEM is installed at *BOTH* system location (e.g. `/usr/local`) and conda location (e.g. `/path/to/conda/lib/python3.10/site-packages/nvidia/nvshmem`, there can be a mismatch in where host lib and device lib are found:
```
-- NVSHMEM_HOME set to: ''
-- NVSHMEM wheel installed at: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB: '/usr/local/lib/libnvshmem_host.so'
-- NVSHMEM_DEVICE_LIB: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```
The reason is that CMake prioritize name search over dir search. In the script below, CMake will search all locations for `libnvshmem_host.so` first, before it searches for `.so.3`.
```
find_library(NVSHMEM_HOST_LIB
# In pip install case, the lib suffix is `.so.3` instead of `.so`
NAMES nvshmem_host nvshmem_host.so.3
HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR}
PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64)
```
This PR adds the `NAMES_PER_DIR` flag, according to CMake's doc:
> The NAMES_PER_DIR option tells this command to consider one directory at a time and search for all names in it.
After this PR:
```
-- NVSHMEM_HOME set to: ''
-- NVSHMEM wheel installed at: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_host.so.3'
-- NVSHMEM_DEVICE_LIB: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```
Pull Request resolved: #157836
Approved by: https://github.com/fegin, https://github.com/fduwjj
ghstack dependencies: #157513, #157695
Stack from ghstack (oldest at bottom):
Previously we only search for NVSHMEM from pip install location.
This PR adds search in system locations deemed default by CMake.
Related: #157453 untars NVSHMEM into
/usr/localon our CI machines.cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @Skylion007