[SymmMem] find_path does not search /usr/local/lib#157695
[SymmMem] find_path does not search /usr/local/lib#157695kwen2501 wants to merge 6 commits intogh/kwen2501/192/basefrom
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157695
Note: Links to docs will display an error until the docs builds have been completed. ⏳ No Failures, 24 PendingAs of commit 2e94d8e with merge base a6fab82 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
[ghstack-poisoned]
|
@kwen2501 Okay good, we have some compilation breakage on master to sort it. Missing includes perhaps? |
| PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib | ||
| DOC "The location of NVSHMEM device library.") | ||
| find_path(NVSHMEM_INCLUDE_DIR | ||
| NAMES nvshmem.h |
There was a problem hiding this comment.
Is this really the only relevant include file?
There was a problem hiding this comment.
Yeah, it seems NVSHMEM would include nvshmemx_collective_launch from nvshmem.h for some higher SM arch's, but not for some lower ones (the failed test runs on SM 52). Now I added #include <nvshmemx.h> for the lower SMs.
caffe2/CMakeLists.txt
Outdated
| find_library(NVSHMEM_DEVICE_LIB | ||
| # Device lib is a `.a` file | ||
| NAMES nvshmem_device | ||
| PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib |
There was a problem hiding this comment.
| PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib | |
| HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR} | |
| PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64) |
caffe2/CMakeLists.txt
Outdated
| # In pip install case, the lib suffix is `.so.3` instead of `.so` | ||
| NAMES libnvshmem_host.so libnvshmem_host.so.3 | ||
| NAMES nvshmem_host nvshmem_host.so.3 | ||
| PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib |
There was a problem hiding this comment.
| PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib | |
| HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR} | |
| PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64) |
There was a problem hiding this comment.
Okay, I don't it actually found NVSHMEM with the hints for some reason...
|
Nit: it just printed |
| message(STATUS " USE_SYSTEM_NCCL : ${USE_SYSTEM_NCCL}") | ||
| endif() | ||
| message(STATUS " NVSHMEM_LIB_DIR : ${NVSHMEM_LIB_DIR}") | ||
| message(STATUS " Found NVSHMEM : ${NVSHMEM_INCLUDE_DIR}") |
There was a problem hiding this comment.
This is printed even if message(STATUS " Found NVSHMEM : ${NVSHMEM_INCLUDE_DIR}") is empty...
There was a problem hiding this comment.
Empty means not found.
It works on my machine tho:
...
-- USE_NCCL : ON
-- USE_SYSTEM_NCCL : OFF
-- Found NVSHMEM : /usr/local/include
-- USE_NNPACK : 0
-- USE_NUMPY : ON
...
There was a problem hiding this comment.
Yeah, it's sadly not using the hints. :(
| PATHS $ENV{NVSHMEM_HOME}/include ${NVSHMEM_PY_DIR}/include | ||
| HINTS $ENV{NVSHMEM_HOME}/include ${NVSHMEM_PY_DIR}/include |
There was a problem hiding this comment.
We missed one replacement of PATHS -> HINTS previously.
HINTS: searched before default paths.
PATHS: searched after default paths.
Now it should be good. (I tested it locally)
|
@pytorchbot merge -f "Tests passed previously; changing from PATHS to HINTS to prioritize search" |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
@pytorchbot cherry-pick --onto release/2.8 -c critical |
Cherry picking #157695Command Details for Dev Infra teamRaised by workflow job |
|
@pytorchbot revert |
|
❌ 🤖 pytorchbot command failed: Try |
|
@pytorchbot revert -m "Changing it to be landable on 2.8 branch" -c landrace |
|
@pytorchbot successfully started a revert job. Check the current status here. |
This reverts commit 3effe0c. Reverted #157695 on behalf of https://github.com/kwen2501 due to Changing it to be landable on 2.8 branch ([comment](#157695 (comment)))
|
@kwen2501 your PR has been successfully reverted. |
This PR uses `find_library` to replace `find_path`. It also searches for NVSHMEM host lib and device lib separately. Tested against system install location: /usr/local/lib and /usr/local/include. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k [ghstack-poisoned]
|
@pytorchbot merge -f "Remove small changes to test_nvshmem_triton.py to make PR patchable onto 2.8 release branch" |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
@pytorchbot cherry-pick --onto cherry-pick-157513-by-pytorch_bot_bot_ -c critical |
This PR uses `find_library` to replace `find_path`. It also searches for NVSHMEM host lib and device lib separately. Tested against system install location: /usr/local/lib and /usr/local/include. Pull Request resolved: #157695 Approved by: https://github.com/Skylion007 ghstack dependencies: #157513 (cherry picked from commit c558907)
Cherry picking #157695The cherry pick PR is at #157765 and it is recommended to link a critical cherry pick PR with an issue. Details for Dev Infra teamRaised by workflow job |
Before, if NVSHMEM is installed at *BOTH* system location (e.g. `/usr/local`) and conda location (e.g. `/path/to/conda/lib/python3.10/site-packages/nvidia/nvshmem`, there can be a mismatch in where host lib and device lib are found:
```
-- NVSHMEM_HOME set to: ''
-- NVSHMEM wheel installed at: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB: '/usr/local/lib/libnvshmem_host.so'
-- NVSHMEM_DEVICE_LIB: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```
The reason is that CMake prioritize name search over dir search. In the script below, CMake will search all locations for `libnvshmem_host.so` first, before it searches for `.so.3`.
```
find_library(NVSHMEM_HOST_LIB
# In pip install case, the lib suffix is `.so.3` instead of `.so`
NAMES nvshmem_host nvshmem_host.so.3
HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR}
PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64)
```
This PR adds the `NAMES_PER_DIR` flag, according to CMake's doc:
> The NAMES_PER_DIR option tells this command to consider one directory at a time and search for all names in it.
After this PR:
```
-- NVSHMEM_HOME set to: ''
-- NVSHMEM wheel installed at: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_host.so.3'
-- NVSHMEM_DEVICE_LIB: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```
Pull Request resolved: #157836
Approved by: https://github.com/fegin, https://github.com/fduwjj
ghstack dependencies: #157513, #157695
Stack from ghstack (oldest at bottom):
This PR uses
find_libraryto replacefind_path.It also searches for NVSHMEM host lib and device lib separately.
Tested against system install location: /usr/local/lib and /usr/local/include.
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k