Skip to content

[SymmMem] find_path does not search /usr/local/lib#157695

Closed
kwen2501 wants to merge 6 commits intogh/kwen2501/192/basefrom
gh/kwen2501/192/head
Closed

[SymmMem] find_path does not search /usr/local/lib#157695
kwen2501 wants to merge 6 commits intogh/kwen2501/192/basefrom
gh/kwen2501/192/head

Conversation

@kwen2501
Copy link
Collaborator

@kwen2501 kwen2501 commented Jul 7, 2025

Stack from ghstack (oldest at bottom):

This PR uses find_library to replace find_path.
It also searches for NVSHMEM host lib and device lib separately.

Tested against system install location: /usr/local/lib and /usr/local/include.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@pytorch-bot
Copy link

pytorch-bot bot commented Jul 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157695

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 24 Pending

As of commit 2e94d8e with merge base a6fab82 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request Jul 7, 2025
ghstack-source-id: 43a6ecb
Pull Request resolved: #157695
@kwen2501 kwen2501 requested a review from Skylion007 July 7, 2025 07:19
@kwen2501 kwen2501 added the release notes: distributed (symm_mem) release note label for symmetric memory label Jul 7, 2025
@Skylion007
Copy link
Collaborator

@kwen2501 Okay good, we have some compilation breakage on master to sort it. Missing includes perhaps?

PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib
DOC "The location of NVSHMEM device library.")
find_path(NVSHMEM_INCLUDE_DIR
NAMES nvshmem.h
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really the only relevant include file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it seems NVSHMEM would include nvshmemx_collective_launch from nvshmem.h for some higher SM arch's, but not for some lower ones (the failed test runs on SM 52). Now I added #include <nvshmemx.h> for the lower SMs.

find_library(NVSHMEM_DEVICE_LIB
# Device lib is a `.a` file
NAMES nvshmem_device
PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib
HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR}
PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64)

Borrowed from cudnn:

# In pip install case, the lib suffix is `.so.3` instead of `.so`
NAMES libnvshmem_host.so libnvshmem_host.so.3
NAMES nvshmem_host nvshmem_host.so.3
PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib
HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR}
PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64)

Copy link
Collaborator

@Skylion007 Skylion007 Jul 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I don't it actually found NVSHMEM with the hints for some reason...

[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added ciflow/h100-symm-mem oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jul 7, 2025
[ghstack-poisoned]
@Skylion007
Copy link
Collaborator

Nit: it just printed -- Found NVSHMEM : not sure if it actually found it, and it's a bit out of place with the orther prints

message(STATUS " USE_SYSTEM_NCCL : ${USE_SYSTEM_NCCL}")
endif()
message(STATUS " NVSHMEM_LIB_DIR : ${NVSHMEM_LIB_DIR}")
message(STATUS " Found NVSHMEM : ${NVSHMEM_INCLUDE_DIR}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is printed even if message(STATUS " Found NVSHMEM : ${NVSHMEM_INCLUDE_DIR}") is empty...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty means not found.
It works on my machine tho:

...
--   USE_NCCL              : ON
--     USE_SYSTEM_NCCL     : OFF
--   Found NVSHMEM         : /usr/local/include
--   USE_NNPACK            : 0
--   USE_NUMPY             : ON
...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's sadly not using the hints. :(

@kwen2501 kwen2501 added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 7, 2025
[ghstack-poisoned]
Comment on lines -1009 to +1016
PATHS $ENV{NVSHMEM_HOME}/include ${NVSHMEM_PY_DIR}/include
HINTS $ENV{NVSHMEM_HOME}/include ${NVSHMEM_PY_DIR}/include
Copy link
Collaborator Author

@kwen2501 kwen2501 Jul 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We missed one replacement of PATHS -> HINTS previously.
HINTS: searched before default paths.
PATHS: searched after default paths.
Now it should be good. (I tested it locally)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kwen2501
Copy link
Collaborator Author

kwen2501 commented Jul 7, 2025

@pytorchbot merge -f "Tests passed previously; changing from PATHS to HINTS to prioritize search"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@kwen2501
Copy link
Collaborator Author

kwen2501 commented Jul 8, 2025

@pytorchbot cherry-pick --onto release/2.8 -c critical

@pytorchbot
Copy link
Collaborator

Cherry picking #157695

Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 3effe0c293219b00a0eae7e139fe2d9aed84bc03 returned non-zero exit code 1

Auto-merging caffe2/CMakeLists.txt
CONFLICT (content): Merge conflict in caffe2/CMakeLists.txt
Auto-merging cmake/Summary.cmake
CONFLICT (content): Merge conflict in cmake/Summary.cmake
Auto-merging test/distributed/test_nvshmem.py
CONFLICT (modify/delete): test/distributed/test_nvshmem_triton.py deleted in HEAD and modified in 3effe0c2932 ([SymmMem] find_path does not search /usr/local/lib (#157695)).  Version 3effe0c2932 ([SymmMem] find_path does not search /usr/local/lib (#157695)) of test/distributed/test_nvshmem_triton.py left in tree.
Auto-merging torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu
error: could not apply 3effe0c2932... [SymmMem] find_path does not search /usr/local/lib (#157695)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team Raised by workflow job

@kwen2501
Copy link
Collaborator Author

kwen2501 commented Jul 8, 2025

@pytorchbot revert

@pytorch-bot
Copy link

pytorch-bot bot commented Jul 8, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -m/--message, -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@kwen2501
Copy link
Collaborator Author

kwen2501 commented Jul 8, 2025

@pytorchbot revert -m "Changing it to be landable on 2.8 branch" -c landrace

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Jul 8, 2025
This reverts commit 3effe0c.

Reverted #157695 on behalf of https://github.com/kwen2501 due to Changing it to be landable on 2.8 branch ([comment](#157695 (comment)))
@pytorchmergebot
Copy link
Collaborator

@kwen2501 your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Jul 8, 2025
This PR uses `find_library` to replace `find_path`. 
It also searches for NVSHMEM host lib and device lib separately. 

Tested against system install location: /usr/local/lib and /usr/local/include.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Jul 8, 2025
ghstack-source-id: d607d38
Pull Request resolved: #157695
@kwen2501
Copy link
Collaborator Author

kwen2501 commented Jul 8, 2025

@pytorchbot merge -f "Remove small changes to test_nvshmem_triton.py to make PR patchable onto 2.8 release branch"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@kwen2501
Copy link
Collaborator Author

kwen2501 commented Jul 8, 2025

@pytorchbot cherry-pick --onto cherry-pick-157513-by-pytorch_bot_bot_ -c critical

pytorchbot pushed a commit that referenced this pull request Jul 8, 2025
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.

Tested against system install location: /usr/local/lib and /usr/local/include.

Pull Request resolved: #157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513

(cherry picked from commit c558907)
@pytorchbot
Copy link
Collaborator

Cherry picking #157695

The cherry pick PR is at #157765 and it is recommended to link a critical cherry pick PR with an issue.

Details for Dev Infra team Raised by workflow job

pytorchmergebot pushed a commit that referenced this pull request Jul 14, 2025
Before, if NVSHMEM is installed at *BOTH* system location (e.g. `/usr/local`) and conda location (e.g. `/path/to/conda/lib/python3.10/site-packages/nvidia/nvshmem`, there can be a mismatch in where host lib and device lib are found:
```
-- NVSHMEM_HOME set to:  ''
-- NVSHMEM wheel installed at:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB:  '/usr/local/lib/libnvshmem_host.so'
-- NVSHMEM_DEVICE_LIB:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```

The reason is that CMake prioritize name search over dir search. In the script below, CMake will search all locations for `libnvshmem_host.so` first, before it searches for `.so.3`.
```
find_library(NVSHMEM_HOST_LIB
      # In pip install case, the lib suffix is `.so.3` instead of `.so`
      NAMES nvshmem_host nvshmem_host.so.3
      HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR}
      PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64)
```

This PR adds the `NAMES_PER_DIR` flag, according to CMake's doc:
> The NAMES_PER_DIR option tells this command to consider one directory at a time and search for all names in it.

After this PR:
```
-- NVSHMEM_HOME set to:  ''
-- NVSHMEM wheel installed at:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_host.so.3'
-- NVSHMEM_DEVICE_LIB:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```
Pull Request resolved: #157836
Approved by: https://github.com/fegin, https://github.com/fduwjj
ghstack dependencies: #157513, #157695
@github-actions github-actions bot deleted the gh/kwen2501/192/head branch August 7, 2025 02:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-no-td Do not run TD on this PR ciflow/h100-symm-mem ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (symm_mem) release note label for symmetric memory Reverted

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants