[SymmMem] find_path does not search /usr/local/lib by kwen2501 · Pull Request #157695 · pytorch/pytorch

kwen2501 · 2025-07-07T07:14:02Z

Stack from ghstack (oldest at bottom):

This PR uses find_library to replace find_path.
It also searches for NVSHMEM host lib and device lib separately.

Tested against system install location: /usr/local/lib and /usr/local/include.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

[ghstack-poisoned]

pytorch-bot · 2025-07-07T07:14:06Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157695

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 24 Pending

As of commit 2e94d8e with merge base a6fab82 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]

ghstack-source-id: 43a6ecb Pull Request resolved: #157695

Skylion007 · 2025-07-07T13:00:30Z

@kwen2501 Okay good, we have some compilation breakage on master to sort it. Missing includes perhaps?

Skylion007 · 2025-07-07T13:01:10Z

caffe2/CMakeLists.txt

+      PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib
+      DOC "The location of NVSHMEM device library.")
    find_path(NVSHMEM_INCLUDE_DIR
      NAMES nvshmem.h


Is this really the only relevant include file?

Yeah, it seems NVSHMEM would include nvshmemx_collective_launch from nvshmem.h for some higher SM arch's, but not for some lower ones (the failed test runs on SM 52). Now I added #include <nvshmemx.h> for the lower SMs.

Skylion007 · 2025-07-07T15:28:28Z

caffe2/CMakeLists.txt

+    find_library(NVSHMEM_DEVICE_LIB
+      # Device lib is a `.a` file
+      NAMES nvshmem_device
+      PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib


Suggested change

PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib

HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR}

PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64)

Borrowed from cudnn:

Skylion007 · 2025-07-07T15:31:39Z

caffe2/CMakeLists.txt

      # In pip install case, the lib suffix is `.so.3` instead of `.so`
-      NAMES libnvshmem_host.so libnvshmem_host.so.3
+      NAMES nvshmem_host nvshmem_host.so.3
      PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib


Suggested change

PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib

HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR}

PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64)

Okay, I don't it actually found NVSHMEM with the hints for some reason...

[ghstack-poisoned]

Skylion007 · 2025-07-07T20:04:45Z

Nit: it just printed -- Found NVSHMEM : not sure if it actually found it, and it's a bit out of place with the orther prints

Skylion007 · 2025-07-07T20:05:12Z

cmake/Summary.cmake

    message(STATUS "    USE_SYSTEM_NCCL     : ${USE_SYSTEM_NCCL}")
  endif()
-  message(STATUS "  NVSHMEM_LIB_DIR       : ${NVSHMEM_LIB_DIR}")
+  message(STATUS "  Found NVSHMEM         : ${NVSHMEM_INCLUDE_DIR}")


This is printed even if message(STATUS " Found NVSHMEM : ${NVSHMEM_INCLUDE_DIR}") is empty...

Empty means not found.
It works on my machine tho:

... -- USE_NCCL : ON -- USE_SYSTEM_NCCL : OFF -- Found NVSHMEM : /usr/local/include -- USE_NNPACK : 0 -- USE_NUMPY : ON ...

Yeah, it's sadly not using the hints. :(

[ghstack-poisoned]

kwen2501 · 2025-07-07T23:09:58Z

caffe2/CMakeLists.txt

-      PATHS $ENV{NVSHMEM_HOME}/include ${NVSHMEM_PY_DIR}/include
+      HINTS $ENV{NVSHMEM_HOME}/include ${NVSHMEM_PY_DIR}/include


We missed one replacement of PATHS -> HINTS previously.
HINTS: searched before default paths.
PATHS: searched after default paths.
Now it should be good. (I tested it locally)

@Skylion007 ^^

kwen2501 · 2025-07-07T23:14:51Z

@pytorchbot merge -f "Tests passed previously; changing from PATHS to HINTS to prioritize search"

pytorchmergebot · 2025-07-07T23:16:27Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

kwen2501 · 2025-07-08T00:42:58Z

@pytorchbot cherry-pick --onto release/2.8 -c critical

pytorchbot · 2025-07-08T00:47:43Z

Cherry picking #157695

Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 3effe0c293219b00a0eae7e139fe2d9aed84bc03 returned non-zero exit code 1

Auto-merging caffe2/CMakeLists.txt
CONFLICT (content): Merge conflict in caffe2/CMakeLists.txt
Auto-merging cmake/Summary.cmake
CONFLICT (content): Merge conflict in cmake/Summary.cmake
Auto-merging test/distributed/test_nvshmem.py
CONFLICT (modify/delete): test/distributed/test_nvshmem_triton.py deleted in HEAD and modified in 3effe0c2932 ([SymmMem] find_path does not search /usr/local/lib (#157695)).  Version 3effe0c2932 ([SymmMem] find_path does not search /usr/local/lib (#157695)) of test/distributed/test_nvshmem_triton.py left in tree.
Auto-merging torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu
error: could not apply 3effe0c2932... [SymmMem] find_path does not search /usr/local/lib (#157695)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"

Details for Dev Infra team

Raised by workflow job

kwen2501 · 2025-07-08T01:09:16Z

@pytorchbot revert

pytorch-bot · 2025-07-08T01:09:19Z

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -m/--message, -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

kwen2501 · 2025-07-08T01:10:15Z

@pytorchbot revert -m "Changing it to be landable on 2.8 branch" -c landrace

pytorchmergebot · 2025-07-08T01:11:50Z

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

This reverts commit 3effe0c. Reverted #157695 on behalf of https://github.com/kwen2501 due to Changing it to be landable on 2.8 branch ([comment](#157695 (comment)))

pytorchmergebot · 2025-07-08T01:12:05Z

@kwen2501 your PR has been successfully reverted.

This PR uses `find_library` to replace `find_path`. It also searches for NVSHMEM host lib and device lib separately. Tested against system install location: /usr/local/lib and /usr/local/include. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k [ghstack-poisoned]

ghstack-source-id: d607d38 Pull Request resolved: #157695

kwen2501 · 2025-07-08T01:20:04Z

@pytorchbot merge -f "Remove small changes to test_nvshmem_triton.py to make PR patchable onto 2.8 release branch"

pytorchmergebot · 2025-07-08T01:21:42Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

kwen2501 · 2025-07-08T01:27:27Z

@pytorchbot cherry-pick --onto cherry-pick-157513-by-pytorch_bot_bot_ -c critical

This PR uses `find_library` to replace `find_path`. It also searches for NVSHMEM host lib and device lib separately. Tested against system install location: /usr/local/lib and /usr/local/include. Pull Request resolved: #157695 Approved by: https://github.com/Skylion007 ghstack dependencies: #157513 (cherry picked from commit c558907)

pytorchbot · 2025-07-08T01:32:02Z

Cherry picking #157695

The cherry pick PR is at #157765 and it is recommended to link a critical cherry pick PR with an issue.

Details for Dev Infra team

Raised by workflow job

Before, if NVSHMEM is installed at *BOTH* system location (e.g. `/usr/local`) and conda location (e.g. `/path/to/conda/lib/python3.10/site-packages/nvidia/nvshmem`, there can be a mismatch in where host lib and device lib are found: ``` -- NVSHMEM_HOME set to: '' -- NVSHMEM wheel installed at: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem' -- NVSHMEM_HOST_LIB: '/usr/local/lib/libnvshmem_host.so' -- NVSHMEM_DEVICE_LIB: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a' -- NVSHMEM_INCLUDE_DIR: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include' ``` The reason is that CMake prioritize name search over dir search. In the script below, CMake will search all locations for `libnvshmem_host.so` first, before it searches for `.so.3`. ``` find_library(NVSHMEM_HOST_LIB # In pip install case, the lib suffix is `.so.3` instead of `.so` NAMES nvshmem_host nvshmem_host.so.3 HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR} PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64) ``` This PR adds the `NAMES_PER_DIR` flag, according to CMake's doc: > The NAMES_PER_DIR option tells this command to consider one directory at a time and search for all names in it. After this PR: ``` -- NVSHMEM_HOME set to: '' -- NVSHMEM wheel installed at: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem' -- NVSHMEM_HOST_LIB: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_host.so.3' -- NVSHMEM_DEVICE_LIB: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a' -- NVSHMEM_INCLUDE_DIR: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include' ``` Pull Request resolved: #157836 Approved by: https://github.com/fegin, https://github.com/fduwjj ghstack dependencies: #157513, #157695

[SymmMem] find_path does not search /usr/local/lib

90e5750

[ghstack-poisoned]

Update on "[SymmMem] find_path does not search /usr/local/lib"

9a4d0dc

[ghstack-poisoned]

kwen2501 added a commit that referenced this pull request Jul 7, 2025

[SymmMem] find_path does not search /usr/local/lib

13c4d64

ghstack-source-id: 43a6ecb Pull Request resolved: #157695

kwen2501 requested a review from Skylion007 July 7, 2025 07:19

kwen2501 added the release notes: distributed (symm_mem) release note label for symmetric memory label Jul 7, 2025

Skylion007 approved these changes Jul 7, 2025

View reviewed changes

Skylion007 reviewed Jul 7, 2025

View reviewed changes

Update

d1b55aa

[ghstack-poisoned]

pytorch-bot bot added ciflow/h100-symm-mem oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jul 7, 2025

Update

d5afe69

[ghstack-poisoned]

Skylion007 reviewed Jul 7, 2025

View reviewed changes

kwen2501 added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 7, 2025

Update

e60a597

[ghstack-poisoned]

kwen2501 commented Jul 7, 2025

View reviewed changes

pytorchmergebot added the merging label Jul 7, 2025

pytorchmergebot closed this in 3effe0c Jul 7, 2025

pytorchmergebot added Merged and removed merging labels Jul 7, 2025

pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Jul 8, 2025

pytorchmergebot reopened this Jul 8, 2025

kwen2501 added a commit that referenced this pull request Jul 8, 2025

[SymmMem] find_path does not search /usr/local/lib

8872276

ghstack-source-id: d607d38 Pull Request resolved: #157695

pytorchmergebot added the merging label Jul 8, 2025

pytorchmergebot closed this in c558907 Jul 8, 2025

pytorchmergebot removed the merging label Jul 8, 2025

pytorchbot mentioned this pull request Jul 8, 2025

[SymmMem] find_path does not search /usr/local/lib #157765

Closed

This was referenced Jul 8, 2025

[v.2.8.0] Release Tracker #156745

Open

[Do not land] Test NVSHMEM build #157970

Closed

github-actions bot deleted the gh/kwen2501/192/head branch August 7, 2025 02:21

	PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib
	HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR}
	PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64)

		PATHS $ENV{NVSHMEM_HOME}/include ${NVSHMEM_PY_DIR}/include
		HINTS $ENV{NVSHMEM_HOME}/include ${NVSHMEM_PY_DIR}/include

Conversation

kwen2501 commented Jul 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot bot commented Jul 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157695

⏳ No Failures, 24 Pending

Uh oh!

Skylion007 commented Jul 7, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Skylion007 Jul 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Skylion007 commented Jul 7, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kwen2501 Jul 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kwen2501 commented Jul 7, 2025

Uh oh!

pytorchmergebot commented Jul 7, 2025

Merge started

Uh oh!

kwen2501 commented Jul 8, 2025

Uh oh!

pytorchbot commented Jul 8, 2025

Cherry picking #157695

Uh oh!

kwen2501 commented Jul 8, 2025

Uh oh!

pytorch-bot bot commented Jul 8, 2025

Uh oh!

kwen2501 commented Jul 8, 2025

Uh oh!

pytorchmergebot commented Jul 8, 2025

Uh oh!

pytorchmergebot commented Jul 8, 2025

Uh oh!

kwen2501 commented Jul 8, 2025

Uh oh!

pytorchmergebot commented Jul 8, 2025

Merge started

Uh oh!

kwen2501 commented Jul 8, 2025

Uh oh!

pytorchbot commented Jul 8, 2025

Cherry picking #157695

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

kwen2501 commented Jul 7, 2025 •

edited

Loading

pytorch-bot bot commented Jul 7, 2025 •

edited

Loading

Skylion007 Jul 7, 2025 •

edited

Loading

kwen2501 Jul 7, 2025 •

edited

Loading