Skip to content

[SymmMem] Find NVSHMEM from system installation#157513

Closed
kwen2501 wants to merge 5 commits intogh/kwen2501/189/basefrom
gh/kwen2501/189/head
Closed

[SymmMem] Find NVSHMEM from system installation#157513
kwen2501 wants to merge 5 commits intogh/kwen2501/189/basefrom
gh/kwen2501/189/head

Conversation

@kwen2501
Copy link
Collaborator

@kwen2501 kwen2501 commented Jul 2, 2025

Stack from ghstack (oldest at bottom):

Previously we only search for NVSHMEM from pip install location.
This PR adds search in system locations deemed default by CMake.
Related: #157453 untars NVSHMEM into /usr/local on our CI machines.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @Skylion007

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Jul 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157513

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job, 48 Pending

As of commit 1c92bd6 with merge base 7597988 (image):

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Jul 2, 2025
ghstack-source-id: 42dcc41
Pull-Request-resolved: #157513
@pytorch-bot pytorch-bot bot added ciflow/h100-symm-mem oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jul 2, 2025
[ghstack-poisoned]
[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Jul 3, 2025
ghstack-source-id: d2311dd
Pull-Request-resolved: #157513
Comment on lines +11 to +20

// NVSHMEM minimum SM arch
#define _NVSHMEM_MIN_SM_ARCH 700

// Some NVSHMEM device APIs do not compile on older SM archs
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < _NVSHMEM_MIN_SM_ARCH)
// Only include host APIs. See nvshmem.h for details.
#define NVSHMEM_HOSTLIB_ONLY
#endif // Must be done before nvshmem.h is included

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atalman Solving the sm_52 compile issue.

build_options["NVSHMEM_HOME"] = nvshmem_home
nvshmem_py_dir = py_lib_path + "/nvidia/nvshmem"
if os.path.exists(nvshmem_py_dir):
print("NVSHMEM wheel install found: " + nvshmem_py_dir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This print should be removed before merge

@kwen2501 kwen2501 added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 3, 2025
Previously we only search for NVSHMEM from pip install location. 
This PR adds search in system locations deemed default by CMake.
Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k Skylion007 

[ghstack-poisoned]
@kwen2501
Copy link
Collaborator Author

kwen2501 commented Jul 4, 2025

@pytorchbot merge -f "CI passed; removing a print statement"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@Skylion007
Copy link
Collaborator

I never actually exported NVSHMEM for the builder so this patch is missing one more part:
According to the logs:

-- NVSHMEM_HOME set to:  ''
-- NVSHMEM wheel installed at:  ''
-- NVSHMEM not found, not building with NVSHMEM support.

So we still need one more component sadly to properly set NVSHMEM...

Copy link
Collaborator

@Skylion007 Skylion007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need some more NVSHMEM locations set

find_path(NVSHMEM_LIB_DIR
# In pip install case, the lib suffix is `.so.3` instead of `.so`
NAMES libnvshmem_host.so libnvshmem_host.so.3
PATHS $ENV{NVSHMEM_HOME}/lib ${NVSHMEM_PY_DIR}/lib
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never specify NVSHMEM_HOME so it's not building with it...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arg, I thought find_path would search for default paths if the file is not found from the specified paths.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question: why did the the original impl fail to extrat NVSHMEM from system installation? Thanks

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinjn because NVSHMEM_HOME wasn't set to system install location unless user sets it. The PR uses CMake's find_... to search for it.

pytorchmergebot pushed a commit that referenced this pull request Jul 4, 2025
`maybe_initialize_env_vars` and `initialize_nvshmem_with_store` are only used in `NVSHMEMSymmetricMemory.cu`. Moving them there.

Pull Request resolved: #157611
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
pytorchmergebot pushed a commit that referenced this pull request Jul 7, 2025
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.

Tested against system install location: /usr/local/lib and /usr/local/include.

Pull Request resolved: #157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
@kwen2501
Copy link
Collaborator Author

kwen2501 commented Jul 8, 2025

@pytorchbot cherry-pick --onto release/2.8 -c Compilation

@pytorch-bot
Copy link

pytorch-bot bot commented Jul 8, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot cherry-pick: error: argument -c/--classification: invalid choice: 'Compilation' (choose from 'regression', 'critical', 'fixnewfeature', 'docs', 'release')

usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
                               {regression,critical,fixnewfeature,docs,release}

Try @pytorchbot --help for more info.

@kwen2501
Copy link
Collaborator Author

kwen2501 commented Jul 8, 2025

@pytorchbot cherry-pick --onto release/2.8 -c critical

pytorchbot pushed a commit that referenced this pull request Jul 8, 2025
Previously we only search for NVSHMEM from pip install location.
This PR adds search in system locations deemed default by CMake.
Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines.

Pull Request resolved: #157513
Approved by: https://github.com/atalman, https://github.com/Skylion007

(cherry picked from commit 99c1a6b)
@pytorchbot
Copy link
Collaborator

Cherry picking #157513

The cherry pick PR is at #157755 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:

Details for Dev Infra team Raised by workflow job

pytorchmergebot pushed a commit that referenced this pull request Jul 8, 2025
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.

Tested against system install location: /usr/local/lib and /usr/local/include.

Pull Request resolved: #157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
pytorchbot pushed a commit that referenced this pull request Jul 8, 2025
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.

Tested against system install location: /usr/local/lib and /usr/local/include.

Pull Request resolved: #157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513

(cherry picked from commit c558907)
pytorchmergebot pushed a commit that referenced this pull request Jul 14, 2025
Before, if NVSHMEM is installed at *BOTH* system location (e.g. `/usr/local`) and conda location (e.g. `/path/to/conda/lib/python3.10/site-packages/nvidia/nvshmem`, there can be a mismatch in where host lib and device lib are found:
```
-- NVSHMEM_HOME set to:  ''
-- NVSHMEM wheel installed at:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB:  '/usr/local/lib/libnvshmem_host.so'
-- NVSHMEM_DEVICE_LIB:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```

The reason is that CMake prioritize name search over dir search. In the script below, CMake will search all locations for `libnvshmem_host.so` first, before it searches for `.so.3`.
```
find_library(NVSHMEM_HOST_LIB
      # In pip install case, the lib suffix is `.so.3` instead of `.so`
      NAMES nvshmem_host nvshmem_host.so.3
      HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR}
      PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64)
```

This PR adds the `NAMES_PER_DIR` flag, according to CMake's doc:
> The NAMES_PER_DIR option tells this command to consider one directory at a time and search for all names in it.

After this PR:
```
-- NVSHMEM_HOME set to:  ''
-- NVSHMEM wheel installed at:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_host.so.3'
-- NVSHMEM_DEVICE_LIB:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR:  '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```
Pull Request resolved: #157836
Approved by: https://github.com/fegin, https://github.com/fduwjj
ghstack dependencies: #157513, #157695
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/h100-symm-mem ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (symm_mem) release note label for symmetric memory

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants