[BE] Update nvshem dependency to 3.3.20 #160458

Closed
malfet wants to merge 7 commits into gh/malfet/484/base from gh/malfet/484/head

@malfet
Contributor

@malfet malfet commented Aug 12, 2025

Stack from ghstack (oldest at bottom):

Which is manylinux2_28 compatible, even on the aarch64 platform.

The archive contents and URL pattern changed quite drastically between 3.3.9 and 3.3.20, but hopefully it still works.
Package libnvshmem_host.so.3 into the gigantic aarch64+CUDA wheel.
Should fix #160425

[ghstack-poisoned]
@malfet malfet requested review from a team and jeffdaily as code owners August 12, 2025 20:28
@pytorch-bot

pytorch-bot bot commented Aug 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160458

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 134 Pending

As of commit afa4438 with merge base f782c79:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Aug 12, 2025
malfet added a commit that referenced this pull request Aug 12, 2025
Which is manylinux2_28 compatible, even on aarch64 platform
Should fix #160425

ghstack-source-id: dcc80a4
Pull Request resolved: #160458
@malfet
Contributor Author

malfet commented Aug 12, 2025

And surprise, the NVSHMEM tarball for 3.3.20 is not published to https://developer.download.nvidia.com/compute/redist/nvshmem/

Collaborator

@tinglvv tinglvv left a comment

It seems NVSHMEM changed the location where they officially publish packages.

For the install_nvshmem function:

filename="libnvshmem_cuda${cuda_major_version}-linux-${arch_path}-${nvshmem_version}"
url="https://developer.download.nvidia.com/compute/redist/nvshmem/${nvshmem_version}/builds/cuda${cuda_major_version}/txz/agnostic/${dl_arch}/${filename}.tar.gz"

Changing it to the following should work:

filename="libnvshmem-linux-${arch_path}-${nvshmem_version}_cuda${cuda_major_version}-archive"

url="https://developer.download.nvidia.com/compute/nvshmem/redist/libnvshmem/linux-${arch_path}/${filename}.tar.xz"
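
As a minimal sketch of the naming scheme suggested above, the archive name and URL can be assembled by two small helpers. The function names `nvshmem_archive_name` and `nvshmem_url` and their argument order are hypothetical and not taken from the PyTorch script; only the string patterns come from the comment.

```shell
# Hypothetical helpers: assemble the archive name and download URL
# following the pattern suggested in the review comment.

# Usage: nvshmem_archive_name <cuda_major_version> <nvshmem_version> <arch_path>
nvshmem_archive_name() {
  printf 'libnvshmem-linux-%s-%s_cuda%s-archive\n' "$3" "$2" "$1"
}

# Usage: nvshmem_url <cuda_major_version> <nvshmem_version> <arch_path>
nvshmem_url() {
  # Reuse the name helper so the file name is defined in exactly one place.
  printf 'https://developer.download.nvidia.com/compute/nvshmem/redist/libnvshmem/linux-%s/%s.tar.xz\n' \
    "$3" "$(nvshmem_archive_name "$1" "$2" "$3")"
}
```

For example, `nvshmem_archive_name 12 3.3.20 x86_64` prints `libnvshmem-linux-x86_64-3.3.20_cuda12-archive`.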

@nWEIdia
Collaborator

nWEIdia commented Aug 12, 2025

Not sure if these are needed, but a search with 3.3.9 also produces the following hits:

PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'

PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'

To be consistent, we might as well update all 3.3.9 to 3.3.20.
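
One way to find the remaining pins before updating them is a recursive fixed-string search. This is only a sketch: `list_nvshmem_pins` is a hypothetical helper, and the directory to search is an assumption left to the caller.

```shell
# Hypothetical helper: list files that still pin a given NVSHMEM version.
# Usage: list_nvshmem_pins <directory> <version>
list_nvshmem_pins() {
  # -r: recurse, -l: print file names only, -F: treat the pattern literally.
  # This is a plain substring match, so "3.3.9" would also match "3.3.90".
  grep -rlF "nvidia-nvshmem-cu12==${2}" "$1"
}
```

Called as `list_nvshmem_pins <directory> 3.3.9`, it prints one path per file that still contains the old pin and returns a non-zero status when nothing is left.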

@nWEIdia
Collaborator

nWEIdia commented Aug 13, 2025

Hi @malfet, we fixed the link and 3.3.20 is now available:

https://developer.download.nvidia.com/compute/redist/nvshmem/3.3.20/builds/cuda12/txz/agnostic/

Please note that these archives now seem to have the suffix ".xz", so

filename="libnvshmem_cuda${cuda_major_version}-linux-${arch_path}-${nvshmem_version}"
url="https://developer.download.nvidia.com/compute/redist/nvshmem/${nvshmem_version}/builds/cuda${cuda_major_version}/txz/agnostic/${dl_arch}/${filename}.tar.gz"
would still need to be updated accordingly.

[ghstack-poisoned]
malfet added a commit that referenced this pull request Aug 14, 2025
Which is manylinux2_28 compatible, even on aarch64 platform
Should fix #160425

ghstack-source-id: 2dd52d5
Pull Request resolved: #160458
@nWEIdia
Collaborator

nWEIdia commented Aug 15, 2025

Uh, in addition to the .gz to .xz change, the file name also seems to have changed:
libnvshmem-linux-x86_64-3.3.20_cuda12-archive.tar.xz instead of
libnvshmem_cuda12-linux-x86_64-3.3.20.tar.xz

"
changing to below should work:
filename="libnvshmem-linux-${arch_path}-${nvshmem_version}_cuda${cuda_major_version}-archive"
"

Collaborator

@nWEIdia nWEIdia left a comment

Uh looks like there are other references to tar.gz.

e.g. after wget -q "${url}"
we have:
tar xf "${filename}.tar.gz"

So this needs yet another update...
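
A sketch of one way to avoid this kind of drift: derive the archive name once and use it for both the download and the extraction. The helper names `archive_file` and `fetch_and_unpack` are hypothetical, and the `.tar.xz` default is an assumption based on the comments above.

```shell
# Hypothetical helper: the single place that decides the archive suffix.
# Usage: archive_file <filename> [suffix]   (suffix defaults to .tar.xz)
archive_file() {
  printf '%s%s\n' "$1" "${2:-.tar.xz}"
}

# Hypothetical helper: download and unpack using the same derived name.
# Usage: fetch_and_unpack <url_prefix> <filename> [suffix]
fetch_and_unpack() {
  archive="$(archive_file "$2" "${3:-}")"
  wget -q "${1}/${archive}" || return 1
  # GNU tar and bsdtar detect the compression when reading a file,
  # so the same command handles .tar.gz and .tar.xz.
  tar xf "${archive}"
}
```

With the suffix kept in one place, a future change of compression format touches a single default instead of every line that mentions the archive.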

[ghstack-poisoned]
malfet added a commit that referenced this pull request Aug 15, 2025
Which is manylinux2_28 compatible, even on aarch64 platform
Should fix #160425

ghstack-source-id: 0164b57
Pull Request resolved: #160458
@malfet malfet dismissed nWEIdia’s stale review August 15, 2025 02:27

Updated, and obviously I didn't plan to merge -f with all the signals red

@nWEIdia
Collaborator

nWEIdia commented Aug 15, 2025

I guess I should have stacked the comments together:

filename="libnvshmem-linux-${arch_path}-${nvshmem_version}_cuda${cuda_major_version}-archive"

I promise this should be THE last change that is needed.

[ghstack-poisoned]
malfet added a commit that referenced this pull request Aug 15, 2025
Which is manylinux2_28 compatible, even on aarch64 platform
Should fix #160425

ghstack-source-id: 1977470
Pull Request resolved: #160458
[ghstack-poisoned]
malfet added a commit that referenced this pull request Aug 15, 2025
Which is manylinux2_28 compatible, even on aarch64 platform
Should fix #160425

ghstack-source-id: d20634b
Pull Request resolved: #160458
@atalman atalman added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Aug 15, 2025
Collaborator

@nWEIdia nWEIdia left a comment

Sorry, I have to eat my own words again.
The following:
cp -a "libnvshmem/include/"* /usr/local/cuda/include/
cp -a "libnvshmem/lib/"* /usr/local/cuda/lib64/
needs to be changed to:
cp -a "${filename}/include/"* /usr/local/cuda/include/
cp -a "${filename}/lib/"* /usr/local/cuda/lib64/

to accommodate the layout of the released tar.xz file.
When extracted, v3.3.9 gives a directory named libnvshmem;
v3.3.20, as it currently stands, gives e.g. libnvshmem-linux-x86_64-3.3.20_cuda12-archive. Apologies for the inconvenience!
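
The copy step can be sketched as a helper that takes the extracted directory as an argument, so it does not depend on the top-level directory name of any particular release. `install_from_archive_dir` and its optional prefix argument are hypothetical; the `/usr/local/cuda` default mirrors the paths quoted above. Note the trailing slash after `lib`, so that the directory's contents are copied rather than the directory itself.

```shell
# Hypothetical helper: copy headers and libraries out of an extracted archive.
# Usage: install_from_archive_dir <extracted_dir> [prefix]
install_from_archive_dir() {
  extracted_dir="$1"
  install_prefix="${2:-/usr/local/cuda}"   # default taken from the quoted paths
  mkdir -p "${install_prefix}/include" "${install_prefix}/lib64"
  # -a preserves symlinks and modes, which matters for versioned .so files.
  cp -a "${extracted_dir}/include/"* "${install_prefix}/include/" || return 1
  cp -a "${extracted_dir}/lib/"* "${install_prefix}/lib64/"
}
```

Passing a scratch prefix as the second argument allows a dry run without touching the CUDA installation.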

@nWEIdia
Collaborator

nWEIdia commented Aug 16, 2025

The follow-up URL PR would be done in #160201 by @tinglvv

I would prefer fixing glitches in the script (which is BE work) to be kept separate from the CUDA-13 bringup

That has been the recommendation. @tinglvv up to you now :)

@tinglvv
Collaborator

tinglvv commented Aug 16, 2025

To clarify, I initially had to fix the URL to unblock the CUDA 13 PR, since the build was failing (when the packages had not yet been moved to the current location), so fixing the URL there was a nice added bonus.

Since we merged this PR and the CUDA 13 one will need a rebase, I will open a separate PR to update the URL.

@wdvr
Contributor

wdvr commented Aug 16, 2025

@pytorchmergebot revert -m "need to rerun workflow generation (failing workflow-checks)" -c landrace

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Aug 16, 2025
This reverts commit e0488d9.

Reverted #160458 on behalf of https://github.com/wdvr due to need to rerun workflow generation (failing workflow-checks) ([comment](#160458 (comment)))
@pytorchmergebot
Collaborator

@malfet your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Aug 16, 2025
@malfet
Contributor Author

malfet commented Aug 16, 2025

@pytorchmergebot revert -m "need to rerun workflow generation (failing workflow-checks)" -c nosignal

@wdvr it should have been a landrace, with @atalman's #160788

[ghstack-poisoned]
malfet added a commit that referenced this pull request Aug 16, 2025
Which is manylinux2_28 compatible, even on aarch64 platform
Should fix #160425

ghstack-source-id: 6eb2456
Pull Request resolved: #160458
@malfet
Contributor Author

malfet commented Aug 16, 2025

@pytorchbot merge -f "2nd time is the charm"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

can-gaa-hou pushed a commit to can-gaa-hou/pytorch that referenced this pull request Aug 22, 2025
Which is manylinux2_28 compatible, even on aarch64 platform

archive contents and URL pattern changed quite drastically between 3.3.9 and 3.3.20, but hopefully it still works.
Package `libnvshmem_host.so.3` into gigantic aarch64+CUDA wheel
Should fix pytorch#160425
Pull Request resolved: pytorch#160458
Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/tinglvv
@github-actions github-actions bot deleted the gh/malfet/484/head branch September 16, 2025 02:07
Labels

ci-no-td (Do not run TD on this PR), ciflow/binaries (Trigger all binary build and upload jobs on the PR), Merged, Reverted, topic: not user facing (topic category)

8 participants