[BE] Update nvshem dependency to 3.3.20 by malfet · Pull Request #160458 · pytorch/pytorch

malfet · 2025-08-12T20:28:49Z

Stack from ghstack (oldest at bottom):

-> [BE] Update nvshem dependency to 3.3.20 #160458

NVSHMEM 3.3.20 is manylinux2_28 compatible, even on the aarch64 platform.

The archive contents and URL pattern changed quite drastically between 3.3.9 and 3.3.20, but hopefully it still works.
Package libnvshmem_host.so.3 into the gigantic aarch64+CUDA wheel.
Should fix #160425

pytorch-bot · 2025-08-12T20:28:53Z

⏳ No Failures, 134 Pending

As of commit afa4438 with merge base f782c79 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

malfet · 2025-08-12T21:26:59Z

And surprise, the nvshmem tarball for 3.3.20 is not published to https://developer.download.nvidia.com/compute/redist/nvshmem/

tinglvv

It seems NVSHMEM changed the location where they officially publish packages.

For the install_nvshmem function:

filename="libnvshmem_cuda${cuda_major_version}-linux-${arch_path}-${nvshmem_version}"
url="https://developer.download.nvidia.com/compute/redist/nvshmem/${nvshmem_version}/builds/cuda${cuda_major_version}/txz/agnostic/${dl_arch}/${filename}.tar.gz"

Changing these to the following should work:

filename="libnvshmem-linux-${arch_path}-${nvshmem_version}_cuda${cuda_major_version}-archive"

url="https://developer.download.nvidia.com/compute/nvshmem/redist/libnvshmem/linux-${arch_path}/${filename}.tar.xz"
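As a rough illustration of the naming scheme suggested above, the sketch below assembles the archive name and URL from the version, CUDA major version, and architecture. The function name `nvshmem_archive_url` is hypothetical and not part of install_cuda.sh; only the filename and URL patterns are taken from the comment.

```shell
# Sketch only: build the NVSHMEM archive URL using the filename and URL
# patterns proposed in the comment above.
# nvshmem_archive_url is a hypothetical helper, not a function from the script.
nvshmem_archive_url() {
  local nvshmem_version="$1"      # e.g. 3.3.20
  local cuda_major_version="$2"   # e.g. 12
  local arch_path="$3"            # e.g. x86_64
  local filename="libnvshmem-linux-${arch_path}-${nvshmem_version}_cuda${cuda_major_version}-archive"
  printf '%s\n' "https://developer.download.nvidia.com/compute/nvshmem/redist/libnvshmem/linux-${arch_path}/${filename}.tar.xz"
}

# Print the URL for NVSHMEM 3.3.20, CUDA 12, x86_64.
nvshmem_archive_url 3.3.20 12 x86_64
```

Keeping the pattern in one place means a future layout change on the download server would only need to be reflected once.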

nWEIdia · 2025-08-12T23:26:10Z

Not sure if these are needed, but a search with 3.3.9 also produces the following hits:

pytorch/.github/workflows/generated-linux-binary-manywheel-main.yml

Line 63 in 0d71ca2

    
           PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'

pytorch/.github/workflows/generated-linux-binary-manywheel-nightly.yml

Line 130 in 0d71ca2

    
           PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'

To be consistent, we might as well update all 3.3.9 to 3.3.20.
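The search described above could be repeated with something like the following sketch, which lists files that still pin the old NVSHMEM wheel version. The function name `find_stale_nvshmem_pins` is hypothetical; the pin string `nvidia-nvshmem-cu12==` is the one visible in the workflow lines quoted above.

```shell
# Sketch only: list files under a directory that still pin an old NVSHMEM
# wheel version. find_stale_nvshmem_pins is a hypothetical helper.
find_stale_nvshmem_pins() {
  local root="$1"          # directory to search, e.g. .github/workflows
  local old_version="$2"   # version to look for, e.g. 3.3.9
  # -r: recurse, -l: print matching file names only, -F: fixed-string match
  grep -rlF "nvidia-nvshmem-cu12==${old_version}" "$root"
}

# Example (run from a pytorch checkout):
#   find_stale_nvshmem_pins .github/workflows 3.3.9
```

Note that `grep` exits with a non-zero status when nothing matches, which is convenient as a quick check that no stale pins remain.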

nWEIdia · 2025-08-13T19:52:41Z

Hi @malfet, we fixed the link and 3.3.20 is now available:

https://developer.download.nvidia.com/compute/redist/nvshmem/3.3.20/builds/cuda12/txz/agnostic/

Please note that these archives now seem to have the suffix "xz", so

pytorch/.ci/docker/common/install_cuda.sh

Lines 65 to 66 in a354fa9

    
           filename="libnvshmem_cuda${cuda_major_version}-linux-${arch_path}-${nvshmem_version}" 
        
           url="https://developer.download.nvidia.com/compute/redist/nvshmem/${nvshmem_version}/builds/cuda${cuda_major_version}/txz/agnostic/${dl_arch}/${filename}.tar.gz"

would still need to be updated accordingly.

nWEIdia · 2025-08-15T00:41:39Z

Uh, in addition to the .gz to .xz change, the file name also seems to have changed:
libnvshmem-linux-x86_64-3.3.20_cuda12-archive.tar.xz instead of
libnvshmem_cuda12-linux-x86_64-3.3.20.tar.xz

"
changing to below should work:
filename="libnvshmem-linux-${arch_path}-${nvshmem_version}_cuda${cuda_major_version}-archive"
"

nWEIdia

Uh, it looks like there are other references to tar.gz.

E.g. after wget -q "${url}"
we have:
tar xf "${filename}.tar.gz"

So this needs yet another update...
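One way to avoid this kind of drift, sketched below under assumptions, is to derive the archive name once and reuse it for both the download and the extraction. `fetch_and_unpack` and its parameters are hypothetical and are not the code in install_cuda.sh; the sketch also assumes GNU tar, which detects the compression format when extracting.

```shell
# Sketch only: derive every archive reference from a single variable so that a
# suffix change (.tar.gz -> .tar.xz) only has to be made in one place.
# fetch_and_unpack is a hypothetical helper, not the function in the script.
fetch_and_unpack() {
  local base_url="$1"         # directory URL that contains the archive
  local filename="$2"         # archive name without the suffix
  local ext="${3:-tar.xz}"    # archive suffix, defaulting to the new format
  local archive="${filename}.${ext}"
  wget -q "${base_url}/${archive}"
  tar xf "${archive}"         # GNU tar auto-detects xz vs gzip on extraction
}

# Example (requires network access):
#   fetch_and_unpack "<directory URL>" "<archive name without suffix>"
```

With this shape, the `wget` and `tar` lines cannot disagree about the suffix, which is the mismatch pointed out above.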

Updated, and obviously I didn't plan to merge -f with all the signals red

nWEIdia · 2025-08-15T06:53:47Z

I guess I should have stacked the comments together:

filename="libnvshmem-linux-${arch_path}-${nvshmem_version}_cuda${cuda_major_version}-archive"

I promise this should be THE last change that is needed.

.ci/docker/common/install_cuda.sh

.ci/docker/common/install_cuda.sh

nWEIdia

Sorry, I have to eat my own words again.
The following:
cp -a "libnvshmem/include/"* /usr/local/cuda/include/
cp -a "libnvshmem/lib/"* /usr/local/cuda/lib64/
needs to be changed to:
cp -a "${filename}/include/"* /usr/local/cuda/include/
cp -a "${filename}/lib/"* /usr/local/cuda/lib64/

to accommodate the layout changes in the released tar.xz file.
Extracting v3.3.9 yields a libnvshmem directory;
extracting v3.3.20, as it currently stands, yields e.g. libnvshmem-linux-x86_64-3.3.20_cuda12-archive. Apologies for the inconvenience!
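The copy step described above might look like the sketch below. `install_nvshmem_files` and the `cuda_prefix` parameter are hypothetical and exist only so the copy can be tried against a scratch directory; the thread itself copies straight into /usr/local/cuda.

```shell
# Sketch only: copy headers and libraries out of the versioned directory that
# the 3.3.20 archive extracts to. install_nvshmem_files and cuda_prefix are
# hypothetical names introduced for illustration.
install_nvshmem_files() {
  local filename="$1"                        # extracted directory name
  local cuda_prefix="${2:-/usr/local/cuda}"  # destination CUDA prefix
  # The trailing slash before the glob matters: "${filename}/lib"* would match
  # the lib directory itself and nest it inside lib64.
  cp -a "${filename}/include/"* "${cuda_prefix}/include/"
  cp -a "${filename}/lib/"* "${cuda_prefix}/lib64/"
}

# Example:
#   install_nvshmem_files "libnvshmem-linux-x86_64-3.3.20_cuda12-archive"
```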

nWEIdia · 2025-08-16T00:50:06Z

The follow-up URL PR will be done in #160201 by @tinglvv

I would prefer that fixing glitches in the script (which is BE work) be kept separate from the CUDA-13 bringup

That has been the recommendation. @tinglvv up to you now :)

tinglvv · 2025-08-16T00:53:28Z

To clarify, I initially had to fix the URL to unblock the CUDA 13 PR, since the build was failing (when the packages had not yet been moved to the current location), so fixing the URL there was a nice added bonus at the time.

Since this PR has been merged and the CUDA 13 one will need a rebase, I will open a separate PR to update the URL.

wdvr · 2025-08-16T01:45:20Z

@pytorchmergebot revert -m "need to rerun workflow generation (failing workflow-checks)" -c landrace

pytorchmergebot · 2025-08-16T01:47:33Z

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

This reverts commit e0488d9. Reverted #160458 on behalf of https://github.com/wdvr due to need to rerun workflow generation (failing workflow-checks) ([comment](#160458 (comment)))

pytorchmergebot · 2025-08-16T01:47:46Z

@malfet your PR has been successfully reverted.

malfet · 2025-08-16T01:55:55Z

@pytorchmergebot revert -m "need to rerun workflow generation (failing workflow-checks)" -c nosignal

@wdvr it should have been a landrace, with @atalman's #160788

malfet · 2025-08-16T01:58:52Z

@pytorchbot merge -f "2nd time is the charm"

pytorchmergebot · 2025-08-16T02:00:46Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Which is manylinux2_28 compatible, even on aarch64 platform
archive contents and URL pattern changed quite drastically between 3.3.9 and 3.3.20, but hopefully it still works.
Package `libnvshmem_host.so.3` into gigantic aarch64+CUDA wheel
Should fix pytorch#160425
Pull Request resolved: pytorch#160458
Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/tinglvv

Update [ghstack-poisoned] (a36e5ec)

malfet requested review from a team and jeffdaily as code owners August 12, 2025 20:28

pytorch-bot bot added the topic: not user facing topic category label Aug 12, 2025

malfet added a commit that referenced this pull request Aug 12, 2025: [BE] Update nvshem dependency to 3.3.20 (57a96b9, ghstack-source-id: dcc80a4)

Skylion007 approved these changes Aug 12, 2025

kwen2501 approved these changes Aug 12, 2025

tinglvv reviewed Aug 12, 2025

Update [ghstack-poisoned] (104cac3)

malfet added a commit that referenced this pull request Aug 14, 2025: [BE] Update nvshem dependency to 3.3.20 (b37a1c2, ghstack-source-id: 2dd52d5)

nWEIdia previously requested changes Aug 15, 2025

Update [ghstack-poisoned] (cd6de16)

malfet added a commit that referenced this pull request Aug 15, 2025: [BE] Update nvshem dependency to 3.3.20 (c7306c9, ghstack-source-id: 0164b57)

nWEIdia reviewed Aug 15, 2025 (.ci/docker/common/install_cuda.sh, outdated, resolved)

Update [ghstack-poisoned] (0207887)

malfet added a commit that referenced this pull request Aug 15, 2025: [BE] Update nvshem dependency to 3.3.20 (e4ebc8f, ghstack-source-id: 1977470)

Skylion007 reviewed Aug 15, 2025 (.ci/docker/common/install_cuda.sh, outdated, resolved)

Update [ghstack-poisoned] (5427e5f)

malfet added a commit that referenced this pull request Aug 15, 2025: [BE] Update nvshem dependency to 3.3.20 (bf5c750, ghstack-source-id: d20634b)

atalman added the ciflow/binaries (Trigger all binary build and upload jobs on the PR) label Aug 15, 2025

atalman approved these changes Aug 15, 2025

nWEIdia approved these changes Aug 15, 2025

nWEIdia requested changes Aug 15, 2025

pytorchmergebot closed this in e0488d9 Aug 16, 2025

pytorchmergebot added Merged and removed merging labels Aug 16, 2025

pytorchmergebot added Reverted and ci-no-td (Do not run TD on this PR) labels Aug 16, 2025

pytorchmergebot reopened this Aug 16, 2025

Update [ghstack-poisoned] (afa4438)

malfet added a commit that referenced this pull request Aug 16, 2025: [BE] Update nvshem dependency to 3.3.20 (8175196, ghstack-source-id: 6eb2456)

pytorchmergebot added the merging label Aug 16, 2025

pytorchmergebot closed this in 7bd4cfa Aug 16, 2025

pytorchmergebot removed the merging label Aug 16, 2025

malfet mentioned this pull request Aug 18, 2025: [ci/cd][nightly] Linux-binary-libtorch - nvshmem errors #160877 (Closed)

github-actions bot deleted the gh/malfet/484/head branch September 16, 2025 02:07
