Use same NVSHMEM version across CUDA builds by kwen2501 · Pull Request #162206 · pytorch/pytorch

kwen2501 · 2025-09-04T20:31:33Z

Stack from ghstack (oldest at bottom):

-> Use same NVSHMEM version across CUDA builds #162206

#161321 bumped NVSHMEM version to 3.3.24 for CUDA 13, leaving CUDA 12 with 3.3.20.
This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well.
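
To illustrate the invariant this PR restores, here is a minimal sketch of a check that every CUDA build pins the same NVSHMEM version. The `EXTRA_REQUIREMENTS` mapping, its CUDA keys, and both helper functions are hypothetical and only illustrative; they are not the actual contents of `.github/scripts/generate_binary_build_matrix.py`.

```python
import re

# Hypothetical per-CUDA requirement pins; the keys and layout are assumptions
# made for illustration, not the real build-matrix data.
EXTRA_REQUIREMENTS = {
    "12.6": "nvidia-nvshmem-cu12==3.3.24",
    "12.8": "nvidia-nvshmem-cu12==3.3.24",
    "13.0": "nvidia-nvshmem-cu13==3.3.24",
}

# Matches a pinned NVSHMEM wheel and captures its version number.
NVSHMEM_PIN = re.compile(r"nvidia-nvshmem-cu\d+==([0-9.]+)")


def nvshmem_versions(requirements):
    """Map each CUDA version to the NVSHMEM version pinned for it."""
    versions = {}
    for cuda, spec in requirements.items():
        match = NVSHMEM_PIN.search(spec)
        if match:
            versions[cuda] = match.group(1)
    return versions


def check_consistent(requirements):
    """Return the single pinned NVSHMEM version, or raise if the pins differ."""
    versions = nvshmem_versions(requirements)
    distinct = set(versions.values())
    if len(distinct) > 1:
        raise ValueError(f"NVSHMEM versions differ across CUDA builds: {versions}")
    return distinct.pop() if distinct else None
```

With the mapping above, `check_consistent(EXTRA_REQUIREMENTS)` returns the shared version. A mapping that still pins 3.3.20 for a CUDA 12 build would raise `ValueError`.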

[ghstack-poisoned]

pytorch-bot · 2025-09-04T20:31:36Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162206

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 194cd4e with merge base 1f0b01d:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable) (gh) (similar failure)
test_sparse.py::TestSparseMPS::test_div_by_sparse_error_mps
trunk / win-vs2022-cpu-py3 / test (default, 1, 3, lf.windows.4xlarge.nonephemeral) (gh) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 134aa80 Pull-Request-resolved: #162206

kwen2501 · 2025-09-04T20:35:02Z

@tinglvv Does this PR make sense? Can you please review? Thanks!

tinglvv · 2025-09-04T20:47:27Z

Thanks! Looks good.

kwen2501 · 2025-09-04T21:09:35Z

@pytorchbot merge

pytorchmergebot · 2025-09-04T21:11:34Z

Merge failed

Reason: Approvers from one of the following sets are needed:

OSS CI (alband, dagitses, pytorch/pytorch-dev-infra)
superuser (pytorch/metamates)
Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

Skylion007 · 2025-09-04T22:25:33Z

Perfect, I was planning on doing this anyway:

pytorch/.ci/docker/common/install_cuda.sh

Line 13 in b04e922

NVSHMEM_VERSION=3.3.24

Looks like it's already properly updated for the static builds.

Skylion007 · 2025-09-04T22:25:51Z

@pytorchbot merge

pytorchmergebot · 2025-09-04T22:36:11Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2025-09-04T23:41:56Z

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-binary-manywheel / manywheel-py3_12-cuda12_8-test / test

Skylion007 · 2025-09-05T00:58:23Z

@atalman We need some new S3 uploads for nvidia wheels

kwen2501 · 2025-09-09T01:20:56Z

@pytorchbot merge

pytorchmergebot · 2025-09-09T01:23:12Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2025-09-09T01:39:21Z

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x cf1eb07b4d98ccb3890d4205c8c131ed4a021d86 returned non-zero exit code 1

Auto-merging .github/scripts/generate_binary_build_matrix.py
CONFLICT (content): Merge conflict in .github/scripts/generate_binary_build_matrix.py
Auto-merging .github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml
Auto-merging .github/workflows/generated-linux-binary-manywheel-nightly.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-binary-manywheel-nightly.yml
error: could not apply cf1eb07b4d9... Use same NVSHMEM version across CUDA builds
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"

[ghstack-poisoned]

This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well. Pull Request resolved: #162206 Approved by: https://github.com/tinglvv, https://github.com/Skylion007 ghstack-source-id: d5589c4

kwen2501 · 2025-09-09T16:23:12Z

There was a land conflict that led to mismatched yml generation.
Fixed.
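
A mismatch of this kind generally means the checked-in generated workflow files no longer agree with what the generator script would produce. Below is a minimal sketch of such a consistency check, assuming hypothetical in-memory mappings from file path to file content. The function names are illustrative, and this is not PyTorch's actual lint implementation.

```python
import difflib


def find_stale_files(regenerated, checked_in):
    """Return sorted paths whose checked-in content differs from the regenerated output."""
    stale = []
    for path in sorted(set(regenerated) | set(checked_in)):
        # A path missing on either side also counts as stale.
        if regenerated.get(path) != checked_in.get(path):
            stale.append(path)
    return stale


def show_diff(path, regenerated, checked_in):
    """Render a unified diff from the checked-in file to the regenerated one."""
    old = checked_in.get(path, "").splitlines(keepends=True)
    new = regenerated.get(path, "").splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old, new, fromfile=f"a/{path}", tofile=f"b/{path}")
    )
```

In practice the two mappings would be filled by running the generator into a scratch directory and reading the committed files. Any path returned by `find_stale_files` needs to be regenerated and recommitted.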

kwen2501 · 2025-09-09T16:23:21Z

@pytorchbot merge

pytorchmergebot · 2025-09-09T16:25:12Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2025-09-09T19:36:33Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable)

Skylion007 · 2025-09-09T20:35:26Z

@kwen2501 A new NVSHMEM release just dropped on PyPI that can use IBGDA on more devices. Should we upgrade it across the board?

Skylion007 · 2025-09-09T20:35:45Z

@pytorchbot merge

pytorchmergebot · 2025-09-09T20:37:57Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

kwen2501 · 2025-09-10T00:33:09Z

@Skylion007 Thanks!
Yep, makes sense to upgrade to 3.4 in main.
For 2.9 release, can we keep it 3.3 just to be safe? :)

Skylion007 · 2025-09-12T16:28:44Z

> @Skylion007 Thanks!
> Yep, makes sense to upgrade to 3.4 in main.
> For 2.9 release, can we keep it 3.3 just to be safe? :)

Let's do it

pytorch#161321 bumped NVSHMEM version to 3.3.24 for CUDA 13, leaving CUDA 12 with 3.3.20. This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well. Pull Request resolved: pytorch#162206 Approved by: https://github.com/tinglvv, https://github.com/Skylion007

This reverts commit 0d9c95c. Reverted pytorch#162206 on behalf of https://github.com/malfet due to Broke lint, see https://hud.pytorch.org/hud/pytorch/pytorch/4dd73e659a8fd4872e5f49cfd72e420fa7c4e6c9/1?per_page=50&name_filter=workflow-checks ([comment](pytorch#162206 (comment)))

Update

cb5ecca

[ghstack-poisoned]

kwen2501 requested a review from a team as a code owner September 4, 2025 20:31

kwen2501 added a commit that referenced this pull request Sep 4, 2025

Use same NVSHMEM version across CUDA builds

cf1eb07

ghstack-source-id: 134aa80 Pull-Request-resolved: #162206

pytorch-bot Bot added the topic: not user facing topic category label Sep 4, 2025

kwen2501 requested review from Skylion007, atalman and tinglvv September 4, 2025 20:32

kwen2501 mentioned this pull request Sep 4, 2025

[CD] Add CUDA 13.0 x86 nightly builds #160956

Closed

tinglvv approved these changes Sep 4, 2025

pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 4, 2025

pytorchmergebot added the merging label Sep 4, 2025

pytorchmergebot removed the merging label Sep 4, 2025

Skylion007 approved these changes Sep 4, 2025

pytorchmergebot added the merging label Sep 4, 2025

pytorchmergebot removed the merging label Sep 4, 2025

pytorchmergebot added the merging label Sep 9, 2025

pytorchmergebot removed the merging label Sep 9, 2025

Update

194cd4e

[ghstack-poisoned]

pytorchmergebot added the merging label Sep 9, 2025

pytorchmergebot removed the merging label Sep 9, 2025

pytorchmergebot added the merging label Sep 9, 2025

pytorchmergebot closed this in 8922bbc Sep 9, 2025

pytorchmergebot removed the merging label Sep 9, 2025

github-actions Bot deleted the gh/kwen2501/230/head branch October 13, 2025 02:15

5 participants
