[SymmMem] Add multimem support for NCCL and NVSHMEM #172185

kwen2501 wants to merge 7 commits into gh/kwen2501/306/base from …
Conversation
Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/172185

Note: Links to docs will display an error until the docs builds have been completed.

As of commit 657e3d7 with merge base 8cfe6f1: ❌ 1 New Failure, 61 Pending, 1 Unrelated Failure.

NEW FAILURE - The following job has failed: …

FLAKY - The following job failed but was likely due to flakiness present on trunk: …

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 1 mandatory check(s) failed. The first few are: … Dig deeper by viewing the failures on hud.

@pytorchbot merge

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m 'Sorry for reverting the change but I think it is failing vLLM benchmark job' -c nosignal

It fails with the error at https://github.com/pytorch/pytorch/actions/runs/20995910399/job/60431079759#step:15:15768. That job is supposed to use NCCL 2.28.9 (https://github.com/pytorch/pytorch/actions/runs/20995910399/job/60431079519#step:7:1134), coming from our pin in https://github.com/pytorch/pytorch/blob/main/.ci/docker/ci_commit_pins/nccl.txt. Let me know if this is an infra issue that we need to fix to unblock this change, because I only see it failing on DGX B200. cc @zou3519

@pytorchbot successfully started a revert job. Check the current status here.

Let me also check why https://github.com/pytorch/pytorch/actions/runs/20995910399/job/60431079759#step:15:15768 is reported as a success while it clearly should fail.

@kwen2501 your PR has been successfully reverted.
This reverts commit ed935ff. Reverted #172185 on behalf of https://github.com/huydhn due to Sorry for reverting the change but I think it is failing vLLM benchmark job ([comment](#172185 (comment)))
…172185)" This reverts commit 1c83214. Reverted pytorch#172185 on behalf of https://github.com/Skylion007 due to breaking CI builds with new nvshmem. See pytorch#172348 ([comment](pytorch#172185 (comment)))
Pull Request resolved: pytorch#172185 Approved by: https://github.com/Skylion007, https://github.com/dzmitry-huba ghstack dependencies: pytorch#172163
…172185)" This reverts commit ed935ff. Reverted pytorch#172185 on behalf of https://github.com/huydhn due to Sorry for reverting the change but I think it is failing vLLM benchmark job ([comment](pytorch#172185 (comment)))
Hi @huydhn, sorry, the break is caused by the PR dropping support for some APIs.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 1 job has failed; the first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 3, 6, linux.rocm.gpu.gfx942.1). Details for Dev Infra team: raised by workflow job.

@pytorchbot merge -f "Failed Rocm test/inductor/test_cuda_repro.py has been identified as flaky"

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Cherry-picked from upstream main:
- [SymmMem] Back symm_mem.empty() with implicit pool (pytorch#172292): automatic memory reuse for symmetric memory allocations
- [SymmMem] Add multimem support for NCCL and NVSHMEM (pytorch#172185): enhanced multi-GPU memory support
- [inductor] Basic Comm Buffer Reuse for Symmetric Memory (pytorch#171909): memory optimization for torch.compile with symmetric buffers
- [BE] Don't print 12 `triton not found` on import (pytorch#172614): QoL fix for flop_counter imports
- [inductor] Use custom triton kernel subclass when available (pytorch#167456): enables custom backend heuristics for Triton kernels

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Stack from ghstack (oldest at bottom):