[SymmMem] Add multimem support for NCCL and NVSHMEM #172185

Closed
kwen2501 wants to merge 7 commits into gh/kwen2501/306/base from gh/kwen2501/306/head
Conversation

@kwen2501
Collaborator

@kwen2501 kwen2501 commented Jan 11, 2026

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Jan 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/172185

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 61 Pending, 1 Unrelated Failure

As of commit 657e3d7 with merge base 8cfe6f1 (image):

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request Jan 11, 2026
@kwen2501 kwen2501 added the module: symm_mem and release notes: distributed (symm_mem) labels Jan 11, 2026
[ghstack-poisoned]
@kwen2501 kwen2501 requested a review from a team as a code owner January 12, 2026 06:58
kwen2501 added a commit that referenced this pull request Jan 12, 2026
[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Jan 12, 2026
@kwen2501
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Jan 12, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Jan 12, 2026
[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Jan 12, 2026
@kwen2501
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@huydhn
Contributor

huydhn commented Jan 15, 2026

@pytorchbot revert -m 'Sorry for reverting the change but I think it is failing vLLM benchmark job' -c nosignal

https://github.com/pytorch/pytorch/actions/runs/20995910399/job/60431079759#step:15:15768 with the error RuntimeError: Worker failed with error 'The `has_multicast_support` API is deprecated for SymmetricMemory handles. You can check if `get_multicast_ptr` returns a non-null pointer, or use the `c10d::symmetric_memory::has_multicast_support` API instead.', please check the stack trace above for the root cause

That job is supposed to use NCCL 2.28.9 (https://github.com/pytorch/pytorch/actions/runs/20995910399/job/60431079519#step:7:1134), which comes from our pin at https://github.com/pytorch/pytorch/blob/main/.ci/docker/ci_commit_pins/nccl.txt

Let me know if this is an infra issue that we need to fix to unblock this change, because I only see it failing on DGX B200.

cc @zou3519

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@huydhn
Contributor

huydhn commented Jan 15, 2026

Let me also check why https://github.com/pytorch/pytorch/actions/runs/20995910399/job/60431079759#step:15:15768 is reported as a success when it clearly should have failed.

@pytorchmergebot
Collaborator

@kwen2501 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Jan 15, 2026
This reverts commit ed935ff.

Reverted #172185 on behalf of https://github.com/huydhn due to Sorry for reverting the change but I think it is failing vLLM benchmark job ([comment](#172185 (comment)))
mattteochen pushed a commit to mattteochen/pytorch that referenced this pull request Jan 15, 2026
mattteochen pushed a commit to mattteochen/pytorch that referenced this pull request Jan 15, 2026
mattteochen pushed a commit to mattteochen/pytorch that referenced this pull request Jan 15, 2026
…172185)"

This reverts commit ed935ff.

Reverted pytorch#172185 on behalf of https://github.com/huydhn due to Sorry for reverting the change but I think it is failing vLLM benchmark job ([comment](pytorch#172185 (comment)))
[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Jan 16, 2026
@kwen2501
Collaborator Author

Hi @huydhn, sorry about that. The break was caused by this PR dropping support for some APIs.
I've added the support back.
Trying to re-land now.
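
For readers following along: the thread does not show how the support was restored, so the snippet below is only a sketch of one common pattern, in which a deprecated method keeps working and emits a warning instead of raising. The `Handle` class and its integer pointer are hypothetical stand-ins, not the real SymmetricMemory handle; only the method names `has_multicast_support` and `get_multicast_ptr` come from the error message quoted above.

```python
import warnings


class Handle:
    """Hypothetical stand-in for a symmetric-memory handle (not the PyTorch class)."""

    def __init__(self, multicast_ptr: int) -> None:
        # Assumption for illustration: 0 models a null multicast pointer.
        self._multicast_ptr = multicast_ptr

    def get_multicast_ptr(self) -> int:
        # Preferred check: a non-zero value means multicast is available.
        return self._multicast_ptr

    def has_multicast_support(self) -> bool:
        # Deprecated path: warn and delegate rather than raise, so existing
        # callers keep working while they migrate.
        warnings.warn(
            "has_multicast_support is deprecated; "
            "check get_multicast_ptr() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.get_multicast_ptr() != 0
```

With this shape, a downstream caller that still uses the old method gets a `DeprecationWarning` rather than a `RuntimeError`, which is the difference between a noisy log and a failed job.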

@kwen2501
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 3, 6, linux.rocm.gpu.gfx942.1)

Details for Dev Infra team Raised by workflow job

@kwen2501
Collaborator Author

@pytorchbot merge -f "Failed Rocm test/inductor/test_cuda_repro.py has been identified as flaky"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

DustyL pushed a commit to DustyL/pytorch that referenced this pull request Jan 17, 2026
Cherry-picked from upstream main:

- [SymmMem] Back symm_mem.empty() with implicit pool (pytorch#172292)
  Automatic memory reuse for symmetric memory allocations

- [SymmMem] Add multimem support for NCCL and NVSHMEM (pytorch#172185)
  Enhanced multi-GPU memory support

- [inductor] Basic Comm Buffer Reuse for Symmetric Memory (pytorch#171909)
  Memory optimization for torch.compile with symmetric buffers

- [BE] Don't print 12 `triton not found` on import (pytorch#172614)
  QoL fix for flop_counter imports

- [inductor] Use custom triton kernel subclass when available (pytorch#167456)
  Enables custom backend heuristics for Triton kernels

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions github-actions bot deleted the gh/kwen2501/306/head branch February 16, 2026 02:23
Labels

ci-no-td, ciflow/h100-symm-mem, ciflow/trunk, Merged, module: symm_mem, open source, release notes: distributed (c10d), release notes: distributed (symm_mem), Reverted

6 participants