Enables NCCL symmetric memory kernels through mempool registration #155134
syed-ahmed wants to merge 4 commits into gh/syed-ahmed/2/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155134
Note: Links to docs will display an error until the docs builds have been completed. ⏳ No Failures, 1 Pending as of commit acd9b65 with merge base 1036f6d. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Needs NCCL 2.27.
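A guard for this version requirement can be sketched with NCCL's version encoding. The helper names below (`nccl_version_code`, `supports_symm_mem`) are illustrative assumptions, not the PR's actual code; the encoding mirrors NCCL's `NCCL_VERSION(X,Y,Z)` macro for releases since 2.9, re-defined locally so the sketch is self-contained:

```cpp
#include <cassert>

// Mirrors NCCL's version encoding for releases >= 2.9:
// NCCL_VERSION(X,Y,Z) == X*10000 + Y*100 + Z.
// Re-defined locally so this sketch does not depend on nccl.h.
constexpr int nccl_version_code(int major, int minor, int patch) {
  return major * 10000 + minor * 100 + patch;
}

// Hypothetical guard: mempool symmetric-memory registration needs NCCL >= 2.27.
// In real code the runtime version would come from ncclGetVersion().
bool supports_symm_mem(int runtime_version) {
  constexpr int kRequired = nccl_version_code(2, 27, 0);
  return runtime_version >= kRequired;
}
```

In practice the check would compare the value reported by `ncclGetVersion` at runtime, so a build against older headers still fails gracefully rather than at link time.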
kwen2501 left a comment:
Thanks! LGTM in general. Left some comments inline.
c10/cuda/CUDACachingAllocator.h (Outdated)

```diff
  ~MemPool();
  MempoolId_t id();
+ bool symm_mem();
```

```diff
  // register future segments allocated in this pool (this call is idempotent).
  attachAllocatorHooks();
- auto snapshot = c10::cuda::CUDACachingAllocator::snapshot(pool->id());
+ auto snapshot = c10::cuda::CUDACachingAllocator::snapshot();
```
Why is pool->id() removed?

Reply: I think this is a typo from when I had to rebase. We should have the pool id.
```cpp
// TODO:
// if(pool->symm_mem()) {
```
Do you feel there is some granularity mismatch here?

- pool->symm_mem() is a pool-level attribute;
- in the registerSegment function above, it seems possible to create a symmetric window for some segments while not for others (the previous ncclCommRegister call).

Reply: Yeah, because symm_mem is a pool-level attribute, we should make sure all memory in it is symmetric. I don't have a good solution right now other than having these two all-gathers check and raise an error if the pool becomes non-symmetric. Left this as a TODO because I'm not sure how it would impact performance, or whether we should only enable these checks as a debugging feature.
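The cross-rank check described in the reply can be sketched on gathered per-rank state. The all-gather itself is simulated with a plain vector here, and `PoolState`/`check_pool_symmetric` are assumed names for illustration, not the PR's code:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical per-rank summary that an all-gather over the communicator
// would collect before relying on a symmetric-memory pool.
struct PoolState {
  size_t num_segments; // segments currently in the pool on this rank
  bool symm_mem;       // pool-level symm_mem attribute on this rank
};

// Raise if the pool is no longer symmetric across ranks: every rank must
// report the symm_mem flag set and the same segment count, otherwise some
// ranks would hold windows that others never created.
void check_pool_symmetric(const std::vector<PoolState>& gathered) {
  if (gathered.empty()) {
    return;
  }
  const PoolState& ref = gathered.front();
  for (const PoolState& s : gathered) {
    if (!s.symm_mem || s.num_segments != ref.num_segments) {
      throw std::runtime_error(
          "MemPool with symm_mem=true became non-symmetric across ranks");
    }
  }
}
```

As the reply notes, such a check costs collectives on every registration, which is why gating it behind a debugging flag is a plausible design.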
2.27 upgrade here: #155233
kwen2501 left a comment:
Approving to unblock.
NCCL 2.27 has landed -- do you mind rebasing this PR to make sure all tests pass? Then hopefully we can land this feature in PyTorch 2.8. Thanks!
torch/csrc/cuda/MemPool.cpp (Outdated)

```diff
  bool is_user_created,
- bool use_on_oom) {
+ bool use_on_oom,
+ bool symm_mem) {
```

```diff
  // register future segments allocated in this pool (this call is idempotent).
  attachAllocatorHooks();
- auto snapshot = c10::cuda::CUDACachingAllocator::snapshot(pool->id());
+ auto snapshot = c10::cuda::CUDACachingAllocator::snapshot();
```
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.

Successfully rebased.

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).

Merge failed. Reason: 1 mandatory check(s) failed.

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
`all_to_all_vdev` is not a binding of NVSHMEM APIs. Removing the `nvshmem_` prefix. Pull Request resolved: #156582. Approved by: https://github.com/fduwjj. ghstack dependencies: #155134.
As mentioned here: #155134 (comment). Pull Request resolved: #157293. Approved by: https://github.com/Skylion007.
Thank you for adding this support, @syed-ahmed. Why didn't the PR include user-facing documentation? What is the point of having a great feature if nobody knows about it? Thank you.
Stack from ghstack (oldest at bottom):
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k