Skip to content

Enables NCCL symmetric memory kernels through mempool registration#155134

Closed
syed-ahmed wants to merge 4 commits intogh/syed-ahmed/2/basefrom
gh/syed-ahmed/2/head
Closed

Enables NCCL symmetric memory kernels through mempool registration#155134
syed-ahmed wants to merge 4 commits intogh/syed-ahmed/2/basefrom
gh/syed-ahmed/2/head

Conversation

@syed-ahmed
Copy link
Collaborator

@syed-ahmed syed-ahmed commented Jun 4, 2025

[ghstack-poisoned]
@syed-ahmed syed-ahmed requested a review from eqy as a code owner June 4, 2025 17:16
@pytorch-bot
Copy link

pytorch-bot bot commented Jun 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155134

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit acd9b65 with merge base 1036f6d (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Jun 4, 2025
syed-ahmed added a commit that referenced this pull request Jun 4, 2025
@syed-ahmed syed-ahmed marked this pull request as draft June 4, 2025 17:17
@syed-ahmed
Copy link
Collaborator Author

Needs NCCL 2.27.

@syed-ahmed syed-ahmed requested a review from kwen2501 June 4, 2025 18:15
Copy link
Collaborator

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM in general. Left some comments inline.

~MemPool();

MempoolId_t id();
bool symm_mem();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is_symmetric()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

// register future segments allocated in this pool (this call is idempotent).
attachAllocatorHooks();
auto snapshot = c10::cuda::CUDACachingAllocator::snapshot(pool->id());
auto snapshot = c10::cuda::CUDACachingAllocator::snapshot();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is pool->id() removed?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@syed-ahmed thought?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this a typo when I had to rebase. We should have the pool id.

Comment on lines +1172 to +1173
// TODO:
// if(pool->symm_mem()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you feel there is some granularity mismatch here?

  • pool->symm_mem() is a pool-level attribute;
  • In the registerSegment function above, it seems possible to create symmetric window for some segments while not for others (the previous ncclCommRegister call)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah because symm_mem is a pool level attribute, we should make sure all memory in it is symmetric. I don't have a good solution right now other than having these two all gathers check and raise an error if the pool becomes not symmetric. Left this as a todo because not sure how it would impact performance or if we should only have these checks as a debugging feature.

@kwen2501
Copy link
Collaborator

kwen2501 commented Jun 6, 2025

2.27 upgrade here: #155233

@syed-ahmed syed-ahmed marked this pull request as ready for review June 17, 2025 17:37
Copy link
Collaborator

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to unblock.
NCCL 2.27 has landed -- do you mind rebase this PR to make sure all tests pass?
Then hopefully we can land this feature in PyTorch 2.8.
Thanks!

bool is_user_created,
bool use_on_oom) {
bool use_on_oom,
bool symm_mem) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent

// register future segments allocated in this pool (this call is idempotent).
attachAllocatorHooks();
auto snapshot = c10::cuda::CUDACachingAllocator::snapshot(pool->id());
auto snapshot = c10::cuda::CUDACachingAllocator::snapshot();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@syed-ahmed thought?

@kwen2501
Copy link
Collaborator

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Jun 21, 2025
@pytorchmergebot
Copy link
Collaborator

Successfully rebased gh/syed-ahmed/2/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/155134)

[ghstack-poisoned]
@kwen2501
Copy link
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 21, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

[ghstack-poisoned]
kwen2501 pushed a commit that referenced this pull request Jun 21, 2025
@kwen2501
Copy link
Collaborator

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request Jun 23, 2025
`all_to_all_vdev` are not binding of NVSHMEM APIs. Removing the `nvshmem_` prefix.

Pull Request resolved: #156582
Approved by: https://github.com/fduwjj
ghstack dependencies: #155134

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kwen2501 looks like argument name change symm_mem -> symmetric didn't apply to the test case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching! Added the fix: #157293

pytorchmergebot pushed a commit that referenced this pull request Jul 9, 2025
@stas00
Copy link
Contributor

stas00 commented Jul 23, 2025

Thank you for adding this support, @syed-ahmed

Why didn't the PR include user-facing documentation? What is the point of having a great feature if nobody knows about it?

Thank you.

ngimel added a commit that referenced this pull request Aug 22, 2025
pytorchmergebot pushed a commit that referenced this pull request Aug 22, 2025
pytorchmergebot pushed a commit that referenced this pull request Aug 24, 2025
@github-actions github-actions bot deleted the gh/syed-ahmed/2/head branch August 31, 2025 02:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants