Skip to content

[nccl symm mem] don't use arg for mempool, correctly use symmetric registration in hooks#161238

Closed
ngimel wants to merge 4 commits intomainfrom
ngimel/fix_nccl_segment_seg
Closed

[nccl symm mem] don't use arg for mempool, correctly use symmetric registration in hooks#161238
ngimel wants to merge 4 commits intomainfrom
ngimel/fix_nccl_segment_seg

Conversation

@ngimel
Copy link
Collaborator

@ngimel ngimel commented Aug 22, 2025

@ngimel ngimel requested review from eqy and syed-ahmed as code owners August 22, 2025 03:08
@pytorch-bot
Copy link

pytorch-bot bot commented Aug 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161238

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4e06b19 with merge base 1de4540 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Aug 22, 2025
@ngimel ngimel requested a review from kwen2501 August 22, 2025 03:10
Copy link
Collaborator

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@ngimel ngimel added suppress-bc-linter Suppresses the failures of API backward-compatibility linter (Lint/bc_linter) ciflow/h100-distributed labels Aug 22, 2025
memPool_ = std::make_unique<c10::cuda::MemPool>(allocator);
// Register so that we call ncclCommRegister on all new allocations
registerMemPool(memPool_.get());
registerMemPool(memPool_.get(), false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Falling into positional boolean trap. Would an enum make more sense?

@ngimel
Copy link
Collaborator Author

ngimel commented Aug 22, 2025

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased ngimel/fix_nccl_segment_seg onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout ngimel/fix_nccl_segment_seg && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the ngimel/fix_nccl_segment_seg branch from 0ccc6b9 to 951d38c Compare August 22, 2025 23:08
@ngimel
Copy link
Collaborator Author

ngimel commented Aug 23, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 23, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@ngimel
Copy link
Collaborator Author

ngimel commented Aug 23, 2025

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Tried to rebase and push PR #161238, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@ngimel
Copy link
Collaborator Author

ngimel commented Aug 24, 2025

@pytorchbot rebase -b main

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased ngimel/fix_nccl_segment_seg onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout ngimel/fix_nccl_segment_seg && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the ngimel/fix_nccl_segment_seg branch from 951d38c to 4e06b19 Compare August 24, 2025 05:22
@ngimel
Copy link
Collaborator Author

ngimel commented Aug 25, 2025

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/h100-distributed ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category suppress-bc-linter Suppresses the failures of API backward-compatibility linter (Lint/bc_linter)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants