[c10d] Error out the case when registering symmetric memory without eager init#160145

Closed
fduwjj wants to merge 5 commits into gh/fduwjj/179/base from gh/fduwjj/179/head
@fduwjj
Contributor

@fduwjj fduwjj commented Aug 7, 2025

Stack from ghstack (oldest at bottom):

Instead of implicitly creating an NCCL communicator inside mem pool registration for symmetric memory, we now raise an error, so that registration is only supported in the eager init case, where the NCCL communicator has already been initialized.

cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k @pragupta
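
To make the behavior change described above concrete, here is a minimal pure-Python sketch. It is a toy model, not the c10d implementation: `ToyBackend`, `_create_comm`, `register_mem_pool_lazy`, and the error text are all hypothetical names invented for illustration.

```python
class ToyBackend:
    """Hypothetical stand-in for a process-group backend (not PyTorch code)."""

    def __init__(self, eager_init: bool = False):
        self._comm = None
        self.registered_pools = []
        if eager_init:
            # Eager init: the communicator exists before any registration.
            self._comm = self._create_comm()

    def _create_comm(self):
        # Placeholder for creating a real communicator handle.
        return object()

    def register_mem_pool_lazy(self, pool):
        # Behavior being removed: silently create the communicator on demand.
        if self._comm is None:
            self._comm = self._create_comm()
        self.registered_pools.append(pool)

    def register_mem_pool(self, pool):
        # Behavior after the change: refuse to register without a communicator.
        if self._comm is None:
            raise RuntimeError(
                "communicator not initialized; eager init is required "
                "before registering a mem pool"
            )
        self.registered_pools.append(pool)
```

With this model, `ToyBackend(eager_init=False).register_mem_pool(pool)` raises, while the same call on `ToyBackend(eager_init=True)` succeeds, so a missing eager init surfaces as an explicit error instead of a hidden communicator creation.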

@pytorch-bot

pytorch-bot bot commented Aug 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160145

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e464581 with merge base f33ce40:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

fduwjj added a commit that referenced this pull request Aug 7, 2025
…ager init

ghstack-source-id: ea4e1f4
Pull Request resolved: #160145
@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Aug 7, 2025
fduwjj added a commit that referenced this pull request Aug 8, 2025
…ager init

ghstack-source-id: 26f2135
Pull Request resolved: #160145
@fduwjj fduwjj requested a review from syed-ahmed August 8, 2025 04:21
fduwjj added a commit that referenced this pull request Aug 8, 2025
…ager init

ghstack-source-id: 23cefcf
Pull Request resolved: #160145
@fduwjj fduwjj requested a review from Skylion007 August 8, 2025 21:14
Collaborator

@kwen2501 kwen2501 left a comment

LGTM.
cc @syed-ahmed in case there are other thoughts. [Edit: just saw that @syed-ahmed has reviewed this PR too]

fduwjj added a commit that referenced this pull request Aug 12, 2025
…ager init

ghstack-source-id: b03d79c
Pull Request resolved: #160145
@fduwjj fduwjj added the ciflow/trunk label Aug 12, 2025
@fduwjj
Contributor Author

fduwjj commented Aug 12, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Successfully rebased gh/fduwjj/179/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/160145)

pytorchmergebot pushed a commit that referenced this pull request Aug 12, 2025
…ager init

ghstack-source-id: 08bf205
Pull Request resolved: #160145
@fduwjj
Contributor Author

fduwjj commented Aug 12, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@ngimel
Collaborator

ngimel commented Aug 12, 2025

I'm slightly surprised you didn't need to change `test_nccl_user_buffer_registration`:

    c10d.init_process_group(
        backend="nccl", rank=self.rank, world_size=self.world_size, store=store
    )
    device = torch.device(f"cuda:{self.rank}")
    torch.cuda.set_device(self.rank)
    pg = c10d.distributed_c10d._get_default_group()
    backend = pg._get_backend(torch.device(device))
    # Use NCCL memory allocator
    pool = torch.cuda.MemPool(backend.mem_allocator)
    # allocate memory with ncclMemAlloc
    with torch.cuda.use_mem_pool(pool):
        tensor = torch.arange(1024 * 1024 * 2, device=device)
    # register buffers to NCCL
    backend.register_mem_pool(pool)

Is it because we are using NCCL comms between tests, so that by the time we run it the communicator is already created? Would this test fail if run standalone?
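
For readers following along, here is a hedged sketch of how the registration sequence quoted above might be made independent of any earlier communicator creation. It assumes that passing `device_id` to `init_process_group` triggers eager NCCL communicator creation; this is not necessarily the change made in the follow-up fix. The wrapper `register_pool_with_eager_init` is a hypothetical helper, and running it requires a CUDA/NCCL build of PyTorch with one GPU per rank.

```python
def register_pool_with_eager_init(rank: int, world_size: int, store):
    """Register an NCCL-allocated mem pool after eager communicator init."""
    # Imported lazily so that importing this sketch does not itself require
    # a CUDA/NCCL build of PyTorch.
    import torch
    import torch.distributed as dist

    device = torch.device(f"cuda:{rank}")
    torch.cuda.set_device(device)

    # Assumption: device_id makes the NCCL backend create its communicator
    # during init_process_group instead of lazily on first use.
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        store=store,
        device_id=device,
    )

    pg = dist.distributed_c10d._get_default_group()
    backend = pg._get_backend(device)

    # Use the NCCL memory allocator, as in the quoted test.
    pool = torch.cuda.MemPool(backend.mem_allocator)
    # Allocate memory through the pool.
    with torch.cuda.use_mem_pool(pool):
        tensor = torch.arange(1024 * 1024 * 2, device=device)

    # With the communicator already created, registration should not need
    # to create one implicitly.
    backend.register_mem_pool(pool)
    return backend, pool, tensor
```

Each rank would call this with its own `rank` and a shared `store`. Whether registration then succeeds depends on the PyTorch version in use.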

@fduwjj
Contributor Author

fduwjj commented Aug 13, 2025

@ngimel Hmm, good point. Is it because test_nccl_user_buffer_registration only runs in the periodic jobs, since it uses 4 GPUs?

@fduwjj
Contributor Author

fduwjj commented Aug 13, 2025

@ngimel fixed in #160497

pytorchmergebot pushed a commit that referenced this pull request Aug 13, 2025
Fixed `test_nccl_user_buffer_registration`, which broke due to #160145; somehow CI didn't capture it.

Pull Request resolved: #160497
Approved by: https://github.com/ngimel
chuanhaozhuge pushed a commit that referenced this pull request Aug 14, 2025
…ager init (#160145)

Instead of implicitly creating an NCCL communicator inside mem pool registration for symmetric memory, we now raise an error, so that registration is only supported in the eager init case, where the NCCL communicator has already been initialized.

Pull Request resolved: #160145
Approved by: https://github.com/kwen2501
@github-actions github-actions bot deleted the gh/fduwjj/179/head branch September 13, 2025 02:05

Labels

ciflow/trunk, Merged, oncall: distributed, release notes: distributed (c10d)
