[c10d] Error out the case when registering symmetric memory without eager init #160145
fduwjj wants to merge 5 commits into gh/fduwjj/179/base
…ager init [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160145
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit e464581 with merge base f33ce40.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…y without eager init" Instead of implicitly creating the NCCL comm inside mem pool registration for symmetric memory, we now raise an error, so that only the eager init case, where the NCCL comm is already initialized, is supported. cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta [ghstack-poisoned]
LGTM.
cc @syed-ahmed in case there are other thoughts. [Edit: just saw that @syed-ahmed has reviewed this PR too]
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
I'm slightly surprised you didn't need to change test_nccl_user_buffer_registration (pytorch/test/distributed/test_c10d_nccl.py, lines 3130 to 3146 in e464581).
@ngimel Hmmm good point, is it because
Fixed `test_nccl_user_buffer_registration` following #160145; somehow CI didn't catch it. Pull Request resolved: #160497 Approved by: https://github.com/ngimel
…ager init (#160145) Instead of implicitly creating the NCCL comm inside mem pool registration for symmetric memory, we now raise an error, so that only the eager init case, where the NCCL comm is already initialized, is supported. Pull Request resolved: #160145 Approved by: https://github.com/kwen2501
Stack from ghstack (oldest at bottom):
Instead of implicitly creating the NCCL comm inside mem pool registration for symmetric memory, we now raise an error, so that only the eager init case, where the NCCL comm is already initialized, is supported.
cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k @pragupta
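The behavior change described above can be illustrated with a minimal, pure-Python sketch. It is not the PyTorch implementation: `FakeBackend`, `eager_init`, `register_mem_pool`, and the error text are all hypothetical names chosen for illustration. The sketch only models the policy from the description: registration refuses to proceed when no communicator exists, rather than creating one implicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeBackend:
    """Hypothetical stand-in for a process-group backend; not a PyTorch class."""

    # None models "communicator not created yet" (lazy init).
    comm: Optional[object] = None

    def eager_init(self) -> None:
        # Models creating the communicator up front.
        self.comm = object()

    def register_mem_pool(self, pool_name: str) -> str:
        # Policy from the PR description: raise an error instead of
        # implicitly creating the communicator during registration.
        if self.comm is None:
            raise RuntimeError(
                f"cannot register mem pool {pool_name!r}: communicator is "
                "not initialized; eager init is required"
            )
        return f"registered {pool_name}"


# Lazy case: no communicator yet, so registration is rejected.
lazy = FakeBackend()
try:
    lazy.register_mem_pool("symm_pool")
    lazy_error = None
except RuntimeError as exc:
    lazy_error = str(exc)

# Eager case: the communicator exists before registration.
eager = FakeBackend()
eager.eager_init()
eager_result = eager.register_mem_pool("symm_pool")
```

In real PyTorch code, eager initialization of the NCCL communicator is, to my understanding, usually requested by passing `device_id` to `torch.distributed.init_process_group`; the exact error message and call sites touched by this PR are not shown in this conversation, so the sketch does not attempt to reproduce them.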