[c10d] Support optional backend if device_id provided#140963
kwen2501 wants to merge 1 commit into gh/kwen2501/95/base
Conversation
🔗 See artifacts and rendered test results at hud.pytorch.org/pr/140963

✅ No failures as of commit af71423 with merge base ffb9790 (Dr. CI).
```python
# >>> init_process_group()
# we set it to `undefined` and rely on lazy init.
if backend is None:
    backend = "undefined"
```
What does `backend = "undefined"` do? Is it going to throw an error inside `Backend` below, or does it somehow find a default one later?
It later gets translated into "cuda:nccl,cpu:gloo", IIRC.

I guess I can make it go away? This PR just didn't touch that aspect.
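For context, a rough sketch of what that translation might look like. The mapping and function name here are assumptions for illustration, not the actual c10d code:

```python
from typing import Optional

# Assumed device-type -> default-backend mapping (illustrative only).
DEFAULT_DEVICE_BACKENDS = {"cuda": "nccl", "cpu": "gloo"}

def resolve_backend(backend: Optional[str]) -> str:
    """Expand a missing or "undefined" backend into per-device defaults."""
    if backend is None or backend == "undefined":
        # Produces a combined string like "cuda:nccl,cpu:gloo",
        # which lazy init can later pick from per device type.
        return ",".join(f"{dev}:{be}" for dev, be in DEFAULT_DEVICE_BACKENDS.items())
    return backend
```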
```python
backend_list = [UNDEFINED, GLOO, NCCL, UCC, MPI]
```
```python
# 3rd-party devices can register the default backend support here
```
Do you expect users to modify this dict directly, or will we also add register/unregister APIs for third-party devices and backends? There are multiple dicts here, and register/unregister APIs could keep them consistent without users' awareness, and would also be clearer.
We also need to consider supporting the privateuse1 device. cc @shink
Yeah, good idea. We can provide registration APIs, which would ideally be called when a third-party module is imported, so that no user involvement is needed. Let me add it in a next PR.
LGTM
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Citing @malfet's [comment](pytorch#136343 (review)) in pytorch#136343:

> It would be great if users do not have to modify their programs for every new backend, but rather use `with torch.device('xpu'):` and keep the rest of the code unchanged.

This PR makes the backend specification ("nccl", "gloo") optional when the user provides a `device_id` to `init_process_group` (acceptance of `device_id` was previously supported for the purpose of eager init).

New user experience:

```python
device = torch.device(device_type, rank % device_count)
dist.init_process_group(device_id=device)
```

The line `device = torch.device(...)` is needed anyway, because the user would use it for tensor creation etc.

Pull Request resolved: pytorch#140963

Approved by: https://github.com/wconstab
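The backend-inference step this PR enables can be sketched as follows. The table and function name are illustrative assumptions, not the actual c10d internals:

```python
from typing import Optional

# Illustrative device-type -> default-backend table.
DEFAULT_BACKENDS = {"cuda": "nccl", "cpu": "gloo"}

def infer_backend(device_type: str, backend: Optional[str]) -> str:
    """Return the explicit backend, or the default for device_id's type."""
    if backend is not None:
        return backend  # an explicitly passed backend still wins
    if device_type not in DEFAULT_BACKENDS:
        raise ValueError(f"no default backend for device type {device_type!r}")
    return DEFAULT_BACKENDS[device_type]
```

With something like this in place, `init_process_group(device_id=device)` can pick "nccl" for a CUDA device and "gloo" for CPU without the caller naming a backend.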
Stack from ghstack (oldest at bottom):

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @zhangxiaoli73 @Chao1Han @jgong5