[c10d] Support optional backend if device_id provided#140963
kwen2501 wants to merge 1 commit into gh/kwen2501/95/base
Conversation
🔗 See artifacts and rendered test results at hud.pytorch.org/pr/140963

✅ No failures as of commit af71423 with merge base ffb9790 (Dr. CI).
```python
# >>> init_process_group()
# we set it to `undefined` and rely on lazy init.
if backend is None:
    backend = "undefined"
```
What does `backend = "undefined"` do? Is it going to throw an error inside `Backend` below, or does it somehow find a default one later?
It later gets translated into "cuda:nccl,cpu:gloo", IIRC.

I guess I can make it go away? This PR just didn't touch that aspect.
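For context, a rough sketch of what that translation might look like. The mapping and function name here are assumptions for illustration, not the actual c10d code:

```python
from typing import Optional

# Assumed device-type -> default-backend mapping (illustrative only).
DEFAULT_DEVICE_BACKENDS = {"cuda": "nccl", "cpu": "gloo"}

def resolve_backend(backend: Optional[str]) -> str:
    """Expand a missing or "undefined" backend into per-device defaults."""
    if backend is None or backend == "undefined":
        # Produces a combined string like "cuda:nccl,cpu:gloo",
        # which lazy init can later pick from per device type.
        return ",".join(f"{dev}:{be}" for dev, be in DEFAULT_DEVICE_BACKENDS.items())
    return backend
```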
```python
backend_list = [UNDEFINED, GLOO, NCCL, UCC, MPI]
```
```python
# 3rd-party devices can register the default backend support here
```
Do you expect users to modify this dict directly, or will we also add register/unregister APIs for third-party devices and backends? There are multiple dicts here, and register/unregister APIs could keep them consistent without users' awareness, and would also be clearer.
We also need to consider supporting the privateuse1 device. cc @shink
Yeah, good idea. We can provide registration APIs, which would ideally be called when a third-party module is imported, so that no user involvement is needed. Let me add it in a next PR.
LGTM
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Citing @malfet's [comment](pytorch#136343 (review)) in pytorch#136343:

> It would be great if users do not have to modify their programs for every new backend, but rather use `with torch.device('xpu'):` and keep the rest of the code unchanged.

This PR makes the backend specification ("nccl", "gloo") optional when the user provides a `device_id` to `init_process_group` (acceptance of `device_id` was previously supported for the purpose of eager init).

New user experience:

```python
device = torch.device(device_type, rank % device_count)
dist.init_process_group(device_id=device)
```

The line `device = torch.device(...)` is needed anyway, because the user would use it for tensor creation etc.

Pull Request resolved: pytorch#140963

Approved by: https://github.com/wconstab
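The backend-inference step this PR enables can be sketched as follows. The table and function name are illustrative assumptions, not the actual c10d internals:

```python
from typing import Optional

# Illustrative device-type -> default-backend table.
DEFAULT_BACKENDS = {"cuda": "nccl", "cpu": "gloo"}

def infer_backend(device_type: str, backend: Optional[str]) -> str:
    """Return the explicit backend, or the default for device_id's type."""
    if backend is not None:
        return backend  # an explicitly passed backend still wins
    if device_type not in DEFAULT_BACKENDS:
        raise ValueError(f"no default backend for device type {device_type!r}")
    return DEFAULT_BACKENDS[device_type]
```

With something like this in place, `init_process_group(device_id=device)` can pick "nccl" for a CUDA device and "gloo" for CPU without the caller naming a backend.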
Stack from ghstack (oldest at bottom):

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @zhangxiaoli73 @Chao1Han @jgong5