Make distributed modules importable even when backend not built #159889
ezyang wants to merge 33 commits into gh/ezyang/3122/base from
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159889
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 0227778 with merge base 32911ff
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.

People were generally positive in chat so I'm going to push on this
The patch is now dramatically improved and you can review it IGNORING whitespace. However, to do this I had to remove indentation, which will also make it prone to merge conflicts. The operator registrations are kind of sus; that's the last thing for me to review.
…ild"

I did this PR with claude code. No big deal if we decide to reject it. The goal is to get ar.py (checked in, needs to be removed) to run on a CPU build of PyTorch.

The general strategy is to generate stubs for all of the things from the c10d C extension when it's not available. There's a new module torch/distributed/_distributed_c10d.py that mediates all of the C extension imports so that we generate all the stubs in Python in this case.

If we agree that we are willing to do this, I will do a careful code review and clean up problems before subsequent review. The PR here is just to give you the flavor of the change.

Signed-off-by: Edward Yang <ezyang@meta.com>

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta

[ghstack-poisoned]
@@ -0,0 +1,41 @@
# Copyright (c) Meta Platforms, Inc. and affiliates
# Owner(s): ["oncall: distributed"]
Does this need to be the first line for it to work properly?
Well, if it needs to be, y'all should fix the linter lol
ezyang-mac:pytorch-tmp3 ezyang$ git grep -n Owner\(s | grep 2: | wc -l
114
torch/distributed/_op_stubs.py (Outdated)
if not HAS_DISTRIBUTED:
    # Define missing c10d operators
    with torch.library._scoped_library("c10d", "DEF") as lib_def:
This shouldn't be a scoped lib but a regular library if we want the def to remain, right?
I find this very confusing, because I see uses of scoped library that seem to imply it is not gc'ed at the end. And empirically these library defs were enough to stop errors:
with torch.library._scoped_library("aten", "IMPL") as aten:
aten.impl("_nested_get_jagged_dummy", _nested_get_jagged_dummy, "CPU")
aten.impl("_nested_get_jagged_dummy", _nested_get_jagged_dummy, "CUDA")
aten.impl("_nested_get_jagged_dummy", _nested_get_jagged_dummy, "Meta")
the lib is definitely cleaned up... @zou3519 maybe you remember off the top of your head?
Fortunately, in the new version of this PR, I don't need any of this!
scoped library does gc at the end. Any of the uses that don't are either bugs or people doing copy-pasta or wanting to be extra safe.
else:
    from torch._C._distributed_c10d import Backend as C10dBackend
    if True:  # just to temporarily avoid reindentation
OK when looking at it without whitespace changes
Yeah it's just to avoid conflicting
if HAS_DISTRIBUTED:
    # Distributed components are available - import from C extension
    from torch._C._distributed_c10d import (  # Basic components; Additional distributed_c10d components
would it be possible to refactor this to

- put all the stubs into torch.distributed._C_stubs.py
- conditionalize this line like

src_module = "torch._C._distributed_c10d" if HAS_DISTRIBUTED else "torch.distributed._C_stubs"
from src_module import (
    ...
)
Well, we can't literally do this; `from src_module import (...)` with a variable isn't valid syntax. Hmm, what is idiomatic for this?
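For reference, the usual way to pick an import source by a name held in a variable is `importlib.import_module` plus explicit re-exports. This is a hedged sketch of the pattern, using a stand-in flag and stdlib modules (`json`) rather than the torch module names in the thread:

```python
import importlib

# Hypothetical flag standing in for the HAS_DISTRIBUTED check.
HAS_FAST_JSON = False

# `from <variable> import name` is not valid Python, but
# importlib.import_module accepts a module name stored in a variable.
_src_name = "ujson" if HAS_FAST_JSON else "json"
_src = importlib.import_module(_src_name)

# Re-export the names callers expect from this module.
dumps = _src.dumps
loads = _src.loads
```

Callers then just use `dumps`/`loads` without knowing which module was chosen; the trade-off, as noted below, is discoverability of where names actually come from.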
I propose we switch this into a wildcard import
From _C import *? Hmm not sure if there is really a problem with that other than discoverability of where things come from. Would be a lot cleaner!
Can't wildcard the C module as it doesn't set __all__ correctly
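To illustrate why a wrong or missing `__all__` breaks wildcard imports, here is a small self-contained demo with a synthetic stand-in module (`demo_mod` is made up for the demo; this is not the C extension):

```python
import sys
import types

# Build a fake module to stand in for the extension module.
mod = types.ModuleType("demo_mod")
mod.Backend = type("Backend", (), {})
mod.stray_helper = lambda: None  # public name, but not listed in __all__
mod.__all__ = ["Backend"]        # `import *` only takes what is listed here
sys.modules["demo_mod"] = mod

# Wildcard-import it into a fresh namespace.
ns = {}
exec("from demo_mod import *", ns)
imported = sorted(k for k in ns if not k.startswith("__"))
print(imported)  # only the __all__-listed name comes through
```

If `__all__` is absent or wrong on the real C module, `from ... import *` would either pull in unintended names or miss the ones you need, which is the objection above.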
@pytorchbot successfully started a revert job. Check the current status here.

…lt (#159889)"

This reverts commit 4ae57d4.

Reverted #159889 on behalf of https://github.com/jeanschmidt due to failing internal tests, probably typechecks. See D81588399 ([comment](#159889 (comment)))

@ezyang your PR has been successfully reverted.
@pytorchbot merge

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchbot merge

Merge failed. Reason: New commits were pushed while merging. Please rerun the merge command. Details for Dev Infra team: Raised by workflow job

Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot merge -f "diff train forgot to get the companion diff"

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchbot revert -c "parent commit has a bad bug"

❌ 🤖 pytorchbot command failed: Try

@pytorchbot revert -c weird -m "parent commit has a bad fbcode only change, need to revert this too to avoid problems"

@pytorchbot successfully started a revert job. Check the current status here.

Reverting PR 159889 failed. Reason: Command Details for Dev Infra team: Raised by workflow job
Stack from ghstack (oldest at bottom):
This PR is greatly simplified now that it is stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined because a backend is not enabled.
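The simplified strategy described here can be sketched as a try/except fallback: attempt the C-extension import and install a pure-Python stub when the backend wasn't built. This is a hedged illustration of the pattern, not the PR's actual code; the stub body is made up for the example.

```python
# Sketch of the fallback pattern: keep the module importable even
# when a particular backend (here NCCL) wasn't compiled in.
try:
    from torch._C._distributed_c10d import ProcessGroupNCCL  # real builds
except ImportError:
    class ProcessGroupNCCL:  # stub so importing this module never fails
        def __init__(self, *args, **kwargs):
            raise RuntimeError(
                "NCCL backend is not available in this build of PyTorch"
            )
```

Importers can reference `ProcessGroupNCCL` unconditionally; only actually constructing it on a build without the backend raises.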
Signed-off-by: Edward Yang ezyang@meta.com
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim