Skip to content

Make distributed modules importable even when backend not built#159889

Closed
ezyang wants to merge 33 commits intogh/ezyang/3122/basefrom
gh/ezyang/3122/head
Closed

Make distributed modules importable even when backend not built#159889
ezyang wants to merge 33 commits intogh/ezyang/3122/basefrom
gh/ezyang/3122/head

Conversation

@ezyang
Copy link
Contributor

@ezyang ezyang commented Aug 5, 2025

Stack from ghstack (oldest at bottom):

This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang ezyang@meta.com

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Aug 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159889

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 0227778 with merge base 32911ff (image):

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/h100-symm-mem ciflow/inductor oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Aug 5, 2025
ezyang added a commit that referenced this pull request Aug 5, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: 15a7a5a
Pull-Request: #159889
@ezyang
Copy link
Contributor Author

ezyang commented Aug 6, 2025

People were generally positive in chat so I'm going to push on this

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Aug 7, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: 9d066a4
Pull-Request: #159889
@ezyang ezyang changed the title [POC] Make distributed modules importable without distributed build Make distributed modules importable without distributed build Aug 7, 2025
@ezyang
Copy link
Contributor Author

ezyang commented Aug 7, 2025

The patch is now dramatically improved and you can review it IGNORING whitespace. However, to do this I had to remove indentation which will also make it merge conflictly.

The operator registrations are kind of sus, that's the last thing for me to review.

@ezyang ezyang requested review from XilunWu, wconstab and zpcore August 7, 2025 13:16
[ghstack-poisoned]
ezyang added a commit that referenced this pull request Aug 7, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: ea2f6df
Pull-Request: #159889
[ghstack-poisoned]
ezyang added a commit that referenced this pull request Aug 7, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: 5648df9
Pull-Request: #159889
…ild"


I did this PR with claude code. No big deal if we decide to reject it. The goal is to get ar.py (checked in, needs to be removed) to run on a CPU build of PyTorch. The general strategy is to generate stubs for all of the things from the c10d C extension when it's not available. There's a new module torch/distributed/_distributed_c10d.py that mediates all of the C extension imports so that we generate all the stubs in Python in this case.

If we agree that we are willing to do this, I will do a careful code review and clean up problems before subsequent review. The PR here is just to give you the flavor of the change.

Signed-off-by: Edward Yang <ezyangmeta.com>

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Aug 7, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: 2628bc2
Pull-Request: #159889
Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds pretty good!

@@ -0,0 +1,41 @@
# Copyright (c) Meta Platforms, Inc. and affiliates
# Owner(s): ["oncall: distributed"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be first line for it to work properly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if it needs to be, y'all should fix the linter lol

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ezyang-mac:pytorch-tmp3 ezyang$ git grep -n Owner\(s  | grep 2: | wc -l
     114


if not HAS_DISTRIBUTED:
# Define missing c10d operators
with torch.library._scoped_library("c10d", "DEF") as lib_def:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be a scoped lib but a regular library if we want to def to remain right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this very confusing, because I see uses of scoped library that seemingly seem to imply that it is not gc'ed at the end. And empirically these library defs were enough to stop errors

with torch.library._scoped_library("aten", "IMPL") as aten:
    aten.impl("_nested_get_jagged_dummy", _nested_get_jagged_dummy, "CPU")
    aten.impl("_nested_get_jagged_dummy", _nested_get_jagged_dummy, "CUDA")
    aten.impl("_nested_get_jagged_dummy", _nested_get_jagged_dummy, "Meta")

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the lib is definitely cleaned up... @zou3519 maybe remembers from the top of your head?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fortunately, in the new version of this PR, I don't need any of this!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scoped library does gc at the end. Any of the uses that don't are either bugs or people doing copy-pasta or wanting to be extra safe.


else:
from torch._C._distributed_c10d import Backend as C10dBackend
if True: # just to temporarily avoid reindentation
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok when looking without whitespaces

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's just to avoid conflicting


if HAS_DISTRIBUTED:
# Distributed components are available - import from C extension
from torch._C._distributed_c10d import ( # Basic components; Additional distributed_c10d components
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to refactor this to

  1. put all the stubs into torch.distributed._C_stubs.py
  2. conditionalize this line like
    src_module = "torch._C._distributed_c10d" if HAS_DISTRIBUTED else "torch.distributed._C_stubs"
  3. from src_module import (
    ...
    )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we can't literally do this. Hmm, what is idiomatic for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose we switch this into a wildcard import

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From _C import *? Hmm not sure if there is really a problem with that other than discoverability of where things come from. Would be a lot cleaner!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't wildcard the C module as it doesn't set __all__ correctly

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Aug 10, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: dc77fbd
Pull-Request: #159889
@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Sep 4, 2025
…lt (#159889)"

This reverts commit 4ae57d4.

Reverted #159889 on behalf of https://github.com/jeanschmidt due to Failing internal tests, probably typechecks. See D81588399 ([comment](#159889 (comment)))
@pytorchmergebot
Copy link
Collaborator

@ezyang your PR has been successfully reverted.

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Sep 4, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: 65ed8ca
Pull-Request: #159889
@ezyang
Copy link
Contributor Author

ezyang commented Sep 4, 2025

@pytorchbot merge

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Sep 4, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: 8cdf5e2
Pull-Request: #159889
@ezyang
Copy link
Contributor Author

ezyang commented Sep 4, 2025

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

[ghstack-poisoned]
@ezyang
Copy link
Contributor Author

ezyang commented Sep 4, 2025

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

Details for Dev Infra team Raised by workflow job

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

@ezyang your PR has been reverted as part of the stack under #160449.

@ezyang
Copy link
Contributor Author

ezyang commented Sep 5, 2025

@pytorchbot merge -f "diff train forgot to get the companion diff"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

@ezyang your PR has been reverted as part of the stack under #160449.

[ghstack-poisoned]
@ezyang
Copy link
Contributor Author

ezyang commented Sep 8, 2025

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@ezyang
Copy link
Contributor Author

ezyang commented Sep 10, 2025

@pytorchbot revert -c "parent commit has a bad bug"

@pytorch-bot
Copy link

pytorch-bot bot commented Sep 10, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: argument -c/--classification: invalid choice: 'parent commit has a bad bug' (choose from 'nosignal', 'ignoredsignal', 'landrace', 'weird', 'ghfirst')

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@ezyang
Copy link
Contributor Author

ezyang commented Sep 10, 2025

@pytorchbot revert -c weird -m "parent commit has a bad fbcode only change, need to revert this too to avoid problems"

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Copy link
Collaborator

Reverting PR 159889 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit a0d026688cd69583d5a4e0c6f3e5fda141a7f4a9 returned non-zero exit code 1

Auto-merging .ci/pytorch/macos-test.sh
CONFLICT (modify/delete): torch/distributed/_distributed_c10d.py deleted in parent of a0d026688cd (Make distributed modules importable even when backend not built (#159889)) and modified in HEAD.  Version HEAD of torch/distributed/_distributed_c10d.py left in tree.
Auto-merging torch/distributed/_functional_collectives.py
Auto-merging torch/distributed/device_mesh.py
error: could not revert a0d026688cd... Make distributed modules importable even when backend not built (#159889)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team Raised by workflow job

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-no-td Do not run TD on this PR ciflow/h100-symm-mem ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category Reverted

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants