
Always build USE_DISTRIBUTED. #160449

Closed
ezyang wants to merge 14 commits into gh/ezyang/3134/base from gh/ezyang/3134/head

Conversation

Contributor

@ezyang ezyang commented Aug 12, 2025

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Aug 12, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: 3feb172
Pull-Request: #160449
@pytorch-bot

pytorch-bot Bot commented Aug 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160449

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 92a138e with merge base 32911ff:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Collaborator

@albanD albanD left a comment


What is this doing exactly?
Is this about Windows only?

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Aug 12, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: e16a910
Pull-Request: #160449
[ghstack-poisoned]
@ezyang ezyang requested a review from a team as a code owner August 12, 2025 21:22
ezyang added a commit that referenced this pull request Aug 12, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: cbec225
Pull-Request: #160449
@ezyang
Contributor Author

ezyang commented Aug 12, 2025

@albanD the proximate reason I ended up here is that I want the c10d ops to always be defined even if they don't have implementations. But these ops depend on ProcessGroup and other C++ structs that are torchbind'ed, so I can't actually stub them, and then I was like: why don't we just always build and link this in?

@ezyang ezyang added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 13, 2025
@albanD
Collaborator

albanD commented Aug 13, 2025

It is quite annoying: this makes PyTorch not compilable today with gcc 14 (which is the default on the Fedora version used at Meta).
Now I will have to hunt down the right dependency of distributed that doesn't work and hope there is a working flag to disable it.

Removing this means that the distributed team needs to sign up to treat breakage on newer versions and dependency upgrades as UBN, so that it doesn't prevent PyTorch from being buildable.
Today, these things get fixed over 6 months to a year, depending on when we have bandwidth and when we can actually do upgrades of dependencies like nccl.

@ezyang
Contributor Author

ezyang commented Aug 14, 2025

It is quite annoying, this makes PyTorch not compilable today with gcc14 (which is the default on Fedora used at Meta).

Help me understand in more detail what's going on here. Is the problem c10d, or is the problem the NCCL backend for c10d? Because I still allow you to disable the NCCL build (and if you like, I can change the meaning of USE_DISTRIBUTED=0 to mean "compile c10d but don't compile any of the distributed backends").
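For reference, a sketch of the build knobs under discussion. USE_NCCL and USE_DISTRIBUTED are real PyTorch build environment variables read by setup.py; the USE_DISTRIBUTED=0 semantics shown in the second command are only what this thread proposes, not established behavior.

```shell
# Skip only the NCCL backend; c10d itself still builds.
USE_NCCL=0 python setup.py develop

# Proposed in this thread: c10d still builds and links,
# but no distributed backends are compiled in.
USE_DISTRIBUTED=0 python setup.py develop
```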

Removing this means that the distributed team needs to sign up to have ubn on newer version and dependency upgrades to not prevent PyTorch from being buildable.

Uhhh... yes?? We should??? Like in what universe is distributed not being buildable not a release UBN. If we think NCCL is old and rickety, I hear we have something coming down the pipe... https://docs.google.com/document/d/1OwrW8QOLw1HJDzuBwTgvKCzPAQsDywcCzwCnwBWs-1g/edit?tab=t.0#heading=h.4zcurf5wxe5n

@albanD
Collaborator

albanD commented Aug 14, 2025

help me understand in more detail what's going on here

There were two separate incidents here in the last couple months: one is about tensorpipe and one about nccl. The problem is that tensorpipe has been archived for a while, so Nikita had to unarchive it to fix it. On the NCCL side, any upgrade is a big risky change that takes a lot of time to get in (after it is fixed upstream).

Uhhh... yes?? We should??? Like in what universe is distributed not being buildable not a release UBN

The problem here is that it's not a PyTorch release but my work Fedora machine, which uses a recent gcc. So until we upgrade our CD machines to this, it's not release blocking. But it's also a self-fulfilling prophecy, because we won't upgrade the CD machines if it breaks everything. So it will never be a release UBN, because we won't force that to happen.
It is still a big challenge for devs doing new CPython enablement, running on non-devserver machines, non-Linux machines, etc.
These are not considered UBN, and so these issues are not worked on by the distributed team.

If your goal is to have consistent bindings, would an option be to move the bindings outside of the USE_DISTRIBUTED flag, and make every single one of them error out if called when the flag is not set?
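A minimal sketch of that pattern in plain Python (illustrative only, not PyTorch's actual registration code; `_HAS_DISTRIBUTED`, `_make_stub`, and the `all_reduce` placeholder are hypothetical names): the binding is always registered, and it raises at call time when the feature was compiled out.

```python
# Illustrative only: always expose the symbol, fail loudly at call time
# when the backing feature was compiled out. All names are hypothetical.
_HAS_DISTRIBUTED = False  # would mirror the USE_DISTRIBUTED build flag

def _make_stub(name):
    def stub(*args, **kwargs):
        raise RuntimeError(
            f"{name} requires PyTorch built with USE_DISTRIBUTED=1"
        )
    stub.__name__ = name
    return stub

def all_reduce(tensor):
    ...  # the real implementation would live here

# When the flag is off, every binding is replaced by an erroring stub,
# so the symbol always exists and importing code never breaks.
if not _HAS_DISTRIBUTED:
    all_reduce = _make_stub("all_reduce")
```

The upside of this shape is that `import` and attribute access always succeed; only an actual call fails, with a message naming the missing build flag.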

@ezyang
Contributor Author

ezyang commented Aug 15, 2025

There were two separate incidents here in the last couple months: one is about tensorpipe and one about nccl. The problem is that tensorpipe has been archived for a while, so Nikita had to unarchive it to fix it. On the NCCL side, any upgrade is a big risky change that takes a lot of time to get in (after it is fixed upstream).

This sounds like problems with the distributed backends specifically. So would you be happy if there were just a way to disable all the backends? This way we consistently have bindings for everything, but they are all stubbed out (as the backends are not defined).

[ghstack-poisoned]
Contributor

@wconstab wconstab left a comment


Nice!

@wconstab
Contributor

Oops, I stamped before reading all the comments. Well, I still think we should do this, but I didn't mean to steamroll alban's concerns. Anyway, the plan is not to build nccl/gloo/tensorpipe by default, so things should be OK, right? (Are there any toolchain support issues in c10d proper?)

@ezyang
Contributor Author

ezyang commented Aug 17, 2025

So, I think concretely what I will do is make USE_DISTRIBUTED=0 disable all the backends instead of what it used to do.
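Under that plan, the resulting availability semantics can be modeled in plain Python (illustrative only, not PyTorch source; `DistAvailability` and `backend_built` are hypothetical names): the c10d bindings always import, while per-backend probes reflect what was compiled in.

```python
# Illustrative model of the proposed semantics: the module always
# imports, is_available() is unconditionally True, and individual
# backends report whether they were compiled in. Hypothetical names.
class DistAvailability:
    def __init__(self, compiled_backends):
        self._backends = frozenset(compiled_backends)

    def is_available(self):
        # c10d is always built under the proposal, so always True.
        return True

    def backend_built(self, name):
        return name in self._backends

# A USE_DISTRIBUTED=0-style build: bindings exist, no backends.
dist = DistAvailability(compiled_backends=())
```

In this model, USE_DISTRIBUTED=0 no longer removes the bindings; it only empties the set of compiled backends.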

Comment thread .ci/pytorch/macos-test.sh
# enable debug asserts in serialization
export TORCH_SERIALIZATION_DEBUG=1

python -mpip install --no-input -r requirements.txt

Contributor Author


OK, sounds fine, I think either yours or mine will suffice for this PR

[ghstack-poisoned]
@ezyang ezyang requested a review from sraikund16 as a code owner August 18, 2025 17:21
@pytorchmergebot
Collaborator

@ezyang your PR has been successfully reverted.

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Sep 8, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: fcf3d7e
Pull-Request: #160449
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #159889

pytorchmergebot pushed a commit that referenced this pull request Sep 8, 2025
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: #159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
ezyang added a commit that referenced this pull request Sep 10, 2025
…lt (#159889)"

This reverts commit a0d0266.

Revert "Always build USE_DISTRIBUTED. (#160449)"

This reverts commit d80297a.


ghstack-source-id: 9057dcf
Pull-Request: #162568
@pytorch pytorch deleted a comment from pytorchmergebot Sep 10, 2025
@pytorch pytorch deleted a comment from pytorchmergebot Sep 10, 2025
pytorchmergebot pushed a commit that referenced this pull request Sep 10, 2025
…lt (#159889)" (#162568)

This reverts commit a0d0266.

Revert "Always build USE_DISTRIBUTED. (#160449)"

This reverts commit d80297a.

Pull Request resolved: #162568
Approved by: https://github.com/huydhn
ezyang added a commit to ezyang/pytorch that referenced this pull request Sep 10, 2025
…ibuted modules importable even when backend not built (pytorch#159889)

Summary:
Original: D81957844 and D81957923

Also, pytorch#162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Rollback Plan:

Reviewed By: H-Huang

Differential Revision: D82113620
pytorch-bot Bot pushed a commit that referenced this pull request Sep 11, 2025
…modules importable even when backend not built (#159889) (#162594)

Summary:
Pull Request resolved: #162594

Original: D81957844 and D81957923

Also, #162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Rollback Plan:

Reviewed By: dcci, H-Huang

Differential Revision: D82113620
pytorchmergebot pushed a commit that referenced this pull request Sep 12, 2025
…modules importable even when backend not built (#159889) (#162594)

Summary:
Original: D81957844 and D81957923

Also, #162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Rollback Plan:

Reviewed By: H-Huang

Pull Request resolved: #162594
Approved by: https://github.com/H-Huang, https://github.com/dcci
pytorchmergebot added a commit that referenced this pull request Sep 12, 2025
…ributed modules importable even when backend not built (#159889) (#162594)"

This reverts commit 6e8f17c.

Reverted #162594 on behalf of https://github.com/huydhn due to Reverted internally ([comment](#162594 (comment)))
pytorchmergebot pushed a commit that referenced this pull request Sep 12, 2025
…modules importable even when backend not built (#159889) (#162594)

Summary:
Original: D81957844 and D81957923

Also, #162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Rollback Plan:

Reviewed By: H-Huang

Pull Request resolved: #162594
Approved by: https://github.com/H-Huang, https://github.com/dcci
Camyll pushed a commit that referenced this pull request Sep 15, 2025
…lt (#159889)" (#162568)

This reverts commit a0d0266.

Revert "Always build USE_DISTRIBUTED. (#160449)"

This reverts commit d80297a.

Pull Request resolved: #162568
Approved by: https://github.com/huydhn
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
…rch#159889)

This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: pytorch#159889
Approved by: https://github.com/wconstab
ghstack dependencies: pytorch#160449

Labels

ci-no-td, ciflow/inductor, ciflow/trunk, Merged, oncall: distributed, release notes: inductor (aoti), Reverted, topic: not user facing


7 participants