
Always build USE_DISTRIBUTED. #160449

Closed
ezyang wants to merge 14 commits into gh/ezyang/3134/base from gh/ezyang/3134/head

Conversation

Contributor

@ezyang ezyang commented Aug 12, 2025

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Aug 12, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: 3feb172
Pull-Request: #160449
@pytorch-bot

pytorch-bot Bot commented Aug 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160449

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 92a138e with merge base 32911ff:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Collaborator

@albanD albanD left a comment


What is this doing exactly?
Is this about Windows only?

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Aug 12, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: e16a910
Pull-Request: #160449
[ghstack-poisoned]
@ezyang ezyang requested a review from a team as a code owner August 12, 2025 21:22
ezyang added a commit that referenced this pull request Aug 12, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: cbec225
Pull-Request: #160449
@ezyang
Contributor Author

ezyang commented Aug 12, 2025

@albanD the proximate reason I ended up here is that I want the c10d ops to always be defined even if they don't have implementations. But these ops depend on ProcessGroup and other C++ structs that are torchbind'ed, so I can't actually stub them, and then I was like: why don't we just always build and link this in?

@ezyang ezyang added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 13, 2025
@albanD
Collaborator

albanD commented Aug 13, 2025

It is quite annoying: this makes PyTorch not compilable today with gcc 14 (which is the default on the Fedora version used at Meta).
Now I will have to hunt down the right dependency of distributed that doesn't work and hope there is a working flag to disable it.

Removing this means that the distributed team needs to sign up to treat breakage on newer versions and dependency upgrades as UBN, so that it doesn't prevent PyTorch from being buildable.
Today, these things get fixed over 6 months to a year, depending on when we have bandwidth and when we can actually do upgrades of dependencies like nccl.

@ezyang
Contributor Author

ezyang commented Aug 14, 2025

It is quite annoying, this makes PyTorch not compilable today with gcc14 (which is the default on Fedora used at Meta).

Help me understand in more detail what's going on here. Is the problem c10d, or is the problem the NCCL backend for c10d? Because I still allow you to disable the NCCL build (and if you like, I can change the meaning of USE_DISTRIBUTED=0 to mean "compile c10d but don't compile any of the distributed backends").
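For reference, a sketch of the build knobs under discussion. USE_NCCL and USE_DISTRIBUTED are real PyTorch build environment variables read by setup.py; the USE_DISTRIBUTED=0 semantics shown in the second command are only what this thread proposes, not established behavior.

```shell
# Skip only the NCCL backend; c10d itself still builds.
USE_NCCL=0 python setup.py develop

# Proposed in this thread: c10d still builds and links,
# but no distributed backends are compiled in.
USE_DISTRIBUTED=0 python setup.py develop
```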

Removing this means that the distributed team needs to sign up to have ubn on newer version and dependency upgrades to not prevent PyTorch from being buildable.

Uhhh... yes?? We should??? Like in what universe is distributed not being buildable not a release UBN. If we think NCCL is old and rickety, I hear we have something coming down the pipe... https://docs.google.com/document/d/1OwrW8QOLw1HJDzuBwTgvKCzPAQsDywcCzwCnwBWs-1g/edit?tab=t.0#heading=h.4zcurf5wxe5n

@albanD
Collaborator

albanD commented Aug 14, 2025

help me understand in more detail what's going on here

There were two separate incidents here in the last couple months: one is about tensorpipe and one about nccl. The problem is that tensorpipe has been archived for a while, so Nikita had to unarchive it to fix it. On the NCCL side, any upgrade is a big risky change that takes a lot of time to get in (after it is fixed upstream).

Uhhh... yes?? We should??? Like in what universe is distributed not being buildable not a release UBN

The problem here is that it's not a PyTorch release but my work Fedora machine, which uses a recent gcc. So until we upgrade our CD machines to this, it's not release blocking. But it's also a self-fulfilling prophecy, because we won't upgrade the CD machines if it breaks everything. So it will never be a release UBN, because we won't force that to happen.
It is still a big challenge for devs doing new CPython enablement, running on non-devserver machines, non-Linux machines, etc.
These are not considered UBN, and so these issues are not worked on by the distributed team.

If your goal is to have consistent bindings, would an option be to move the bindings outside of the USE_DISTRIBUTED flag, and make every single one of them error out if called when the flag is not set?
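A minimal sketch of that pattern in plain Python (illustrative only, not PyTorch's actual registration code; `_HAS_DISTRIBUTED`, `_make_stub`, and the `all_reduce` placeholder are hypothetical names): the binding is always registered, and it raises at call time when the feature was compiled out.

```python
# Illustrative only: always expose the symbol, fail loudly at call time
# when the backing feature was compiled out. All names are hypothetical.
_HAS_DISTRIBUTED = False  # would mirror the USE_DISTRIBUTED build flag

def _make_stub(name):
    def stub(*args, **kwargs):
        raise RuntimeError(
            f"{name} requires PyTorch built with USE_DISTRIBUTED=1"
        )
    stub.__name__ = name
    return stub

def all_reduce(tensor):
    ...  # the real implementation would live here

# When the flag is off, every binding is replaced by an erroring stub,
# so the symbol always exists and importing code never breaks.
if not _HAS_DISTRIBUTED:
    all_reduce = _make_stub("all_reduce")
```

The upside of this shape is that `import` and attribute access always succeed; only an actual call fails, with a message naming the missing build flag.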

@ezyang
Contributor Author

ezyang commented Aug 15, 2025

There were two separate incidents here in the last couple months: one is about tensorpipe and one about nccl. The problem is that tensorpipe has been archived for a while, so Nikita had to unarchive it to fix it. On the NCCL side, any upgrade is a big risky change that takes a lot of time to get in (after it is fixed upstream).

This sounds like problems with the distributed backends specifically. So would you be happy if there were just a way to disable all the backends? This way we consistently have bindings for everything, but they are all stubbed out (as the backends are not defined).

[ghstack-poisoned]
Contributor

@wconstab wconstab left a comment


Nice!

@wconstab
Contributor

Oops, I stamped before reading all the comments. Well, I still think we should do this, but I didn't mean to steamroll alban's concerns. Anyway, the plan is not to build nccl/gloo/tensorpipe by default, so things should be OK, right? (Are there any toolchain support issues in c10d proper?)

@ezyang
Contributor Author

ezyang commented Aug 17, 2025

So, I think concretely what I will do is make USE_DISTRIBUTED=0 disable all the backends instead of what it used to do.
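Under that plan, the resulting availability semantics can be modeled in plain Python (illustrative only, not PyTorch source; `DistAvailability` and `backend_built` are hypothetical names): the c10d bindings always import, while per-backend probes reflect what was compiled in.

```python
# Illustrative model of the proposed semantics: the module always
# imports, is_available() is unconditionally True, and individual
# backends report whether they were compiled in. Hypothetical names.
class DistAvailability:
    def __init__(self, compiled_backends):
        self._backends = frozenset(compiled_backends)

    def is_available(self):
        # c10d is always built under the proposal, so always True.
        return True

    def backend_built(self, name):
        return name in self._backends

# A USE_DISTRIBUTED=0-style build: bindings exist, no backends.
dist = DistAvailability(compiled_backends=())
```

In this model, USE_DISTRIBUTED=0 no longer removes the bindings; it only empties the set of compiled backends.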

Comment thread .ci/pytorch/macos-test.sh
# enable debug asserts in serialization
export TORCH_SERIALIZATION_DEBUG=1

python -mpip install --no-input -r requirements.txt

Contributor Author


OK, sounds fine, I think either yours or mine will suffice for this PR

[ghstack-poisoned]
@ezyang ezyang requested a review from sraikund16 as a code owner August 18, 2025 17:21
@pytorchmergebot
Collaborator

@ezyang your PR has been successfully reverted.

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Sep 8, 2025
Signed-off-by: Edward Yang <ezyang@meta.com>
ghstack-source-id: fcf3d7e
Pull-Request: #160449
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #159889

pytorchmergebot pushed a commit that referenced this pull request Sep 8, 2025
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: #159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
ezyang added a commit that referenced this pull request Sep 10, 2025
…lt (#159889)"

This reverts commit a0d0266.

Revert "Always build USE_DISTRIBUTED. (#160449)"

This reverts commit d80297a.


ghstack-source-id: 9057dcf
Pull-Request: #162568
@pytorch pytorch deleted a comment from pytorchmergebot Sep 10, 2025
@pytorch pytorch deleted a comment from pytorchmergebot Sep 10, 2025
pytorchmergebot pushed a commit that referenced this pull request Sep 10, 2025
…lt (#159889)" (#162568)

This reverts commit a0d0266.

Revert "Always build USE_DISTRIBUTED. (#160449)"

This reverts commit d80297a.

Pull Request resolved: #162568
Approved by: https://github.com/huydhn
ezyang added a commit to ezyang/pytorch that referenced this pull request Sep 10, 2025
…ibuted modules importable even when backend not built (pytorch#159889)

Summary:
Original: D81957844 and D81957923

Also, pytorch#162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Rollback Plan:

Reviewed By: H-Huang

Differential Revision: D82113620
pytorch-bot Bot pushed a commit that referenced this pull request Sep 11, 2025
…modules importable even when backend not built (#159889) (#162594)

Summary:
Pull Request resolved: #162594

Original: D81957844 and D81957923

Also, #162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Rollback Plan:

Reviewed By: dcci, H-Huang

Differential Revision: D82113620
pytorchmergebot pushed a commit that referenced this pull request Sep 12, 2025
…modules importable even when backend not built (#159889) (#162594)

Summary:
Original: D81957844 and D81957923

Also, #162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Rollback Plan:

Reviewed By: H-Huang

Pull Request resolved: #162594
Approved by: https://github.com/H-Huang, https://github.com/dcci
pytorchmergebot added a commit that referenced this pull request Sep 12, 2025
…ributed modules importable even when backend not built (#159889) (#162594)"

This reverts commit 6e8f17c.

Reverted #162594 on behalf of https://github.com/huydhn due to Reverted internally ([comment](#162594 (comment)))
pytorchmergebot pushed a commit that referenced this pull request Sep 12, 2025
…modules importable even when backend not built (#159889) (#162594)

Summary:
Original: D81957844 and D81957923

Also, #162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Rollback Plan:

Reviewed By: H-Huang

Pull Request resolved: #162594
Approved by: https://github.com/H-Huang, https://github.com/dcci
Camyll pushed a commit that referenced this pull request Sep 15, 2025
…lt (#159889)" (#162568)

This reverts commit a0d0266.

Revert "Always build USE_DISTRIBUTED. (#160449)"

This reverts commit d80297a.

Pull Request resolved: #162568
Approved by: https://github.com/huydhn
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
…rch#159889)

This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: pytorch#159889
Approved by: https://github.com/wconstab
ghstack dependencies: pytorch#160449

Labels

ci-no-td, ciflow/inductor, ciflow/trunk, Merged, oncall: distributed, release notes: inductor (aoti), Reverted, topic: not user facing


7 participants