Always build USE_DISTRIBUTED. #160449
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160449
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 92a138e with merge base 32911ff:
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
albanD left a comment:
What is this doing exactly?
Is this about Windows only?
@albanD the proximal reason I ended up here is that I want c10d ops to always be defined even if they don't have implementations, but these ops depend on ProcessGroup and other C++ structs that are torchbind'ed, so I can't actually stub them. And then I was like, why don't we just always build and link this in?
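(For illustration, a minimal Python sketch of the pattern being asked for: the op symbol is always defined, and only errors when invoked without a real implementation. The helper name `_make_stub` and the error text are my own assumptions, not the PR's actual code — the real ops are registered in C++ against torchbind'ed types, which is exactly why they can't be stubbed this simply.)

```python
# Minimal sketch (assumed, not the PR's code) of "always defined, stubbed
# implementation": the symbol exists in every build, and calling it without
# a built backend raises at call time rather than failing at import time.

def _make_stub(op_name: str):
    def stub(*args, **kwargs):
        raise NotImplementedError(
            f"{op_name} is defined but not implemented in this build; "
            "rebuild with the relevant distributed backend enabled."
        )
    return stub

# The op is always present as a callable; only invoking it fails.
allreduce = _make_stub("c10d::allreduce_")
```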
It is quite annoying: this makes PyTorch not compilable today with gcc 14 (which is the default on the Fedora used at Meta). Removing this flag means that the distributed team needs to sign up to treat breakage on newer compiler versions and dependency upgrades as UBN ("unbreak now"), so that it doesn't prevent PyTorch from being buildable.
Help me understand in more detail what's going on here. Is the problem c10d, or is the problem the NCCL backend for c10d? Because I still allow you to disable the NCCL build (and if you like, I can change the meaning of USE_DISTRIBUTED=0 to mean "compile c10d but don't compile any of the distributed backends").
Uhhh... yes?? We should??? Like, in what universe is distributed not being buildable not a release UBN? If we think NCCL is old and rickety, I hear we have something coming down the pipe... https://docs.google.com/document/d/1OwrW8QOLw1HJDzuBwTgvKCzPAQsDywcCzwCnwBWs-1g/edit?tab=t.0#heading=h.4zcurf5wxe5n
There were two separate incidents here in the last couple of months: one about tensorpipe and one about NCCL. The problem is that tensorpipe has been archived for a while, so Nikita had to unarchive it to fix it. On the NCCL side, any upgrade is a big, risky change that takes a lot of time to get in (after it is fixed upstream).
The problem here is that it's not a PyTorch release issue but the work Fedora machine I use, which runs a recent gcc. So until we upgrade our CD machines to that, it's not release blocking. But it's also a self-fulfilling prophecy, because we won't upgrade the CD machines if doing so breaks everything. So it will never be a release UBN, because we won't force that to happen. If your goal is to have consistent bindings, would an option be to move the bindings outside of the USE_DISTRIBUTED flag, and make every single one of them error out if called when the flag is not set?
This sounds like a problem with the distributed backends specifically. So would you be happy if there were just a way to disable all the backends? That way we consistently have bindings for everything, but they are all stubbed out (as the backends are not defined).
Oops, I stamped before reading all the comments. Well, I still think we should do this, but I didn't mean to steamroll Alban's concerns. Anyway, the plan is not to build nccl/gloo/tensorpipe by default, so things should be OK, right? (Are there any toolchain support issues in c10d proper?)
So, concretely, what I think I will do is make USE_DISTRIBUTED=0 disable all the backends instead of what it used to do.
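(A hedged sketch of what that would mean from the Python side, assuming the proposal lands as described: `torch.distributed` always imports, and the existing runtime availability queries report which backends were compiled in.)

```python
import torch.distributed as dist

# Under the proposed semantics, c10d bindings always exist, so this module
# imports even in a build configured with USE_DISTRIBUTED=0.
print(dist.is_available())        # proposal: True in every build
print(dist.is_nccl_available())   # False if the NCCL backend was not built
print(dist.is_gloo_available())   # False if the Gloo backend was not built
print(dist.is_mpi_available())    # False if the MPI backend was not built
```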
# enable debug asserts in serialization
export TORCH_SERIALIZATION_DEBUG=1

python -mpip install --no-input -r requirements.txt
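(For context, the exported variable enables debug assertions in serialization, per the script's own comment. A hedged usage sketch, assuming the variable is read when `torch` is imported:)

```python
import os

# Assumption: TORCH_SERIALIZATION_DEBUG must be set before torch is imported.
os.environ["TORCH_SERIALIZATION_DEBUG"] = "1"

import torch

# With the flag set, save/load paths run extra debug asserts (per the CI
# script's comment above).
torch.save(torch.ones(2), "/tmp/t.pt")
print(torch.load("/tmp/t.pt"))
```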
OK, sounds fine, I think either yours or mine will suffice for this PR
@ezyang your PR has been successfully reverted.
Starting merge as part of PR stack under #159889
This PR is greatly simplified now that it is stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: #159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
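(The stubbing described here can be sketched as an import-time fallback, assuming the design is "try the compiled symbol, substitute a raising stub if the backend wasn't built". The symbol name `_broadcast_coalesced` is used purely as an illustration, not necessarily one the PR actually stubs.)

```python
# Hedged sketch of "stub functions that may not be defined due to a backend
# not being enabled": keep the module importable, fail only on use.
try:
    from torch._C._distributed_c10d import _broadcast_coalesced
except ImportError:
    def _broadcast_coalesced(*args, **kwargs):  # illustrative stub
        raise RuntimeError(
            "this function requires a distributed backend that was not "
            "compiled into this PyTorch build"
        )
```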
…ibuted modules importable even when backend not built (pytorch#159889)

Summary:
Original: D81957844 and D81957923
Also, pytorch#162142 is patched in as well
#buildall

Test Plan: sandcastle and oss ci
Rollback Plan:
Reviewed By: H-Huang
Differential Revision: D82113620
…modules importable even when backend not built (#159889) (#162594)

Summary:
Original: D81957844 and D81957923
Also, #162142 is patched in as well
#buildall

Test Plan: sandcastle and oss ci
Rollback Plan:
Reviewed By: H-Huang
Pull Request resolved: #162594
Approved by: https://github.com/H-Huang, https://github.com/dcci
…ributed modules importable even when backend not built (#159889) (#162594)"

This reverts commit 6e8f17c.

Reverted #162594 on behalf of https://github.com/huydhn due to Reverted internally ([comment](#162594 (comment)))
…modules importable even when backend not built (#159889) (#162594)

Summary:
Original: D81957844 and D81957923
Also, #162142 is patched in as well
#buildall

Test Plan: sandcastle and oss ci
Rollback Plan:
Reviewed By: H-Huang
Pull Request resolved: #162594
Approved by: https://github.com/H-Huang, https://github.com/dcci
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: pytorch#160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
…rch#159889)

This PR is greatly simplified now that it is stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: pytorch#159889
Approved by: https://github.com/wconstab
ghstack dependencies: pytorch#160449
Stack from ghstack (oldest at bottom):
Signed-off-by: Edward Yang <ezyang@meta.com>
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim