[C10D] Introduce C++ side Collective Callbacks.#110307
[C10D] Introduce C++ side Collective Callbacks.#110307kumpera wants to merge 11 commits intogh/kumpera/65/basefrom
Conversation
C++ side callbacks allow for advance users to get access to the collective firehose. It's worth mentioning and discussing the dire environment that those callbacks are invoked. From either main thread of watchdog thread and with a PTD lock held. [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110307
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)As of commit 7465d43 with merge base 4d29b40 ( BROKEN TRUNK - The following job failed but were present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
C++ side callbacks allow for advance users to get access to the collective firehose. It's worth mentioning and discussing the dire environment that those callbacks are invoked. From either main thread of watchdog thread and with a PTD lock held. ghstack-source-id: 36454d8 Pull Request resolved: #110307
C++ side callbacks allow for advance users to get access to the collective firehose. It's worth mentioning and discussing the dire environment that those callbacks are invoked. From either main thread of watchdog thread and with a PTD lock held. [ghstack-poisoned]
C++ side callbacks allow for advance users to get access to the collective firehose. It's worth mentioning and discussing the dire environment that those callbacks are invoked. From either main thread of watchdog thread and with a PTD lock held. [ghstack-poisoned]
C++ side callbacks allow for advance users to get access to the collective firehose. It's worth mentioning and discussing the dire environment that those callbacks are invoked. From either main thread of watchdog thread and with a PTD lock held. ghstack-source-id: cf43562 Pull Request resolved: #110307
C++ side callbacks allow for advance users to get access to the collective firehose. It's worth mentioning and discussing the dire environment that those callbacks are invoked. From either main thread of watchdog thread and with a PTD lock held. [ghstack-poisoned]
fab3dc8 to
87a08c1
Compare
C++ side callbacks allow for advance users to get access to the collective firehose. It's worth mentioning and discussing the dire environment that those callbacks are invoked. From either main thread of watchdog thread and with a PTD lock held. ghstack-source-id: 3a1f366 Pull Request resolved: #110307
C++ side callbacks allow for advance users to get access to the collective firehose. It's worth mentioning and discussing the dire environment that those callbacks are invoked. From either main thread of watchdog thread and with a PTD lock held. [ghstack-poisoned]
C++ side callbacks allow for advance users to get access to the collective firehose. It's worth mentioning and discussing the dire environment that those callbacks are invoked. From either main thread of watchdog thread and with a PTD lock held. ghstack-source-id: 558887c Pull Request resolved: #110307
C++ side callbacks allow for advance users to get access to the collective firehose. It's worth mentioning and discussing the dire environment that those callbacks are invoked. From either main thread of watchdog thread and with a PTD lock held. [ghstack-poisoned]
C++ side callbacks allow for advance users to get access to the collective firehose. It's worth mentioning and discussing the dire environment that those callbacks are invoked. From either main thread of watchdog thread and with a PTD lock held. [ghstack-poisoned]
142e02d to
7465d43
Compare
| * infrastrucute. The recomended pattern is for callbacks to enqueue the events | ||
| * on some queue and have a separate thread process those events. If the | ||
| * callback deadlocks, it will hang the whole process and stop PyTorch from | ||
| * detecting failures. Do not call into CUDA. |
There was a problem hiding this comment.
Well, so true what if python thread hangs at GIL...
There was a problem hiding this comment.
this comment refers to the c++ callback. the c++ callback must not hang, or it hangs our watchdog and stops our timeout mechanism. That's why its pretty much only allowed to put events onto a queue.
the thread that reads from the queue (whether thats a c++ thread or a python thread) can hang if it wants to- that wouldn't affect our watchdog, it just affects whatever system is built on top of that callback.
| char m = 'x'; | ||
| write(sync_pipe, &m, 1); | ||
| std::unique_lock<std::mutex> lock(callbacks_lock); | ||
| for (auto& cb : callbacks_list) { |
There was a problem hiding this comment.
nit: can we just use the full name callback? the name cb could mean thousand things.. Not strong opinion but just my thought. But since the program is short, should be fine.
There was a problem hiding this comment.
yep. ill do it in a follow up PR. i have a list of cleanups from the earlier PR too.
fduwjj
left a comment
There was a problem hiding this comment.
LGTM. thanks for keep polishing it.
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
@pytorchbot revert |
|
❌ 🤖 pytorchbot command failed: Try |
|
@pytorchbot revert -m "this sits on top of another PR #111072 that needs to be reverted due to internal release testing failure / multisect blame" |
|
❌ 🤖 pytorchbot command failed: Try |
|
@pytorchbot revert -m "this sits on top of another PR #111072 that needs to be reverted due to internal release testing failure / multisect blame" -c nosignal |
|
@pytorchbot successfully started a revert job. Check the current status here. |
|
@kumpera your PR has been successfully reverted. |
This reverts commit 359336e. Reverted #110307 on behalf of https://github.com/wconstab due to this sits on top of another PR #111072 that needs to be reverted due to internal release testing failure / multisect blame ([comment](#110307 (comment)))
Summary: Breaking down pytorch#110307 into smaller pieces to try to land without revert. add most of the code but don't even call the event-creation logic Test Plan: OSS CI and internal tests Differential Revision: D50502961
Summary: Breaking down pytorch#110307 into smaller pieces to try to land without revert. This PR adds some hook functions but does not call them. Test Plan: OSS CI and internal tests pyper test: https://www.internalfb.com/mlhub/pipelines/runs/fblearner/493040128 Differential Revision: D50460640
Stack from ghstack (oldest at bottom):
C++ side callbacks allow for advance users to get
access to the collective firehose.
It's worth mentioning and discussing the dire environment that those
callbacks are invoked. From either main thread of watchdog thread and
with a PTD lock held.