Skip to content

[C10D] Introduce C++ side Collective Callbacks.#110307

Closed
kumpera wants to merge 11 commits intogh/kumpera/65/basefrom
gh/kumpera/65/head
Closed

[C10D] Introduce C++ side Collective Callbacks.#110307
kumpera wants to merge 11 commits intogh/kumpera/65/basefrom
gh/kumpera/65/head

Conversation

@kumpera
Copy link
Contributor

@kumpera kumpera commented Sep 29, 2023

Stack from ghstack (oldest at bottom):

C++ side callbacks allow for advance users to get
access to the collective firehose.

It's worth mentioning and discussing the dire environment that those
callbacks are invoked. From either main thread of watchdog thread and
with a PTD lock held.

C++ side callbacks allow for advance users to get
access to the collective firehose.

It's worth mentioning and discussing the dire environment that those
callbacks are invoked. From either main thread of watchdog thread and
with a PTD lock held.

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Sep 29, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110307

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 7465d43 with merge base 4d29b40 (image):

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Sep 29, 2023
kumpera pushed a commit that referenced this pull request Sep 29, 2023
C++ side callbacks allow for advance users to get
access to the collective firehose.

It's worth mentioning and discussing the dire environment that those
callbacks are invoked. From either main thread of watchdog thread and
with a PTD lock held.

ghstack-source-id: 36454d8
Pull Request resolved: #110307
C++ side callbacks allow for advance users to get
access to the collective firehose.

It's worth mentioning and discussing the dire environment that those
callbacks are invoked. From either main thread of watchdog thread and
with a PTD lock held.

[ghstack-poisoned]
C++ side callbacks allow for advance users to get
access to the collective firehose.

It's worth mentioning and discussing the dire environment that those
callbacks are invoked. From either main thread of watchdog thread and
with a PTD lock held.

[ghstack-poisoned]
kumpera pushed a commit that referenced this pull request Oct 2, 2023
C++ side callbacks allow for advance users to get
access to the collective firehose.

It's worth mentioning and discussing the dire environment that those
callbacks are invoked. From either main thread of watchdog thread and
with a PTD lock held.

ghstack-source-id: cf43562
Pull Request resolved: #110307
C++ side callbacks allow for advance users to get
access to the collective firehose.

It's worth mentioning and discussing the dire environment that those
callbacks are invoked. From either main thread of watchdog thread and
with a PTD lock held.

[ghstack-poisoned]
@kumpera kumpera force-pushed the gh/kumpera/65/head branch from fab3dc8 to 87a08c1 Compare October 4, 2023 21:10
kumpera pushed a commit that referenced this pull request Oct 4, 2023
C++ side callbacks allow for advance users to get
access to the collective firehose.

It's worth mentioning and discussing the dire environment that those
callbacks are invoked. From either main thread of watchdog thread and
with a PTD lock held.

ghstack-source-id: 3a1f366
Pull Request resolved: #110307
@kumpera kumpera requested a review from wconstab October 5, 2023 16:02
C++ side callbacks allow for advance users to get
access to the collective firehose.

It's worth mentioning and discussing the dire environment that those
callbacks are invoked. From either main thread of watchdog thread and
with a PTD lock held.

[ghstack-poisoned]
kumpera pushed a commit that referenced this pull request Oct 5, 2023
C++ side callbacks allow for advance users to get
access to the collective firehose.

It's worth mentioning and discussing the dire environment that those
callbacks are invoked. From either main thread of watchdog thread and
with a PTD lock held.

ghstack-source-id: 558887c
Pull Request resolved: #110307
C++ side callbacks allow for advance users to get
access to the collective firehose.

It's worth mentioning and discussing the dire environment that those
callbacks are invoked. From either main thread of watchdog thread and
with a PTD lock held.

[ghstack-poisoned]
C++ side callbacks allow for advance users to get
access to the collective firehose.

It's worth mentioning and discussing the dire environment that those
callbacks are invoked. From either main thread of watchdog thread and
with a PTD lock held.

[ghstack-poisoned]
Comment on lines +32 to +35
* infrastrucute. The recomended pattern is for callbacks to enqueue the events
* on some queue and have a separate thread process those events. If the
* callback deadlocks, it will hang the whole process and stop PyTorch from
* detecting failures. Do not call into CUDA.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, so true what if python thread hangs at GIL...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment refers to the c++ callback. the c++ callback must not hang, or it hangs our watchdog and stops our timeout mechanism. That's why its pretty much only allowed to put events onto a queue.

the thread that reads from the queue (whether thats a c++ thread or a python thread) can hang if it wants to- that wouldn't affect our watchdog, it just affects whatever system is built on top of that callback.

char m = 'x';
write(sync_pipe, &m, 1);
std::unique_lock<std::mutex> lock(callbacks_lock);
for (auto& cb : callbacks_list) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we just use the full name callback? the name cb could mean thousand things.. Not strong opinion but just my thought. But since the program is short, should be fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep. ill do it in a follow up PR. i have a list of cleanups from the earlier PR too.

Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. thanks for keep polishing it.

@wconstab
Copy link
Contributor

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@wconstab
Copy link
Contributor

@pytorchbot revert

@pytorch-bot
Copy link

pytorch-bot bot commented Oct 16, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -m/--message, -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@wconstab
Copy link
Contributor

@pytorchbot revert -m "this sits on top of another PR #111072 that needs to be reverted due to internal release testing failure / multisect blame"

@pytorch-bot
Copy link

pytorch-bot bot commented Oct 16, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@wconstab
Copy link
Contributor

@pytorchbot revert -m "this sits on top of another PR #111072 that needs to be reverted due to internal release testing failure / multisect blame" -c nosignal

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Copy link
Collaborator

@kumpera your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Oct 16, 2023
This reverts commit 359336e.

Reverted #110307 on behalf of https://github.com/wconstab due to this sits on top of another PR #111072 that needs to be reverted due to internal release testing failure / multisect blame ([comment](#110307 (comment)))
wconstab added a commit to wconstab/pytorch that referenced this pull request Oct 20, 2023
Summary:
Breaking down pytorch#110307 into smaller
pieces to try to land without revert.

add most of the code but don't even call the event-creation logic

Test Plan: OSS CI and internal tests

Differential Revision: D50502961
wconstab added a commit to wconstab/pytorch that referenced this pull request Oct 20, 2023
Summary:

Breaking down pytorch#110307 into smaller
pieces to try to land without revert.

This PR adds some hook functions but does not call them.

Test Plan:
OSS CI and internal tests

pyper test: https://www.internalfb.com/mlhub/pipelines/runs/fblearner/493040128

Differential Revision: D50460640
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants