
Move cuda driver exit handling from helpers to threads #111061

Closed
wconstab wants to merge 3 commits into gh/wconstab/203/base from gh/wconstab/203/head

Conversation

@wconstab
Contributor

@wconstab wconstab commented Oct 11, 2023

Stack from ghstack (oldest at bottom):

The pattern here is that main may exit and kill the CUDA driver before
c10d watchdog-related threads have cleanly exited. If this happens,
c10d threads may still make CUDA API calls and raise an exception about
the CUDA driver being dead.

In the past we've patched a few helper functions that call into CUDA
to handle this driver-exit message specifically. Instead, since this
problem applies only to code paths in our background threads, we should
catch at that scope rather than worry about fine-grained catching at the
helper granularity. (And if a helper is used from the main thread, we
should NOT catch this exception: it's the application's fault.)


[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Oct 11, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111061

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 94ba860 with merge base 4d29b40:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Oct 11, 2023

ghstack-source-id: 07a9598
Pull Request resolved: #111061
@wconstab wconstab added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 11, 2023
Comment on lines +923 to +924
watchDogException_ = std::make_exception_ptr(std::runtime_error(exitMsg));
std::rethrow_exception(watchDogException_);
Contributor

I thought that std::make_exception_ptr is only needed/applicable if one wants to avoid downcasting type errors, i.e., what's wrong with good old throw?
Also, are there any docs on the lifetime of the exception_ptr object? (I would not expect it to remain a valid pointer outside the current stack-unwinding chain.)

Suggested change
watchDogException_ = std::make_exception_ptr(std::runtime_error(exitMsg));
std::rethrow_exception(watchDogException_);
throw watchDogException_ = std::runtime_error(exitMsg);

Contributor Author

tagging @rohan-varma @H-Huang @kwen2501 for any context about this. (We can probably address it in a later PR, but it would be good to follow up.)

Contributor

I don't understand that re-throw logic either... Since this is a refactor, I think it's OK that we do it for now. We can probably add a TODO, and I can follow up with the team later to see what the best way to throw exceptions is.

Contributor

@fduwjj fduwjj left a comment

Agree with @malfet that we need to revisit the exception-throwing logic; accepting to unblock for now.

@wconstab
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Oct 12, 2023
This reverts commit 314a502.

Changes since original PR:

Reland 1
 * rename torch.distributed.hooks to torch.distributed._hooks

Reland 2
 * make _hooks importable even if !distributed.is_available()
 * handle the intermittent cuda driver exit failure caused by new cuda api usage in the callback caller (see prev PR in stack)

(original PR #108815 desc copied below)

Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and is called inline from the PG creation code. This is fine since it happens during initialization and only a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads events from a C++ queue and dispatches them.

Queue notification is, oddly, done using a pipe; this is needed so Python can abort the thread on shutdown while keeping it a background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: #111072
Approved by: https://github.com/malfet
ghstack dependencies: #111061
pytorchmergebot pushed a commit that referenced this pull request Oct 13, 2023
C++-side callbacks allow advanced users to get access to the collective firehose.

It's worth mentioning and discussing the dire environment in which those
callbacks are invoked: from either the main thread or the watchdog thread,
and with a PTD lock held.
Pull Request resolved: #110307
Approved by: https://github.com/fduwjj
ghstack dependencies: #111061, #111072
@facebook-github-bot facebook-github-bot deleted the gh/wconstab/203/head branch October 15, 2023 14:25

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, release notes: distributed (c10d)


4 participants