Move cuda driver exit handling from helpers to threads #111061
wconstab wants to merge 3 commits into gh/wconstab/203/base from
Conversation
The pattern here is that main may exit and kill the CUDA driver before c10d watchdog-related threads have cleanly exited. If this happens, c10d threads may still make CUDA API calls and raise an exception about the CUDA driver being dead. In the past we've patched a few helper functions that call into CUDA to specifically handle this driver-exiting message. Instead, since we know this problem applies only to codepaths in our background threads, we should catch at that scope and not worry about fine-grained catching at the helper granularity. (And if a helper is used from the main thread, we should NOT catch this exception; it's the application's fault.) [ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111061
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 94ba860 with merge base 4d29b40. This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 07a9598
Pull Request resolved: #111061
watchDogException_ = std::make_exception_ptr(std::runtime_error(exitMsg));
std::rethrow_exception(watchDogException_);
I thought that std::make_exception_ptr is only needed/applicable if one wants to avoid downcasting type errors. I.e. what's wrong with good old throw?
Also, are there any docs on the lifetime of exception_ptr object? (I.e. I would not expect it to be a valid pointer outside of current stack unroll chain)
Suggested change:
- watchDogException_ = std::make_exception_ptr(std::runtime_error(exitMsg));
- std::rethrow_exception(watchDogException_);
+ throw watchDogException_ = std::runtime_error(exitMsg);
tagging @rohan-varma @H-Huang @kwen2501 for any context about this. (we can probably address it in a later PR but it would be good to follow up)
I don't understand the re-throw logic either... Since this is a refactor, I think it's ok that we keep it for now. We can add a TODO, and I can follow up with the team later on the best way to throw exceptions.
The pattern here is that main may exit and kill cuda driver before c10d watchdog related threads have cleanly exited. If this happens, c10d threads may still make CUDA api calls and raise an exception about the cuda driver being dead. In the past we've patched a few helper functions that call into cuda to specifically handle this driver exiting message. Instead, we know that this problem applies only to codepaths in our background threads, so we should catch at that scope and not worry about fine-grained catching at the helper granularity. (and if a helper is used from the main thread, we should NOT catch this exception- it's the application's fault) [ghstack-poisoned]
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This reverts commit 314a502. Changes since original PR:
Reland 1
* rename torch.distributed.hooks to torch.distributed._hooks
Reland 2
* make _hooks importable even if !distributed.is_available()
* handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)
(original PR #108815 desc copied below)
Expose a set of observability hooks into C10D so that users can detect collective failures both faster and more easily. The design is similar to NCCL desync debug in that it minimizes overhead by doing most of the work off the main thread. This PR introduces a new module torch.distributed.hooks that exposes the following methods:
* register_collective_start_hook
* register_collective_end_hook
* register_process_group_hook
The process group hook exposes PG creation on the member ranks and is called inline from the PG creation code. This is fine since it happens during initialization and a limited number of times. The collective start/end hooks are fired from a single background thread, which reads events from a C++ queue and dispatches them. Queue notification is oddly done using a pipe; this is needed so Python can abort the thread on shutdown and keep it as a background thread, which is not possible with more reasonable choices like a condvar. Pull Request resolved: #111072 Approved by: https://github.com/malfet ghstack dependencies: #111061
C++-side callbacks allow advanced users to get access to the collective firehose. It's worth mentioning and discussing the dire environment in which those callbacks are invoked: from either the main thread or the watchdog thread, and with a PTD lock held. Pull Request resolved: #110307 Approved by: https://github.com/fduwjj ghstack dependencies: #111061, #111072