
Move cuda driver exit handling from helpers to threads #111061

Closed
wconstab wants to merge 3 commits into gh/wconstab/203/base from gh/wconstab/203/head

Conversation

@wconstab
Contributor

@wconstab wconstab commented Oct 11, 2023

Stack from ghstack (oldest at bottom):

The pattern here is that main may exit and kill the CUDA driver before
c10d watchdog-related threads have cleanly exited. If this happens,
c10d threads may still make CUDA API calls and raise an exception about
the CUDA driver being dead.

In the past we've patched a few helper functions that call into CUDA
to handle this driver-exit message specifically. Instead, since this
problem applies only to code paths in our background threads, we should
catch at that scope rather than worry about fine-grained catching at the
helper granularity. (And if a helper is used from the main thread, we
should NOT catch this exception: it's the application's fault.)


[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Oct 11, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111061

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 94ba860 with merge base 4d29b40:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Oct 11, 2023

ghstack-source-id: 07a9598
Pull Request resolved: #111061
@wconstab wconstab added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 11, 2023
Comment on lines +923 to +924
watchDogException_ = std::make_exception_ptr(std::runtime_error(exitMsg));
std::rethrow_exception(watchDogException_);
Contributor

I thought that std::make_exception_ptr is only needed/applicable if one wants to avoid downcasting type errors, i.e., what's wrong with good old throw?
Also, are there any docs on the lifetime of the exception_ptr object? (I would not expect it to remain a valid pointer outside the current stack-unwinding chain.)

Suggested change
watchDogException_ = std::make_exception_ptr(std::runtime_error(exitMsg));
std::rethrow_exception(watchDogException_);
throw watchDogException_ = std::runtime_error(exitMsg);

Contributor Author

tagging @rohan-varma @H-Huang @kwen2501 for any context about this. (We can probably address it in a later PR, but it would be good to follow up.)

Contributor

I don't understand that re-throw logic either... Since this is a refactor, I think it's OK that we do it for now. We can probably add a TODO, and I can follow up with the team later to see what the best way to throw exceptions is.

Contributor

@fduwjj fduwjj left a comment

Agree with @malfet that we need to revisit the exception-throwing logic; accepting to unblock for now.

@wconstab
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Oct 12, 2023
This reverts commit 314a502.

Changes since original PR:

Reland 1
 * rename torch.distributed.hooks to torch.distributed._hooks

Reland 2
 * make _hooks importable even if !distributed.is_available()
 * handle the intermittent cuda driver exit failure caused by new cuda api usage in the callback caller (see prev PR in stack)

(original PR #108815 desc copied below)

Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and is called inline from the PG creation code. This is fine since it happens during initialization and only a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads events from a C++ queue and dispatches them.

Queue notification is, oddly, done using a pipe; this is needed so Python can abort the thread on shutdown while keeping it a background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: #111072
Approved by: https://github.com/malfet
ghstack dependencies: #111061
pytorchmergebot pushed a commit that referenced this pull request Oct 13, 2023
C++-side callbacks allow advanced users to get access to the collective firehose.

It's worth mentioning and discussing the dire environment in which those
callbacks are invoked: from either the main thread or the watchdog thread,
and with a PTD lock held.
Pull Request resolved: #110307
Approved by: https://github.com/fduwjj
ghstack dependencies: #111061, #111072
@facebook-github-bot facebook-github-bot deleted the gh/wconstab/203/head branch October 15, 2023 14:25

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, release notes: distributed (c10d)


4 participants