Log error in setErrorIfNeeded by pritamdamania87 · Pull Request #36019 · pytorch/pytorch

pritamdamania87 · 2020-04-04T05:55:47Z

Stack from ghstack:

Log error in setErrorIfNeeded #36019 Appropriately handle exceptions in autograd engine.

Once the autograd engine is finished with a GraphTask it would call
markCompleted on the Future. This could trigger callbacks on the Future that
could throw exceptions.

If one of the callbacks did throw an exception, we would call setErrorIfNeeded,
which would be no-op since the Future is already marked as completed. This
would effectively mean we would be swallowing exceptions. To avoid this in setErrorIfNeeded, we log the error if we are ignoring it.

Differential Revision: D20854806

Once the autograd engine is finished with a GraphTask it would call `markCompleted` on the Future. This could trigger callbacks on the Future that could throw exceptions. If one of the callbacks did throw an exception, we would call setErrorIfNeeded, which would be no-op since the Future is already marked as completed. This would effectively mean we would be swallowing exceptions. To avoid this, we do the following: 1) Rethrow the exception in `mark_graph_task_completed`. 2) In `setErrorIfNeeded`, log the error if we are ignoring it. Differential Revision: [D20854806](https://our.internmc.facebook.com/intern/diff/D20854806/) [ghstack-poisoned]

dr-ci · 2020-04-04T05:56:40Z

💊 CircleCI build failures summary and remediations

As of commit 04aff0b (more details on the Dr. CI page):

2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakages:

pytorch_python_doc_push (1/2)

Step: "Doc Build and Push" (full log | pattern match details)

Apr 06 19:55:19 Could not import extension javasphinx (exception: cannot import name 'l_')

Apr 06 19:55:18 Saved activation image for GELU() at /var/lib/jenkins/workspace/docs/source/scripts/activation_images/GELU.png 
Apr 06 19:55:18 Saved activation image for Sigmoid() at /var/lib/jenkins/workspace/docs/source/scripts/activation_images/Sigmoid.png 
Apr 06 19:55:18 Saved activation image for Softplus(beta=1, threshold=20) at /var/lib/jenkins/workspace/docs/source/scripts/activation_images/Softplus.png 
Apr 06 19:55:18 Saved activation image for Softshrink(0.5) at /var/lib/jenkins/workspace/docs/source/scripts/activation_images/Softshrink.png 
Apr 06 19:55:18 Saved activation image for Softsign() at /var/lib/jenkins/workspace/docs/source/scripts/activation_images/Softsign.png 
Apr 06 19:55:18 Saved activation image for Tanh() at /var/lib/jenkins/workspace/docs/source/scripts/activation_images/Tanh.png 
Apr 06 19:55:18 Saved activation image for Tanhshrink() at /var/lib/jenkins/workspace/docs/source/scripts/activation_images/Tanhshrink.png 
Apr 06 19:55:18 Running Sphinx v3.0.0 
Apr 06 19:55:19  
Apr 06 19:55:19 Extension error: 
Apr 06 19:55:19 Could not import extension javasphinx (exception: cannot import name 'l_') 
Apr 06 19:55:19 make: *** [html] Error 2 
Apr 06 19:55:19 Makefile:38: recipe for target 'html' failed

pytorch_cpp_doc_push (2/2)

Step: "Doc Build and Push" (full log | pattern match details)

Apr 06 19:55:54 - [class]: torch::serialize::OutputArchive

Apr 06 19:55:54     - [function]: torch::python::bind_module 
Apr 06 19:55:54     - [function]: torch::python::bind_module 
Apr 06 19:55:54     - [function]: torch::python::init_bindings 
Apr 06 19:55:54     - [namespace]: torch::python::detail 
Apr 06 19:55:54       - [function]: torch::python::detail::bind_cpp_module_wrapper 
Apr 06 19:55:54       - [function]: torch::python::detail::py_object_to_device 
Apr 06 19:55:54       - [function]: torch::python::detail::py_object_to_dtype 
Apr 06 19:55:54       - [typedef]: torch::python::detail::PyModuleClass 
Apr 06 19:55:54   - [namespace]: torch::serialize 
Apr 06 19:55:54     - [class]: torch::serialize::InputArchive 
Apr 06 19:55:54     - [class]: torch::serialize::OutputArchive 
Apr 06 19:55:54   - [typedef]: torch::AutoGradMode 
Apr 06 19:55:54   - [typedef]: torch::Deleter 
Apr 06 19:55:54   - [typedef]: torch::Dtype 
Apr 06 19:55:54   - [typedef]: torch::NoGradGuard 
Apr 06 19:55:54   - [variable]: torch::kArea 
Apr 06 19:55:54   - [variable]: torch::kBatchMean 
Apr 06 19:55:54   - [variable]: torch::kBicubic 
Apr 06 19:55:54   - [variable]: torch::kBilinear 
Apr 06 19:55:54   - [variable]: torch::kBorder 
Apr 06 19:55:54   - [variable]: torch::kCircular

This comment was automatically generated by Dr. CI (expand for details).

Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

See how this bot performed.

This comment has been revised 5 times.

Once the autograd engine is finished with a GraphTask it would call `markCompleted` on the Future. This could trigger callbacks on the Future that could throw exceptions. If one of the callbacks did throw an exception, we would call setErrorIfNeeded, which would be no-op since the Future is already marked as completed. This would effectively mean we would be swallowing exceptions. To avoid this, we do the following: 1) Rethrow the exception in `mark_graph_task_completed`. 2) In `setErrorIfNeeded`, log the error if we are ignoring it. Differential Revision: [D20854806](https://our.internmc.facebook.com/intern/diff/D20854806/) [ghstack-poisoned]

Pull Request resolved: #36019 Once the autograd engine is finished with a GraphTask it would call `markCompleted` on the Future. This could trigger callbacks on the Future that could throw exceptions. If one of the callbacks did throw an exception, we would call setErrorIfNeeded, which would be no-op since the Future is already marked as completed. This would effectively mean we would be swallowing exceptions. To avoid this, we do the following: 1) Rethrow the exception in `mark_graph_task_completed`. 2) In `setErrorIfNeeded`, log the error if we are ignoring it. ghstack-source-id: 101539079 Differential Revision: [D20854806](https://our.internmc.facebook.com/intern/diff/D20854806/)

gqchen

Thanks for debugging this, Pritam!

This looks good to me. I will let other folks take a look as well.

albanD · 2020-04-06T14:44:32Z

+    // propagate it further. If we don't do this, setErrorIfNeeded would just
+    // swallow the exception since the future is marked as completed.
+    if (graph_task->future_result_->completed()) {
+      std::rethrow_exception(std::current_exception());


But what happens if this is running on a different thread? We're going to end up in a bad state here no?

If it is running on a different thread, something upstream will catch the exception and LOG it. In case of at::launch(), this would be logged by the c10 ThreadPool. What sort of bad state are you worried about here?

What I am wondering is if this could kill our GPU worker threads? Leading to queues without any worker thread working on it.

Hmm, I didn't realize the worker threads don't handle exceptions appropriately. I guess we should handle exceptions in those threads since there are other things that could potentially throw apart from this. Although, for this PR I'll remove this line since the changes in setErrorIfNeeded would still log the error.

If it happens then the error needs to be caught and rethrown in the main thread.

future callback always happens in the same thread as the thread marking it complete?

Currently it does, but I don't think it is guaranteed.

Once the autograd engine is finished with a GraphTask it would call `markCompleted` on the Future. This could trigger callbacks on the Future that could throw exceptions. If one of the callbacks did throw an exception, we would call setErrorIfNeeded, which would be no-op since the Future is already marked as completed. This would effectively mean we would be swallowing exceptions. To avoid this, we do the following: 1) Rethrow the exception in `mark_graph_task_completed`. 2) In `setErrorIfNeeded`, log the error if we are ignoring it. Differential Revision: [D20854806](https://our.internmc.facebook.com/intern/diff/D20854806/) [ghstack-poisoned]

Pull Request resolved: #36019 Once the autograd engine is finished with a GraphTask it would call `markCompleted` on the Future. This could trigger callbacks on the Future that could throw exceptions. If one of the callbacks did throw an exception, we would call setErrorIfNeeded, which would be no-op since the Future is already marked as completed. This would effectively mean we would be swallowing exceptions. To avoid this, we do the following: 1) Rethrow the exception in `mark_graph_task_completed`. 2) In `setErrorIfNeeded`, log the error if we are ignoring it. ghstack-source-id: 101607329 Differential Revision: [D20854806](https://our.internmc.facebook.com/intern/diff/D20854806/)

albanD

The latest version of the PR does not match the description right?

Is the final answer that swallowing the exception is fine?
What is the expected behavior for callbacks in Future when an exception is raised during the callback? Where should it be thrown? cc @jjlilley

jjlilley · 2020-04-08T16:17:34Z

What is the expected behavior for callbacks in Future when an exception is raised during the callback? Where should it be thrown? cc @jjlilley

In most cases, the exception will be thrown in the thread/stack that called markCompleted().

This is not ideal, something that needs to be further addressed (e.g. just like exceptions in rrefs are still a bit suboptimal, for instance), though we've effectively been addressing call-by-call until a systematic solution is better undestood.

Re: this change - maybe an alternative would be be for setErrorIfNecessary to return true/false, whether the error was set? That way the caller can log if it makes sense, and we avoid spurious logs in the cases that don't want/need this.

jjlilley · 2020-04-08T16:27:40Z

A bit more: the callbacks to torch::utils::Future probably simply should not throw inline, but propagate the error in some other matter.

The design just doesn't currently handle that case well. That said, to handle it much better might start requiring complexity on order of folly::Future, which we don't necessarily want either - so it's not clear that we need urgent redesign either.

albanD

Thanks for all these details!

Following @jjlilley comment. In general, the callbacks should not throw inline and should propagate errors properly. So it is fine to swallow these errors.

In that context that looks good to me.
Maybe just do a quick check to see the perf impact of adding this log.

jjlilley · 2020-04-08T18:49:41Z

Please edit the commit message prior to landing :)

Long-term, it would be nice to just make setErrorIfNeeded just return a true/false, and delegate any logging to the caller.

There are only 4 uses of this function now, but maybe there will be others in the future that might not welcome the logging so much...

facebook-github-bot · 2020-04-10T00:25:14Z

This pull request has been merged in 14ce500.

Summary: Pull Request resolved: pytorch#36019 Once the autograd engine is finished with a GraphTask it would call `markCompleted` on the Future. This could trigger callbacks on the Future that could throw exceptions. If one of the callbacks did throw an exception, we would call setErrorIfNeeded, which would be no-op since the Future is already marked as completed. This would effectively mean we would be swallowing exceptions. To avoid this, we do the following: 1) Rethrow the exception in `mark_graph_task_completed`. 2) In `setErrorIfNeeded`, log the error if we are ignoring it. ghstack-source-id: 101607329 Test Plan: Verified appropriate logging. Differential Revision: D20854806 fbshipit-source-id: 76bdf403cfd6d92f730ca1483ad5dba355f83e58

pritamdamania87 requested review from albanD and apaszke as code owners April 4, 2020 05:55

pritamdamania87 requested review from gqchen and wanchaol April 4, 2020 05:56

gqchen reviewed Apr 4, 2020

View reviewed changes

albanD reviewed Apr 6, 2020

View reviewed changes

pritamdamania87 requested a review from albanD April 7, 2020 01:36

gqchen approved these changes Apr 8, 2020

View reviewed changes

albanD requested changes Apr 8, 2020

View reviewed changes

albanD approved these changes Apr 8, 2020

View reviewed changes

pritamdamania87 changed the title ~~Appropriately handle exceptions in autograd engine.~~ Log error in setErrorIfNeeded Apr 9, 2020

facebook-github-bot closed this in 14ce500 Apr 9, 2020

facebook-github-bot added the merged label Apr 10, 2020

facebook-github-bot deleted the gh/pritamdamania87/112/head branch April 13, 2020 14:16

mruberry added the Merged label Oct 28, 2020

Conversation

pritamdamania87 commented Apr 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dr-ci Bot commented Apr 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

💊 CircleCI build failures summary and remediations

🕵️ 2 new failures recognized by patterns

pytorch_python_doc_push (1/2)

pytorch_cpp_doc_push (2/2)

Uh oh!

gqchen left a comment

Choose a reason for hiding this comment

Uh oh!

albanD Apr 6, 2020

Choose a reason for hiding this comment

Uh oh!

pritamdamania87 Apr 6, 2020

Choose a reason for hiding this comment

Uh oh!

albanD Apr 6, 2020

Choose a reason for hiding this comment

Uh oh!

pritamdamania87 Apr 6, 2020

Choose a reason for hiding this comment

Uh oh!

albanD Apr 6, 2020

Choose a reason for hiding this comment

Uh oh!

gqchen Apr 7, 2020

Choose a reason for hiding this comment

Uh oh!

albanD Apr 8, 2020

Choose a reason for hiding this comment

Uh oh!

albanD left a comment

Choose a reason for hiding this comment

Uh oh!

jjlilley commented Apr 8, 2020

Uh oh!

jjlilley commented Apr 8, 2020

Uh oh!

albanD left a comment

Choose a reason for hiding this comment

Uh oh!

jjlilley commented Apr 8, 2020

Uh oh!

facebook-github-bot commented Apr 10, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

pritamdamania87 commented Apr 4, 2020 •

edited

Loading

dr-ci Bot commented Apr 4, 2020 •

edited

Loading