Skip to content

Log error in setErrorIfNeeded#36019

Closed
pritamdamania87 wants to merge 3 commits intogh/pritamdamania87/112/basefrom
gh/pritamdamania87/112/head
Closed

Log error in setErrorIfNeeded#36019
pritamdamania87 wants to merge 3 commits intogh/pritamdamania87/112/basefrom
gh/pritamdamania87/112/head

Conversation

@pritamdamania87
Copy link
Copy Markdown
Contributor

@pritamdamania87 pritamdamania87 commented Apr 4, 2020

Stack from ghstack:

Once the autograd engine is finished with a GraphTask it would call
markCompleted on the Future. This could trigger callbacks on the Future that
could throw exceptions.

If one of the callbacks did throw an exception, we would call setErrorIfNeeded,
which would be no-op since the Future is already marked as completed. This
would effectively mean we would be swallowing exceptions. To avoid this in setErrorIfNeeded, we log the error if we are ignoring it.

Differential Revision: D20854806

Once the autograd engine is finished with a GraphTask it would call
`markCompleted` on the Future. This could trigger callbacks on the Future that
could throw exceptions.

If one of the callbacks did throw an exception, we would call setErrorIfNeeded,
which would be no-op since the Future is already marked as completed. This
would effectively mean we would be swallowing exceptions. To avoid this, we do
the following:

1) Rethrow the exception in `mark_graph_task_completed`.
2) In `setErrorIfNeeded`, log the error if we are ignoring it.

Differential Revision: [D20854806](https://our.internmc.facebook.com/intern/diff/D20854806/)

[ghstack-poisoned]
@dr-ci
Copy link
Copy Markdown

dr-ci Bot commented Apr 4, 2020

💊 CircleCI build failures summary and remediations

As of commit 04aff0b (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_python_doc_push (1/2)

Step: "Doc Build and Push" (full log | pattern match details)

Apr 06 19:55:19 Could not import extension javasphinx (exception: cannot import name 'l_')
Apr 06 19:55:18 Saved activation image for GELU() at /var/lib/jenkins/workspace/docs/source/scripts/activation_images/GELU.png 
Apr 06 19:55:18 Saved activation image for Sigmoid() at /var/lib/jenkins/workspace/docs/source/scripts/activation_images/Sigmoid.png 
Apr 06 19:55:18 Saved activation image for Softplus(beta=1, threshold=20) at /var/lib/jenkins/workspace/docs/source/scripts/activation_images/Softplus.png 
Apr 06 19:55:18 Saved activation image for Softshrink(0.5) at /var/lib/jenkins/workspace/docs/source/scripts/activation_images/Softshrink.png 
Apr 06 19:55:18 Saved activation image for Softsign() at /var/lib/jenkins/workspace/docs/source/scripts/activation_images/Softsign.png 
Apr 06 19:55:18 Saved activation image for Tanh() at /var/lib/jenkins/workspace/docs/source/scripts/activation_images/Tanh.png 
Apr 06 19:55:18 Saved activation image for Tanhshrink() at /var/lib/jenkins/workspace/docs/source/scripts/activation_images/Tanhshrink.png 
Apr 06 19:55:18 Running Sphinx v3.0.0 
Apr 06 19:55:19  
Apr 06 19:55:19 Extension error: 
Apr 06 19:55:19 Could not import extension javasphinx (exception: cannot import name 'l_') 
Apr 06 19:55:19 make: *** [html] Error 2 
Apr 06 19:55:19 Makefile:38: recipe for target 'html' failed 

See CircleCI build pytorch_cpp_doc_push (2/2)

Step: "Doc Build and Push" (full log | pattern match details)

Apr 06 19:55:54 - [class]: torch::serialize::OutputArchive
Apr 06 19:55:54     - [function]: torch::python::bind_module 
Apr 06 19:55:54     - [function]: torch::python::bind_module 
Apr 06 19:55:54     - [function]: torch::python::init_bindings 
Apr 06 19:55:54     - [namespace]: torch::python::detail 
Apr 06 19:55:54       - [function]: torch::python::detail::bind_cpp_module_wrapper 
Apr 06 19:55:54       - [function]: torch::python::detail::py_object_to_device 
Apr 06 19:55:54       - [function]: torch::python::detail::py_object_to_dtype 
Apr 06 19:55:54       - [typedef]: torch::python::detail::PyModuleClass 
Apr 06 19:55:54   - [namespace]: torch::serialize 
Apr 06 19:55:54     - [class]: torch::serialize::InputArchive 
Apr 06 19:55:54     - [class]: torch::serialize::OutputArchive 
Apr 06 19:55:54   - [typedef]: torch::AutoGradMode 
Apr 06 19:55:54   - [typedef]: torch::Deleter 
Apr 06 19:55:54   - [typedef]: torch::Dtype 
Apr 06 19:55:54   - [typedef]: torch::NoGradGuard 
Apr 06 19:55:54   - [variable]: torch::kArea 
Apr 06 19:55:54   - [variable]: torch::kBatchMean 
Apr 06 19:55:54   - [variable]: torch::kBicubic 
Apr 06 19:55:54   - [variable]: torch::kBilinear 
Apr 06 19:55:54   - [variable]: torch::kBorder 
Apr 06 19:55:54   - [variable]: torch::kCircular 

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

See how this bot performed.

This comment has been revised 5 times.

Once the autograd engine is finished with a GraphTask it would call
`markCompleted` on the Future. This could trigger callbacks on the Future that
could throw exceptions.

If one of the callbacks did throw an exception, we would call setErrorIfNeeded,
which would be no-op since the Future is already marked as completed. This
would effectively mean we would be swallowing exceptions. To avoid this, we do
the following:

1) Rethrow the exception in `mark_graph_task_completed`.
2) In `setErrorIfNeeded`, log the error if we are ignoring it.

Differential Revision: [D20854806](https://our.internmc.facebook.com/intern/diff/D20854806/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this pull request Apr 4, 2020
Pull Request resolved: #36019

Once the autograd engine is finished with a GraphTask it would call
`markCompleted` on the Future. This could trigger callbacks on the Future that
could throw exceptions.

If one of the callbacks did throw an exception, we would call setErrorIfNeeded,
which would be no-op since the Future is already marked as completed. This
would effectively mean we would be swallowing exceptions. To avoid this, we do
the following:

1) Rethrow the exception in `mark_graph_task_completed`.
2) In `setErrorIfNeeded`, log the error if we are ignoring it.
ghstack-source-id: 101539079

Differential Revision: [D20854806](https://our.internmc.facebook.com/intern/diff/D20854806/)
Copy link
Copy Markdown
Member

@gqchen gqchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for debugging this, Pritam!

This looks good to me. I will let other folks take a look as well.

Comment thread torch/csrc/autograd/engine.cpp Outdated
// propagate it further. If we don't do this, setErrorIfNeeded would just
// swallow the exception since the future is marked as completed.
if (graph_task->future_result_->completed()) {
std::rethrow_exception(std::current_exception());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what happens if this is running on a different thread? We're going to end up in a bad state here no?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is running on a different thread, something upstream will catch the exception and LOG it. In case of at::launch(), this would be logged by the c10 ThreadPool. What sort of bad state are you worried about here?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I am wondering is if this could kill our GPU worker threads? Leading to queues without any worker thread working on it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I didn't realize the worker threads don't handle exceptions appropriately. I guess we should handle exceptions in those threads since there are other things that could potentially throw apart from this. Although, for this PR I'll remove this line since the changes in setErrorIfNeeded would still log the error.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it happens then the error needs to be caught and rethrown in the main thread.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

future callback always happens in the same thread as the thread marking it complete?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it does, but I don't think it is guaranteed.

Once the autograd engine is finished with a GraphTask it would call
`markCompleted` on the Future. This could trigger callbacks on the Future that
could throw exceptions.

If one of the callbacks did throw an exception, we would call setErrorIfNeeded,
which would be no-op since the Future is already marked as completed. This
would effectively mean we would be swallowing exceptions. To avoid this, we do
the following:

1) Rethrow the exception in `mark_graph_task_completed`.
2) In `setErrorIfNeeded`, log the error if we are ignoring it.

Differential Revision: [D20854806](https://our.internmc.facebook.com/intern/diff/D20854806/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this pull request Apr 6, 2020
Pull Request resolved: #36019

Once the autograd engine is finished with a GraphTask it would call
`markCompleted` on the Future. This could trigger callbacks on the Future that
could throw exceptions.

If one of the callbacks did throw an exception, we would call setErrorIfNeeded,
which would be no-op since the Future is already marked as completed. This
would effectively mean we would be swallowing exceptions. To avoid this, we do
the following:

1) Rethrow the exception in `mark_graph_task_completed`.
2) In `setErrorIfNeeded`, log the error if we are ignoring it.
ghstack-source-id: 101607329

Differential Revision: [D20854806](https://our.internmc.facebook.com/intern/diff/D20854806/)
@pritamdamania87 pritamdamania87 requested a review from albanD April 7, 2020 01:36
Copy link
Copy Markdown
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest version of the PR does not match the description right?

Is the final answer that swallowing the exception is fine?
What is the expected behavior for callbacks in Future when an exception is raised during the callback? Where should it be thrown? cc @jjlilley

@jjlilley
Copy link
Copy Markdown

jjlilley commented Apr 8, 2020

What is the expected behavior for callbacks in Future when an exception is raised during the callback? Where should it be thrown? cc @jjlilley

In most cases, the exception will be thrown in the thread/stack that called markCompleted().

This is not ideal, something that needs to be further addressed (e.g. just like exceptions in rrefs are still a bit suboptimal, for instance), though we've effectively been addressing call-by-call until a systematic solution is better undestood.

Re: this change - maybe an alternative would be be for setErrorIfNecessary to return true/false, whether the error was set? That way the caller can log if it makes sense, and we avoid spurious logs in the cases that don't want/need this.

@jjlilley
Copy link
Copy Markdown

jjlilley commented Apr 8, 2020

A bit more: the callbacks to torch::utils::Future probably simply should not throw inline, but propagate the error in some other matter.

The design just doesn't currently handle that case well. That said, to handle it much better might start requiring complexity on order of folly::Future, which we don't necessarily want either - so it's not clear that we need urgent redesign either.

Copy link
Copy Markdown
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all these details!

Following @jjlilley comment. In general, the callbacks should not throw inline and should propagate errors properly. So it is fine to swallow these errors.

In that context that looks good to me.
Maybe just do a quick check to see the perf impact of adding this log.

@jjlilley
Copy link
Copy Markdown

jjlilley commented Apr 8, 2020

Please edit the commit message prior to landing :)

Long-term, it would be nice to just make setErrorIfNeeded just return a true/false, and delegate any logging to the caller.

There are only 4 uses of this function now, but maybe there will be others in the future that might not welcome the logging so much...

@pritamdamania87 pritamdamania87 changed the title Appropriately handle exceptions in autograd engine. Log error in setErrorIfNeeded Apr 9, 2020
@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request has been merged in 14ce500.

@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/112/head branch April 13, 2020 14:16
ashishfarmer pushed a commit to ashishfarmer/pytorch that referenced this pull request Apr 13, 2020
Summary:
Pull Request resolved: pytorch#36019

Once the autograd engine is finished with a GraphTask it would call
`markCompleted` on the Future. This could trigger callbacks on the Future that
could throw exceptions.

If one of the callbacks did throw an exception, we would call setErrorIfNeeded,
which would be no-op since the Future is already marked as completed. This
would effectively mean we would be swallowing exceptions. To avoid this, we do
the following:

1) Rethrow the exception in `mark_graph_task_completed`.
2) In `setErrorIfNeeded`, log the error if we are ignoring it.
ghstack-source-id: 101607329

Test Plan: Verified appropriate logging.

Differential Revision: D20854806

fbshipit-source-id: 76bdf403cfd6d92f730ca1483ad5dba355f83e58
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Pull Request resolved: pytorch#36019

Once the autograd engine is finished with a GraphTask it would call
`markCompleted` on the Future. This could trigger callbacks on the Future that
could throw exceptions.

If one of the callbacks did throw an exception, we would call setErrorIfNeeded,
which would be no-op since the Future is already marked as completed. This
would effectively mean we would be swallowing exceptions. To avoid this, we do
the following:

1) Rethrow the exception in `mark_graph_task_completed`.
2) In `setErrorIfNeeded`, log the error if we are ignoring it.
ghstack-source-id: 101607329

Test Plan: Verified appropriate logging.

Differential Revision: D20854806

fbshipit-source-id: 76bdf403cfd6d92f730ca1483ad5dba355f83e58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants