Log error in setErrorIfNeeded#36019
Log error in setErrorIfNeeded#36019pritamdamania87 wants to merge 3 commits intogh/pritamdamania87/112/basefrom
Conversation
Once the autograd engine is finished with a GraphTask it would call `markCompleted` on the Future. This could trigger callbacks on the Future that could throw exceptions. If one of the callbacks did throw an exception, we would call setErrorIfNeeded, which would be no-op since the Future is already marked as completed. This would effectively mean we would be swallowing exceptions. To avoid this, we do the following: 1) Rethrow the exception in `mark_graph_task_completed`. 2) In `setErrorIfNeeded`, log the error if we are ignoring it. Differential Revision: [D20854806](https://our.internmc.facebook.com/intern/diff/D20854806/) [ghstack-poisoned]
💊 CircleCI build failures summary and remediationsAs of commit 04aff0b (more details on the Dr. CI page):
🕵️ 2 new failures recognized by patternsThe following build failures do not appear to be due to upstream breakages:
|
Once the autograd engine is finished with a GraphTask it would call `markCompleted` on the Future. This could trigger callbacks on the Future that could throw exceptions. If one of the callbacks did throw an exception, we would call setErrorIfNeeded, which would be no-op since the Future is already marked as completed. This would effectively mean we would be swallowing exceptions. To avoid this, we do the following: 1) Rethrow the exception in `mark_graph_task_completed`. 2) In `setErrorIfNeeded`, log the error if we are ignoring it. Differential Revision: [D20854806](https://our.internmc.facebook.com/intern/diff/D20854806/) [ghstack-poisoned]
Pull Request resolved: #36019 Once the autograd engine is finished with a GraphTask it would call `markCompleted` on the Future. This could trigger callbacks on the Future that could throw exceptions. If one of the callbacks did throw an exception, we would call setErrorIfNeeded, which would be no-op since the Future is already marked as completed. This would effectively mean we would be swallowing exceptions. To avoid this, we do the following: 1) Rethrow the exception in `mark_graph_task_completed`. 2) In `setErrorIfNeeded`, log the error if we are ignoring it. ghstack-source-id: 101539079 Differential Revision: [D20854806](https://our.internmc.facebook.com/intern/diff/D20854806/)
gqchen
left a comment
There was a problem hiding this comment.
Thanks for debugging this, Pritam!
This looks good to me. I will let other folks take a look as well.
| // propagate it further. If we don't do this, setErrorIfNeeded would just | ||
| // swallow the exception since the future is marked as completed. | ||
| if (graph_task->future_result_->completed()) { | ||
| std::rethrow_exception(std::current_exception()); |
There was a problem hiding this comment.
But what happens if this is running on a different thread? We're going to end up in a bad state here no?
There was a problem hiding this comment.
If it is running on a different thread, something upstream will catch the exception and LOG it. In case of at::launch(), this would be logged by the c10 ThreadPool. What sort of bad state are you worried about here?
There was a problem hiding this comment.
What I am wondering is if this could kill our GPU worker threads? Leading to queues without any worker thread working on it.
There was a problem hiding this comment.
Hmm, I didn't realize the worker threads don't handle exceptions appropriately. I guess we should handle exceptions in those threads since there are other things that could potentially throw apart from this. Although, for this PR I'll remove this line since the changes in setErrorIfNeeded would still log the error.
There was a problem hiding this comment.
If it happens then the error needs to be caught and rethrown in the main thread.
There was a problem hiding this comment.
future callback always happens in the same thread as the thread marking it complete?
There was a problem hiding this comment.
Currently it does, but I don't think it is guaranteed.
Once the autograd engine is finished with a GraphTask it would call `markCompleted` on the Future. This could trigger callbacks on the Future that could throw exceptions. If one of the callbacks did throw an exception, we would call setErrorIfNeeded, which would be no-op since the Future is already marked as completed. This would effectively mean we would be swallowing exceptions. To avoid this, we do the following: 1) Rethrow the exception in `mark_graph_task_completed`. 2) In `setErrorIfNeeded`, log the error if we are ignoring it. Differential Revision: [D20854806](https://our.internmc.facebook.com/intern/diff/D20854806/) [ghstack-poisoned]
Pull Request resolved: #36019 Once the autograd engine is finished with a GraphTask it would call `markCompleted` on the Future. This could trigger callbacks on the Future that could throw exceptions. If one of the callbacks did throw an exception, we would call setErrorIfNeeded, which would be no-op since the Future is already marked as completed. This would effectively mean we would be swallowing exceptions. To avoid this, we do the following: 1) Rethrow the exception in `mark_graph_task_completed`. 2) In `setErrorIfNeeded`, log the error if we are ignoring it. ghstack-source-id: 101607329 Differential Revision: [D20854806](https://our.internmc.facebook.com/intern/diff/D20854806/)
albanD
left a comment
There was a problem hiding this comment.
The latest version of the PR does not match the description right?
Is the final answer that swallowing the exception is fine?
What is the expected behavior for callbacks in Future when an exception is raised during the callback? Where should it be thrown? cc @jjlilley
In most cases, the exception will be thrown in the thread/stack that called markCompleted(). This is not ideal, something that needs to be further addressed (e.g. just like exceptions in rrefs are still a bit suboptimal, for instance), though we've effectively been addressing call-by-call until a systematic solution is better undestood. Re: this change - maybe an alternative would be be for setErrorIfNecessary to return true/false, whether the error was set? That way the caller can log if it makes sense, and we avoid spurious logs in the cases that don't want/need this. |
|
A bit more: the callbacks to torch::utils::Future probably simply should not throw inline, but propagate the error in some other matter. The design just doesn't currently handle that case well. That said, to handle it much better might start requiring complexity on order of folly::Future, which we don't necessarily want either - so it's not clear that we need urgent redesign either. |
albanD
left a comment
There was a problem hiding this comment.
Thanks for all these details!
Following @jjlilley comment. In general, the callbacks should not throw inline and should propagate errors properly. So it is fine to swallow these errors.
In that context that looks good to me.
Maybe just do a quick check to see the perf impact of adding this log.
|
Please edit the commit message prior to landing :) Long-term, it would be nice to just make setErrorIfNeeded just return a true/false, and delegate any logging to the caller. There are only 4 uses of this function now, but maybe there will be others in the future that might not welcome the logging so much... |
|
This pull request has been merged in 14ce500. |
Summary: Pull Request resolved: pytorch#36019 Once the autograd engine is finished with a GraphTask it would call `markCompleted` on the Future. This could trigger callbacks on the Future that could throw exceptions. If one of the callbacks did throw an exception, we would call setErrorIfNeeded, which would be no-op since the Future is already marked as completed. This would effectively mean we would be swallowing exceptions. To avoid this, we do the following: 1) Rethrow the exception in `mark_graph_task_completed`. 2) In `setErrorIfNeeded`, log the error if we are ignoring it. ghstack-source-id: 101607329 Test Plan: Verified appropriate logging. Differential Revision: D20854806 fbshipit-source-id: 76bdf403cfd6d92f730ca1483ad5dba355f83e58
Summary: Pull Request resolved: pytorch#36019 Once the autograd engine is finished with a GraphTask it would call `markCompleted` on the Future. This could trigger callbacks on the Future that could throw exceptions. If one of the callbacks did throw an exception, we would call setErrorIfNeeded, which would be no-op since the Future is already marked as completed. This would effectively mean we would be swallowing exceptions. To avoid this, we do the following: 1) Rethrow the exception in `mark_graph_task_completed`. 2) In `setErrorIfNeeded`, log the error if we are ignoring it. ghstack-source-id: 101607329 Test Plan: Verified appropriate logging. Differential Revision: D20854806 fbshipit-source-id: 76bdf403cfd6d92f730ca1483ad5dba355f83e58
Stack from ghstack:
Once the autograd engine is finished with a GraphTask it would call
markCompletedon the Future. This could trigger callbacks on the Future thatcould throw exceptions.
If one of the callbacks did throw an exception, we would call setErrorIfNeeded,
which would be no-op since the Future is already marked as completed. This
would effectively mean we would be swallowing exceptions. To avoid this in
setErrorIfNeeded, we log the error if we are ignoring it.Differential Revision: D20854806