[train] log errors raised by worker #52223
Signed-off-by: Matthew Deng <matt@anyscale.com>
error = execution_context.training_thread_runner.get_error()
if error:
    logger.error(f"Error in training function:\n{error}")
This prints the error stacktrace because of the UserExceptionWithTraceback wrapper that we add, right?
Oh yes you are right! Updated to the location you linked above and just printed out the stacktrace directly.
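For readers following the thread, here is a minimal sketch of what "printing the stacktrace directly" could look like with only the standard library. The helper names `format_error` and `log_training_error` are hypothetical and are not taken from the PR diff; the actual change may rely on the UserExceptionWithTraceback wrapper instead.

```python
import logging
import traceback

logger = logging.getLogger(__name__)


def format_error(error: BaseException) -> str:
    """Render an exception and its traceback as a single string.

    Hypothetical helper for illustration; not part of Ray Train.
    """
    return "".join(
        traceback.format_exception(type(error), error, error.__traceback__)
    )


def log_training_error(error: BaseException) -> None:
    """Log the full traceback rather than only str(error)."""
    logger.error("Error in training function:\n%s", format_error(error))
```

Formatting the traceback explicitly means the log shows where the failure occurred even if the exception object's string form only contains its message.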
@matthewdeng Does this now show the worker error 2x in the console? Once upon the thread raising and once in
Yeah, it shows up 2x for me on my laptop, but I believe it would not be shown in the main driver console logs if it is raised from a separate worker node.
What's up with the 1/0 and new line? This doesn't look like the other tracebacks above.
That is a good question... something to do with worker log propagation to console misordering? The output written to the log file is properly formatted. Let me test it some more.
I was able to isolate a reproduction of the reordering behavior. It seems there are two requirements:
Given this extremely small set of criteria, I'm going to consider this non-blocking at merge.
When an error happens in the user training function, it is raised in a separate thread and not currently logged. This propagates the error to be logged when the thread completes with an error.
---------
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Steve Han <stevehan2001@gmail.com>
Why are these changes needed?
When an error happens in the user training function, it is raised in a separate thread and not currently logged.
This propagates the error to be logged when the thread completes with an error.
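The pattern described above can be sketched roughly as follows. This is an illustration under assumptions, not Ray Train's implementation: `ThreadRunner` and its `run`, `join`, and `get_error` methods are hypothetical stand-ins for the training thread runner mentioned in the review thread.

```python
import logging
import threading
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class ThreadRunner:
    """Hypothetical runner that captures and logs errors from a worker thread."""

    def __init__(self) -> None:
        self._error: Optional[BaseException] = None
        self._thread: Optional[threading.Thread] = None

    def run(self, target: Callable[[], None]) -> None:
        def _wrapped() -> None:
            try:
                target()
            except Exception as e:
                # Keep the exception so the caller can inspect it later.
                self._error = e
                # Without this call, an exception in the thread would never
                # reach the application logs; exc_info attaches the traceback.
                logger.error("Error in training function:", exc_info=e)

        self._thread = threading.Thread(target=_wrapped, daemon=True)
        self._thread.start()

    def join(self) -> None:
        """Block until the worker thread has finished."""
        if self._thread is not None:
            self._thread.join()

    def get_error(self) -> Optional[BaseException]:
        """Return the captured exception, or None if the target succeeded."""
        return self._error
```

Logging inside the thread at the moment of failure is one design choice; the caller can still read the stored exception through `get_error()` and re-raise it, which is why the same error may appear twice in the console, as discussed in the review.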