[train] ThreadRunner captures exceptions from nested threads #55756
justinvyu merged 9 commits into ray-project:master
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Good question - I decided to make nested thread cleanup the target function's responsibility. In particular, I implemented that as part of the async checkpoint upload PR (#55637) - here is some relevant text from the PR description
…cs, remove exclude_frames test Signed-off-by: Timothy Seah <tseah@anyscale.com>
Summary
The `ThreadRunner` is an abstraction used by Ray Train to capture errors raised by the training function so they can be polled by the Ray Train controller. This PR extends the `ThreadRunner` to also capture errors raised by threads created by the training function, e.g. async checkpoint upload threads (#55637).

Here is a rough sketch of the implementation:
- `ThreadRunner` has a queue to receive exceptions from nested threads.
- `ThreadRunner` has a monitoring thread that reads from this exceptions queue and updates/joins the `ThreadRunner` accordingly.

Below is a diagram showing how all these components connect with each other:
[Diagram]

We chose to do it this way rather than making it the user's responsibility to handle everything (for example, `ray.train.report` could wait for the error and raise it on the next iteration) so we can return the result and/or exception as soon as possible.

Because of this, the semantics of `is_running` and `join` are more complicated than before - refer to the corresponding function docstrings for more information.

Testing
Unit tests
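
As an illustration of the behavior such tests exercise, here is a minimal, standard-library-only sketch of the queue-plus-monitoring-thread idea from the Summary. It is not the code in this PR: `SketchThreadRunner`, `report_nested_exception`, `get_error`, and `demo` are hypothetical names, and the sketch assumes the nested thread forwards its exception to the runner explicitly, which may differ from how the real `ThreadRunner` captures it.

```python
import queue
import threading
from typing import Callable, Optional


class SketchThreadRunner:
    """Hypothetical stand-in for illustration; not Ray Train's ThreadRunner."""

    def __init__(self) -> None:
        # Receives exceptions from the target and its nested threads,
        # or None as a "finished cleanly" sentinel.
        self._exc_queue: queue.Queue = queue.Queue()
        self._error: Optional[BaseException] = None
        self._lock = threading.Lock()
        self._monitor: Optional[threading.Thread] = None

    def report_nested_exception(self, exc: BaseException) -> None:
        # Assumption: nested threads call this themselves when they fail.
        self._exc_queue.put(exc)

    def run(self, target: Callable[[], None]) -> None:
        def _main() -> None:
            try:
                target()
            except BaseException as exc:
                self._exc_queue.put(exc)
            else:
                self._exc_queue.put(None)

        def _monitor() -> None:
            # Block until the first exception (or the sentinel) arrives.
            item = self._exc_queue.get()
            with self._lock:
                self._error = item

        self._monitor = threading.Thread(target=_monitor, daemon=True)
        self._monitor.start()
        threading.Thread(target=_main, daemon=True).start()

    def is_running(self) -> bool:
        # "Running" here means the monitor has not yet seen an outcome.
        return self._monitor is not None and self._monitor.is_alive()

    def join(self, timeout: Optional[float] = None) -> None:
        if self._monitor is not None:
            self._monitor.join(timeout)

    def get_error(self) -> Optional[BaseException]:
        with self._lock:
            return self._error


def demo() -> Optional[BaseException]:
    runner = SketchThreadRunner()
    release = threading.Event()

    def upload() -> None:
        # Simulated async checkpoint upload that fails.
        try:
            raise ValueError("upload failed")
        except BaseException as exc:
            runner.report_nested_exception(exc)

    def train_fn() -> None:
        nested = threading.Thread(target=upload)
        nested.start()
        nested.join()
        release.wait(timeout=5)  # training would otherwise keep going

    runner.run(train_fn)
    runner.join(timeout=5)  # returns once the monitor has seen the error
    error = runner.get_error()
    release.set()  # let the simulated training function exit
    return error
```

In this sketch the error surfaces as soon as the monitor reads it from the queue, even though `train_fn` is still blocked, which mirrors the stated goal of returning the exception as soon as possible. Stopping the training function and cleaning up nested threads are left out.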