
[train] ThreadRunner captures exceptions from nested threads #55756

Merged
justinvyu merged 9 commits into ray-project:master from TimothySeah:tseah/thread-runner-nested
Aug 26, 2025

Conversation

Contributor

@TimothySeah TimothySeah commented Aug 19, 2025

Summary

The ThreadRunner is an abstraction used by Ray Train to capture errors raised by the training function so that the Ray Train controller can poll for them. This PR extends the ThreadRunner to also capture errors raised by threads that the training function creates, such as async checkpoint upload threads (#55637).

Here is a rough sketch of the implementation:

  • ThreadRunner has a queue to receive exceptions from nested threads.
  • Nested threads catch their exceptions and put them on this queue.
  • ThreadRunner has a monitoring thread that reads from this exception queue and updates/joins the ThreadRunner accordingly.
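
The three pieces above can be illustrated with a minimal, standalone sketch. This is not the code from this PR: `SketchThreadRunner`, `capture_exceptions`, and every method name below are hypothetical, and the real `ThreadRunner` semantics for `is_running` and `join` are described in its docstrings, not here. The sketch assumes that the first item to arrive on the queue (an exception or a "target finished" sentinel) decides the outcome.

```python
import queue
import threading
from typing import Callable, Optional


class SketchThreadRunner:
    """Hypothetical, simplified stand-in; NOT Ray Train's actual ThreadRunner."""

    def __init__(self) -> None:
        # Queue shared with nested threads. None is a sentinel meaning
        # "the target function returned without raising".
        self.exception_queue: "queue.Queue[Optional[BaseException]]" = queue.Queue()
        self._error: Optional[BaseException] = None
        self._done = threading.Event()
        self._lock = threading.Lock()

    def run(self, target: Callable[[], None]) -> None:
        def _main() -> None:
            try:
                target()
            except BaseException as exc:  # errors from the target itself
                self.exception_queue.put(exc)
            else:
                self.exception_queue.put(None)

        def _monitor() -> None:
            # Block until the first exception or the sentinel arrives,
            # then record it and mark the runner as finished.
            item = self.exception_queue.get()
            with self._lock:
                self._error = item
            self._done.set()

        threading.Thread(target=_main, daemon=True).start()
        threading.Thread(target=_monitor, daemon=True).start()

    def is_running(self) -> bool:
        return not self._done.is_set()

    def get_error(self) -> Optional[BaseException]:
        with self._lock:
            return self._error

    def join(self, timeout: Optional[float] = None) -> bool:
        # Returns True once the monitor has seen an exception or the sentinel.
        return self._done.wait(timeout)


def capture_exceptions(
    fn: Callable[[], None],
    exc_queue: "queue.Queue[Optional[BaseException]]",
) -> Callable[[], None]:
    """Wrap a nested thread's target so its exceptions land on the queue."""

    def wrapped() -> None:
        try:
            fn()
        except BaseException as exc:
            exc_queue.put(exc)

    return wrapped
```

In this sketch, a training function would start its nested threads with `threading.Thread(target=capture_exceptions(upload, runner.exception_queue))`. A failure in `upload` then finishes the runner even while the training function itself is still blocked, which is the "as soon as possible" behavior the description aims for.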

Below is a diagram showing how all these components connect with each other:

[Image: Screenshot 2025-08-20 at 4 25 23 PM]

We chose to do it this way rather than making it the user's responsibility to handle everything (for example, ray.train.report could wait for the error and raise it on the next iteration) so we can return the result and/or exception as soon as possible.

Because of this, the semantics of is_running and join are more complicated than before; refer to the corresponding function docstrings for more information.

Testing

Unit tests

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah marked this pull request as ready for review August 20, 2025 01:14
@TimothySeah TimothySeah requested a review from a team as a code owner August 20, 2025 01:14
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Aug 20, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Contributor

@justinvyu justinvyu left a comment

Are there any issues if the user's train function finishes before the nested threads? For example: I launch some upload threads but never wait for them to finish at the end of the train fn. Do those nested threads get killed if the parent thread exits?

@TimothySeah
Contributor Author

Are there any issues if the user's train function finishes before the nested threads? For example: I launch some upload threads but never wait for them to finish at the end of the train fn. Do those nested threads get killed if the parent thread exits?

Good question. I decided to make nested thread cleanup the target function's responsibility. In particular, I implemented that as part of the async checkpoint upload PR (#55637). Here is the relevant text from that PR's description:

I changed run_train_fn to wrap the train_fn in train_fn_that_waits_for_threads because otherwise we could end up in the following situation:
  1) The train function exits with pending report threads, and the worker status is finished.
  2) The controller sees the finished status and shuts down the worker group.
  3) result.fit does not return all the reported checkpoints/metrics.
I decided to implement exception capture (#55756) in ThreadRunner but "wait for threads" as a wrapper function. In the former case, handling it in ThreadRunner is the cleanest way for a nested thread to cause the entire worker to exit early. In the latter case, the target function can wait for the threads that it creates without complicating the ThreadRunner abstraction.
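
A rough sketch of the "wait for threads" wrapper idea follows. It is only an illustration: the helper name `wrap_to_wait_for_threads` and the shared `pending_threads` list are assumptions made for this example, and the actual wrapper in #55637 may track its threads differently.

```python
import threading
from typing import Callable, List


def wrap_to_wait_for_threads(
    train_fn: Callable[[], None],
    pending_threads: List[threading.Thread],
) -> Callable[[], None]:
    """Hypothetical helper: return a function that runs train_fn and then
    joins every thread that train_fn registered in pending_threads."""

    def wrapped() -> None:
        train_fn()
        # Snapshot the list so we iterate over a stable copy, then block
        # until each registered thread has finished its work.
        for thread in list(pending_threads):
            thread.join()

    return wrapped
```

Because `wrapped` only returns after the joins complete, a runner that executes `wrapped` instead of `train_fn` would not report a finished status while report threads are still pending, which avoids the shutdown race listed above.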

Signed-off-by: Timothy Seah <tseah@anyscale.com>
…cs, remove exclude_frames test

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah added the go add ONLY when ready to merge, run all tests label Aug 26, 2025
Contributor

@justinvyu justinvyu left a comment

Thanks!

@justinvyu justinvyu enabled auto-merge (squash) August 26, 2025 22:16
@justinvyu justinvyu merged commit 4f95dfb into ray-project:master Aug 26, 2025
6 of 7 checks passed
tohtana pushed a commit to tohtana/ray that referenced this pull request Aug 29, 2025
…ject#55756)

The `ThreadRunner` is an abstraction used by Ray Train to capture errors
raised by the training function so they can be polled by the Ray Train
controller. This PR extends the `ThreadRunner` to also capture errors
raised by threads created by the training function e.g. async checkpoint
upload threads (ray-project#55637).

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
