[train] Abort reconciliation thread catches ray.util.state.get_actor exception #56600

Merged: justinvyu merged 2 commits into ray-project:master from TimothySeah:tseah/catch-abort-get-actor on Sep 17, 2025

@TimothySeah

Summary

@justinvyu observed that ray.util.state.get_actor, as called from the TrainStateActor, sometimes raises ray.util.state.exception.ServerUnavailable. The uncaught exception killed the abort reconciliation thread, so subsequent train runs were never marked as aborted.

This change catches those errors so the thread stays alive and tries to reconcile the train run on the next poll.
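The pattern described above can be sketched without Ray at all. This is a minimal illustration, not the code from the PR: `ServerUnavailable` is a local stand-in for the Ray exception, and `reconcile_once`, `poll_until_stopped`, `lookup_state`, and the `"DEAD"` state string are hypothetical names chosen for the example.

```python
import threading


class ServerUnavailable(Exception):
    """Hypothetical stand-in for ray.util.state.exception.ServerUnavailable."""


def reconcile_once(run_ids, lookup_state, aborted):
    """Run one poll: mark runs whose controller actor is reported dead."""
    for run_id in run_ids:
        if run_id in aborted:
            continue
        try:
            state = lookup_state(run_id)
        except ServerUnavailable:
            # Transient state API failure: leave the run untouched and
            # let the next poll retry instead of letting the thread die.
            continue
        if state == "DEAD":
            aborted.add(run_id)


def poll_until_stopped(run_ids, lookup_state, aborted, stop_event, interval_s=0.01):
    """Thread target: keep reconciling until stop_event is set."""
    while not stop_event.is_set():
        reconcile_once(run_ids, lookup_state, aborted)
        stop_event.wait(interval_s)
```

Because the `except` clause sits inside the per-run loop, one failed lookup only postpones that run's reconciliation; the thread and the remaining runs are unaffected.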

Testing

Unit test.
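The PR's actual test is not shown on this page. One plausible shape for such a test, assuming a mocked lookup that fails once and then reports the actor as gone, is sketched below. `mark_aborted_if_dead` is a hypothetical helper, `ServerUnavailable` is a local stand-in for the Ray exception, and treating a `None` result as "actor is dead" is an assumption made for the example.

```python
from unittest import mock


class ServerUnavailable(Exception):
    """Hypothetical stand-in for the state API's ServerUnavailable error."""


def mark_aborted_if_dead(run_id, get_actor, aborted):
    """Return True if the lookup succeeded, False if it should be retried."""
    try:
        actor = get_actor(run_id)
    except ServerUnavailable:
        return False
    if actor is None:  # assumption: no actor record means the run is dead
        aborted.add(run_id)
    return True


def test_lookup_failure_is_retried():
    # First call raises, second call returns None (actor not found).
    get_actor = mock.Mock(side_effect=[ServerUnavailable("state API down"), None])
    aborted = set()

    # First poll: the error is swallowed and nothing is marked aborted.
    assert mark_aborted_if_dead("run-1", get_actor, aborted) is False
    assert aborted == set()

    # Second poll: the lookup succeeds and the run is marked aborted.
    assert mark_aborted_if_dead("run-1", get_actor, aborted) is True
    assert aborted == {"run-1"}
    assert get_actor.call_count == 2
```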

…exception

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner September 16, 2025 23:13

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses an issue where the abort reconciliation thread could crash due to a ServerUnavailable exception from ray.util.state.get_actor. The fix, which involves catching this exception, is correct and prevents the thread from dying, ensuring subsequent runs can still be marked as aborted. The accompanying unit test is well-written and validates the fix. I've suggested a small improvement to make the exception handling more robust by also catching DataSourceUnavailable, which seems to be another transient error from the state API.

@ray-gardener ray-gardener bot added the "train" (Ray Train Related Issue) label Sep 17, 2025

@justinvyu justinvyu left a comment

Thanks for the quick fix!

Signed-off-by: Timothy Seah <tseah@anyscale.com>

@justinvyu justinvyu left a comment

@justinvyu justinvyu enabled auto-merge (squash) September 17, 2025 20:51
@github-actions github-actions bot added the "go" (add ONLY when ready to merge, run all tests) label Sep 17, 2025
@justinvyu justinvyu merged commit 416e365 into ray-project:master Sep 17, 2025
7 checks passed
zma2 pushed a commit to zma2/ray that referenced this pull request Sep 23, 2025
…exception (ray-project#56600)

`ray.util.state.get_actor` in the `TrainStateActor` sometimes raises
`ray.util.state.exception.ServerUnavailable`, killing
the abort reconciliation thread and causing subsequent train runs to not
get marked aborted.

This change catches those errors so the thread stays alive and tries to
reconcile the train run on the next poll.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Zhiqiang Ma <zhiqiang.ma@intel.com>
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
elliot-barn pushed a commit that referenced this pull request Sep 24, 2025
marcostephan pushed a commit to marcostephan/ray that referenced this pull request Sep 24, 2025
elliot-barn pushed a commit that referenced this pull request Sep 27, 2025
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025