[train] Abort reconciliation thread catches ray.util.state.get_actor exception #56600
Merged
justinvyu merged 2 commits into ray-project:master on Sep 17, 2025
…exception Signed-off-by: Timothy Seah <tseah@anyscale.com>
Contributor
Code Review
This pull request effectively addresses an issue where the abort reconciliation thread could crash due to a ServerUnavailable exception from ray.util.get_actor. The fix, which involves catching this exception, is correct and prevents the thread from dying, ensuring subsequent runs can still be marked as aborted. The accompanying unit test is well-written and validates the fix. I've suggested a small improvement to make the exception handling more robust by also catching DataSourceUnavailable, which seems to be another transient error from the state API.
justinvyu (Contributor) reviewed on Sep 17, 2025 and left a comment:
Thanks for the quick fix!
Signed-off-by: Timothy Seah <tseah@anyscale.com>
zma2 pushed a commit to zma2/ray that referenced this pull request on Sep 23, 2025
…exception (ray-project#56600)
`ray.util.get_actor` in the `TrainStateActor` sometimes raises `ray.util.state.exception.ServerUnavailable`, killing the abort reconciliation thread and causing subsequent train runs to not get marked aborted. This change catches those errors so the thread stays alive and tries to reconcile the train run on the next poll.
---------
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Zhiqiang Ma <zhiqiang.ma@intel.com>
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request on Sep 24, 2025 (same commit message; additional Signed-off-by: zac <zac@anyscale.com>)
elliot-barn pushed a commit that referenced this pull request on Sep 24, 2025 (same commit message; additional Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>)
marcostephan pushed a commit to marcostephan/ray that referenced this pull request on Sep 24, 2025 (same commit message; additional Signed-off-by: Marco Stephan <marco@magic.dev>)
elliot-barn pushed a commit that referenced this pull request on Sep 27, 2025 (same commit message; additional Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>)
dstrodtman pushed a commit that referenced this pull request on Oct 6, 2025 (same commit message; additional Signed-off-by: Douglas Strodtman <douglas@anyscale.com>)
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request on Oct 20, 2025 (same commit message; no additional sign-off)
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request on Nov 17, 2025 (same commit message; no additional sign-off)
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request on Dec 7, 2025 (same commit message; additional Signed-off-by: Future-Outlier <eric901201@gmail.com>)
Summary
@justinvyu observed that `ray.util.state.get_actor` in the `TrainStateActor` sometimes raises `ray.util.state.exception.ServerUnavailable`, killing the abort reconciliation thread, so subsequent train runs are never marked aborted.
This change catches those errors so the thread stays alive and tries to reconcile the train run on the next poll.
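For readers unfamiliar with the pattern, here is a minimal sketch of the idea, not the actual diff. It assumes a hypothetical `reconcile_once` helper, a hypothetical `runs` dict mapping run IDs to status strings, and a caller-supplied `is_controller_alive` lookup. `ServerUnavailable` is redefined locally as a stand-in for `ray.util.state.exception.ServerUnavailable` so the snippet runs without Ray installed.

```python
import threading
from typing import Callable, Dict


class ServerUnavailable(Exception):
    """Local stand-in (assumption) for ray.util.state.exception.ServerUnavailable."""


def reconcile_once(
    runs: Dict[str, str],
    is_controller_alive: Callable[[str], bool],
) -> None:
    """Mark RUNNING runs whose controller is gone as ABORTED.

    `runs` and `is_controller_alive` are hypothetical names for illustration.
    """
    for run_id, status in list(runs.items()):
        if status != "RUNNING":
            continue
        try:
            alive = is_controller_alive(run_id)
        except ServerUnavailable:
            # Transient state API failure: leave the run untouched so the
            # next poll retries it, instead of letting the thread die.
            continue
        if not alive:
            runs[run_id] = "ABORTED"


def poll_loop(
    runs: Dict[str, str],
    is_controller_alive: Callable[[str], bool],
    stop_event: threading.Event,
    interval_s: float = 30.0,
) -> None:
    """Run reconcile_once repeatedly until stop_event is set."""
    while not stop_event.is_set():
        reconcile_once(runs, is_controller_alive)
        stop_event.wait(interval_s)
```

The design point is that the `try`/`except` wraps only the lookup, so one failed poll skips a single run rather than ending the loop. The 30-second default interval is an arbitrary placeholder.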
Testing
Unit test.
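A test for this kind of fix might look roughly like the sketch below; it is not the test added in this PR. The `try_abort` helper and the `run` dict layout are hypothetical, and `ServerUnavailable` is again a local stand-in for the Ray exception. The mock raises on the first poll and reports a dead controller on the second.

```python
import unittest
from unittest import mock


class ServerUnavailable(Exception):
    """Local stand-in (assumption) for ray.util.state.exception.ServerUnavailable."""


def try_abort(run: dict, controller_is_alive) -> bool:
    """Return True if the run was marked aborted on this poll (hypothetical helper)."""
    try:
        if controller_is_alive(run["controller_actor_id"]):
            return False
    except ServerUnavailable:
        # Leave the run untouched; the next poll retries.
        return False
    run["status"] = "ABORTED"
    return True


class AbortReconciliationTest(unittest.TestCase):
    def test_transient_error_then_abort(self):
        run = {"controller_actor_id": "abc", "status": "RUNNING"}
        # First poll hits an unavailable server; second sees a dead controller.
        lookup = mock.Mock(side_effect=[ServerUnavailable("down"), False])

        self.assertFalse(try_abort(run, lookup))
        self.assertEqual(run["status"], "RUNNING")

        self.assertTrue(try_abort(run, lookup))
        self.assertEqual(run["status"], "ABORTED")
        self.assertEqual(lookup.call_count, 2)
```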