[train] TrainController reraises AsyncioActorExit #59461
Merged
matthewdeng merged 2 commits into ray-project:master on Dec 16, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Code Review
This pull request correctly addresses an issue where AsyncioActorExit was being caught and re-raised as a ControllerError, preventing the actor from aborting immediately. By explicitly checking for AsyncioActorExit and re-raising it, the change ensures that the TrainController exits as intended when ray.actor.exit_actor() is called. The import of AsyncioActorExit is also correctly added. The changes are concise and directly resolve the described problem, improving the reliability of actor termination.
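The behavior described in this review can be illustrated with a minimal sketch. It does not import Ray: `AsyncioActorExit`, `ControllerError`, `poll_workers`, and the two `step_*` functions are stand-ins defined only for illustration, on the assumption that the real `AsyncioActorExit` is an `Exception` subclass and is therefore caught by a broad `except Exception` handler.

```python
import asyncio


class AsyncioActorExit(Exception):
    """Stand-in for Ray's exit signal (assumed to subclass Exception)."""


class ControllerError(Exception):
    """Stand-in for ray.train.ControllerError."""


async def poll_workers():
    # Stand-in for the abort fallback that calls ray.actor.exit_actor().
    raise AsyncioActorExit()


async def step_before_fix():
    try:
        await poll_workers()
    except Exception as e:
        # The exit signal is swallowed and reported as a controller failure.
        return ControllerError(e)


async def step_after_fix():
    try:
        await poll_workers()
    except AsyncioActorExit:
        # Let the exit signal propagate so the actor can terminate.
        raise
    except Exception as e:
        return ControllerError(e)


def run_and_describe(step):
    """Run one step and report whether the exit signal escaped it."""
    try:
        result = asyncio.run(step())
    except AsyncioActorExit:
        return "exit propagated"
    return f"handled as {type(result).__name__}"


before = run_and_describe(step_before_fix)
after = run_and_describe(step_after_fix)
print(before)
print(after)
```

In this sketch the only difference between the two step functions is the extra `except AsyncioActorExit: raise` clause, which mirrors the shape of the change under review rather than the actual controller code.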
matthewdeng reviewed on Dec 16, 2025
python/ray/train/v2/_internal/execution/controller/controller.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Signed-off-by: Timothy Seah <tseah@anyscale.com>
matthewdeng approved these changes on Dec 16, 2025
Comment on lines +435 to +436

    except AsyncioActorExit:
        raise
nit: Might be good to just add a comment in-line explaining the rationale.
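One way such an in-line comment might read is sketched below, together with a small demonstration of why the clause has to sit before the broad handler. The class and the `which_handler` helper are hypothetical stand-ins (no Ray import), and the handlers return a label instead of re-raising so the result is easy to inspect.

```python
class AsyncioActorExit(Exception):
    """Stand-in for Ray's AsyncioActorExit (hypothetical, no Ray import)."""


def which_handler(exc, specific_first):
    """Return the name of the except clause that ends up handling `exc`."""
    if specific_first:
        try:
            raise exc
        except AsyncioActorExit:
            # Rationale: exit_actor() signals an intentional actor exit, so it
            # must not be treated as a controller failure.
            return "specific"
        except Exception:
            return "broad"
    try:
        raise exc
    except Exception:
        return "broad"
    except AsyncioActorExit:  # never reached: the broad clause matches first
        return "specific"


exit_specific_first = which_handler(AsyncioActorExit(), specific_first=True)
exit_broad_first = which_handler(AsyncioActorExit(), specific_first=False)
other_error = which_handler(ValueError("boom"), specific_first=True)
```

Python tries `except` clauses in order, so the specific clause only takes effect when it precedes `except Exception`.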
cszhu pushed a commit that referenced this pull request on Dec 17, 2025
# Summary

@justinvyu noticed the following logs

```
(TrainController pid=95437) [State Transition] RUNNING -> ABORTED.
(TrainController pid=95437) [FailurePolicy] RAISE
(TrainController pid=95437)   Source: controller
(TrainController pid=95437)   Error count: 1 (max allowed: 0)
(TrainController pid=95437)
(TrainController pid=95437) Traceback (most recent call last):
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/v2/_internal/execution/controller/controller.py", line 433, in _step
(TrainController pid=95437)     worker_group_status: WorkerGroupPollStatus = await self._poll_workers()
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 493, in _resume_span
(TrainController pid=95437)     return await method(self, *_args, **_kwargs)
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/v2/_internal/execution/controller/controller.py", line 283, in _poll_workers
(TrainController pid=95437)     ray.actor.exit_actor()
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/actor.py", line 2492, in exit_actor
(TrainController pid=95437)     raise AsyncioActorExit()
(TrainController pid=95437) ray.train.ControllerError: Training failed due to controller error:
(TrainController pid=95437)
(TrainController pid=95437) [State Transition] ABORTED -> SHUTTING_DOWN.
```

The problem is that the fallback I implemented in #58287 didn't work because the `TrainController` caught the `AsyncioActorExit` raised by `ray.actor.exit_actor` and handled it as a `ControllerError`. However, what we actually want is to finish the abort asap by reraising the exception.

# Testing

Unit tests. I didn't add a new unit test for this specifically because the situation it covers happens flakily and would require a lot of contrived mocking to reproduce.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
zzchun pushed a commit to zzchun/ray that referenced this pull request on Dec 18, 2025
Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/ray that referenced this pull request on Dec 22, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request on Feb 25, 2026
--------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
# Summary

@justinvyu noticed logs in which the `AsyncioActorExit` raised by `ray.actor.exit_actor()` during a `RUNNING -> ABORTED` transition was reported as a `ray.train.ControllerError`.

The problem is that the fallback I implemented in #58287 didn't work because the `TrainController` caught the `AsyncioActorExit` raised by `ray.actor.exit_actor` and handled it as a `ControllerError`. However, what we actually want is to finish the abort as soon as possible by re-raising the exception.

# Testing

Unit tests. I didn't add a new unit test for this specifically because the situation it covers occurs only intermittently and would require a lot of contrived mocking to reproduce.
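For readers wondering what such mocking might look like, here is a rough sketch against a toy controller. `FakeController` and its methods are hypothetical and far simpler than the real `TrainController`, and `AsyncioActorExit` is a local stand-in, so this shows only the general shape of a test and not one that would run against Ray.

```python
import asyncio
from unittest.mock import AsyncMock


class AsyncioActorExit(Exception):
    """Stand-in for Ray's AsyncioActorExit (hypothetical, no Ray import)."""


class FakeController:
    """Hypothetical, heavily simplified stand-in for TrainController."""

    def __init__(self):
        self.errors = []

    async def _poll_workers(self):
        return "ok"

    async def _step(self):
        try:
            return await self._poll_workers()
        except AsyncioActorExit:
            # Re-raise so the actor exit is not recorded as a failure.
            raise
        except Exception as e:
            self.errors.append(e)
            return None


def step_reraises_exit():
    """True if the exit signal escapes _step without being recorded."""
    controller = FakeController()
    controller._poll_workers = AsyncMock(side_effect=AsyncioActorExit())
    try:
        asyncio.run(controller._step())
    except AsyncioActorExit:
        return controller.errors == []
    return False


def step_records_other_errors():
    """True if an ordinary error is still handled by the broad clause."""
    controller = FakeController()
    controller._poll_workers = AsyncMock(side_effect=RuntimeError("boom"))
    asyncio.run(controller._step())
    return len(controller.errors) == 1
```

The real controller has many more collaborators to stub out, which is presumably where the "contrived mocking" mentioned above would come from.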