
[train] TrainController reraises AsyncioActorExit #59461

Merged
matthewdeng merged 2 commits into ray-project:master from TimothySeah:tseah/do-not-catch-abort on Dec 16, 2025


@TimothySeah TimothySeah commented Dec 16, 2025

Summary

@justinvyu noticed the following logs

(TrainController pid=95437) [State Transition] RUNNING -> ABORTED.
(TrainController pid=95437) [FailurePolicy] RAISE
(TrainController pid=95437)   Source: controller
(TrainController pid=95437)   Error count: 1 (max allowed: 0)
(TrainController pid=95437) 
(TrainController pid=95437) Traceback (most recent call last):
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/v2/_internal/execution/controller/controller.py", line 433, in _step
(TrainController pid=95437)     worker_group_status: WorkerGroupPollStatus = await self._poll_workers()
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 493, in _resume_span
(TrainController pid=95437)     return await method(self, *_args, **_kwargs)
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/v2/_internal/execution/controller/controller.py", line 283, in _poll_workers
(TrainController pid=95437)     ray.actor.exit_actor()
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/actor.py", line 2492, in exit_actor
(TrainController pid=95437)     raise AsyncioActorExit()
(TrainController pid=95437) ray.train.ControllerError: Training failed due to controller error:
(TrainController pid=95437) 
(TrainController pid=95437) [State Transition] ABORTED -> SHUTTING_DOWN.

The problem is that the fallback I implemented in #58287 didn't work: the TrainController caught the AsyncioActorExit raised by ray.actor.exit_actor and handled it as a ControllerError. What we actually want is to re-raise the exception so that the abort finishes as soon as possible.
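
To illustrate why the order of the except clauses matters, here is a minimal, self-contained sketch. It assumes the exit signal derives from Exception, which the traceback above suggests, since a broad handler caught it. The classes and functions below are hypothetical stand-ins defined locally for illustration; they are not Ray's actual controller code.

```python
import asyncio


class AsyncioActorExit(Exception):
    """Local stand-in for Ray's exit signal (assumption: not imported from ray)."""


class ControllerError(Exception):
    """Local stand-in for the wrapped controller failure."""


async def poll_workers(abort: bool) -> str:
    # Hypothetical poll: an abort request surfaces as the exit signal.
    if abort:
        raise AsyncioActorExit()
    return "ok"


async def step_before(abort: bool) -> str:
    try:
        return await poll_workers(abort)
    except Exception as e:
        # The broad handler also swallows the exit signal and reports
        # it as a controller failure.
        raise ControllerError("Training failed due to controller error") from e


async def step_after(abort: bool) -> str:
    try:
        return await poll_workers(abort)
    except AsyncioActorExit:
        # Let the exit signal propagate so the abort can finish right away.
        raise
    except Exception as e:
        raise ControllerError("Training failed due to controller error") from e


def outcome(step, abort: bool) -> str:
    """Run one step and return its result, or the name of the escaping exception."""
    try:
        return asyncio.run(step(abort))
    except Exception as e:
        return type(e).__name__
```

In this sketch, `outcome(step_before, True)` reports a `ControllerError`, while `outcome(step_after, True)` lets `AsyncioActorExit` through because the narrower clause is listed before the broad one.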

Testing

Unit tests. I didn't add a new unit test specifically for this change because the situation it covers occurs only intermittently and would require a lot of contrived mocking to reproduce.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah marked this pull request as ready for review December 16, 2025 02:28
@TimothySeah TimothySeah requested a review from a team as a code owner December 16, 2025 02:28

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly addresses an issue where AsyncioActorExit was being caught and re-raised as a ControllerError, preventing the actor from aborting immediately. By explicitly checking for AsyncioActorExit and re-raising it, the change ensures that the TrainController exits as intended when ray.actor.exit_actor() is called. The import of AsyncioActorExit is also correctly added. The changes are concise and directly resolve the described problem, improving the reliability of actor termination.

@ray-gardener ray-gardener bot added the "train" (Ray Train Related Issue) label Dec 16, 2025
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah added the "go" (add ONLY when ready to merge, run all tests) label Dec 16, 2025
@matthewdeng matthewdeng enabled auto-merge (squash) December 16, 2025 19:07
Comment on lines +435 to +436
    except AsyncioActorExit:
        raise
nit: Might be good to just add a comment in-line explaining the rationale.

@matthewdeng matthewdeng merged commit 0820e69 into ray-project:master Dec 16, 2025
7 of 8 checks passed
cszhu pushed a commit that referenced this pull request Dec 17, 2025
zzchun pushed a commit to zzchun/ray that referenced this pull request Dec 18, 2025
Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/ray that referenced this pull request Dec 22, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Signed-off-by: peterxcli <peterxcli@gmail.com>