
[train] after_worker_group_poll_status errors result in ControllerError #57869

Merged
matthewdeng merged 7 commits into ray-project:master from TimothySeah:tseah/controller-catch-callback-errors
Oct 22, 2025

Conversation

@TimothySeah
Contributor

@TimothySeah TimothySeah commented Oct 18, 2025

Summary

We observed that whenever `after_worker_group_poll_status` raised an exception, the Train Run would fail ungracefully and show up as `ABORTED` in the dashboard. This happened in the following situations:

  1. Different workers report remote checkpoints with different paths -> `(TrainController pid=46993) RuntimeError: The storage path of the checkpoints in the training results is not the same. This means the checkpoints are not consistent. Got a mix of the following checkpoint paths: {'/tmp/tmpl95kv7ax', '/tmp/tmp__8e6etk'}` -> `ABORTED` Train Run
  2. `ray.train.report({"loss": ...}, checkpoint=checkpoint)` in `train_func` -> `TypeError: Object of type 'ellipsis' is not JSON serializable` in `CheckpointManager._save_state` -> `ABORTED` Train Run
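The second failure mode is just a non-JSON-serializable value reaching the metrics dict. A minimal reproduction of the underlying serialization error, outside of Ray Train (the `metrics` dict stands in for what a user might pass to `ray.train.report`):

```python
import json

# A metrics dict where the user left a literal `...` (Ellipsis) as a value,
# as in a call like `ray.train.report({"loss": ...}, checkpoint=checkpoint)`.
metrics = {"loss": ...}

try:
    # CheckpointManager._save_state serializes metrics to JSON internally.
    json.dumps(metrics)
except TypeError as e:
    print(e)  # Object of type ellipsis is not JSON serializable
```

Before this PR, that `TypeError` propagated out of the callback and aborted the run.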

This PR catches these exceptions, wraps them in a `ControllerError`, and routes them through the `FailurePolicy`, ultimately resulting in an `ERRORED` Train Run. This is more intuitive because the failure originated from an error in the training workers (`The Train run failed due to an error in the training workers.` is the message associated with `RunStatus.ERRORED`).
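The catch-and-wrap flow can be sketched as follows. This is an illustrative simplification, not the actual Ray Train internals; `ControllerError` exists in Ray Train, but `run_callbacks_safely` and its signature are hypothetical stand-ins for the logic added to `WorkerGroup.poll_status`:

```python
class ControllerError(Exception):
    """Wraps an exception that occurred outside the user's training function."""

    def __init__(self, cause: Exception):
        super().__init__(f"Controller failure: {cause!r}")
        self.cause = cause


def run_callbacks_safely(callbacks, status):
    """Invoke after_worker_group_poll_status hooks, returning the first
    failure wrapped in a ControllerError instead of letting it propagate raw."""
    for cb in callbacks:
        try:
            cb(status)
        except Exception as e:
            # Surfaced to the FailurePolicy, which marks the run ERRORED.
            return ControllerError(e)
    return None
```

In the actual change, `WorkerGroup.poll_status` returns the caught exception and the `TrainController` treats that return value as a controller-level failure to feed into the `FailurePolicy`.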

I considered implementing a more general solution that caught all `WorkerGroupCallback` errors and resurfaced them as `ControllerError`s, but decided against it because:

  • Callbacks occur in many different places, and we might want to add custom try/catch logic in each case.
  • `after_worker_group_poll_status` is the only offender so far, and most of its errors stem from user mistakes; other callback errors could be legitimate bugs that should result in `ABORTED`.

Testing

Unit tests

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner October 18, 2025 01:09
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to gracefully handle exceptions from after_worker_group_poll_status callbacks by wrapping them in a ControllerError. The changes achieve this by modifying WorkerGroup.poll_status to catch and return exceptions from callbacks. The TrainController is updated to handle this new return value, correctly identifying these exceptions as controller-level failures. The changes are well-tested, with updates to existing tests and a new test case specifically for callback exceptions. I found one minor issue in a test case where an exception class was used instead of an instance to simulate a failure. Overall, this is a good change that improves error handling and robustness.

@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Oct 18, 2025
@matthewdeng matthewdeng enabled auto-merge (squash) October 21, 2025 22:56
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Oct 21, 2025
@matthewdeng matthewdeng merged commit f5abbb8 into ray-project:master Oct 22, 2025
8 checks passed
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Blaze-DSP pushed a commit to Blaze-DSP/ray that referenced this pull request Dec 18, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026