
[train] after_worker_group_poll_status errors result in ControllerError #57869

Merged
matthewdeng merged 7 commits into ray-project:master from TimothySeah:tseah/controller-catch-callback-errors
Oct 22, 2025

Conversation

@TimothySeah
Contributor

@TimothySeah TimothySeah commented Oct 18, 2025

Summary

We observed that whenever `after_worker_group_poll_status` raised an exception, the Train Run would fail ungracefully and show up as `ABORTED` in the dashboard. This happened in the following situations:

  1. Different workers report remote checkpoints with different paths -> `(TrainController pid=46993) RuntimeError: The storage path of the checkpoints in the training results is not the same. This means the checkpoints are not consistent. Got a mix of the following checkpoint paths: {'/tmp/tmpl95kv7ax', '/tmp/tmp__8e6etk'}` -> `ABORTED` Train Run
  2. `ray.train.report({"loss": ...}, checkpoint=checkpoint)` in `train_func` -> `TypeError: Object of type 'ellipsis' is not JSON serializable` in `CheckpointManager._save_state` -> `ABORTED` Train Run
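The second failure mode is just a non-JSON-serializable value reaching the metrics dict. A minimal reproduction of the underlying serialization error, outside of Ray Train (the `metrics` dict stands in for what a user might pass to `ray.train.report`):

```python
import json

# A metrics dict where the user left a literal `...` (Ellipsis) as a value,
# as in a call like `ray.train.report({"loss": ...}, checkpoint=checkpoint)`.
metrics = {"loss": ...}

try:
    # CheckpointManager._save_state serializes metrics to JSON internally.
    json.dumps(metrics)
except TypeError as e:
    print(e)  # Object of type ellipsis is not JSON serializable
```

Before this PR, that `TypeError` propagated out of the callback and aborted the run.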

This PR catches these exceptions, wraps them in a `ControllerError`, and routes them through the `FailurePolicy`, ultimately resulting in an `ERRORED` Train Run. This is more intuitive because the failure originated from an error in the training workers (`The Train run failed due to an error in the training workers.` is the message associated with `RunStatus.ERRORED`).
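The catch-and-wrap flow can be sketched as follows. This is an illustrative simplification, not the actual Ray Train internals; `ControllerError` exists in Ray Train, but `run_callbacks_safely` and its signature are hypothetical stand-ins for the logic added to `WorkerGroup.poll_status`:

```python
class ControllerError(Exception):
    """Wraps an exception that occurred outside the user's training function."""

    def __init__(self, cause: Exception):
        super().__init__(f"Controller failure: {cause!r}")
        self.cause = cause


def run_callbacks_safely(callbacks, status):
    """Invoke after_worker_group_poll_status hooks, returning the first
    failure wrapped in a ControllerError instead of letting it propagate raw."""
    for cb in callbacks:
        try:
            cb(status)
        except Exception as e:
            # Surfaced to the FailurePolicy, which marks the run ERRORED.
            return ControllerError(e)
    return None
```

In the actual change, `WorkerGroup.poll_status` returns the caught exception and the `TrainController` treats that return value as a controller-level failure to feed into the `FailurePolicy`.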

I considered implementing a more general solution that caught all `WorkerGroupCallback` errors and resurfaced them as `ControllerError`s, but decided against it because:

  • Callbacks occur in many different places, and we might want to add custom try/catch logic in each case.
  • `after_worker_group_poll_status` is the only offender so far, and most of its errors stem from user mistakes; other callback errors could be legitimate bugs that should result in `ABORTED`.

Testing

Unit tests

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner October 18, 2025 01:09
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to gracefully handle exceptions from after_worker_group_poll_status callbacks by wrapping them in a ControllerError. The changes achieve this by modifying WorkerGroup.poll_status to catch and return exceptions from callbacks. The TrainController is updated to handle this new return value, correctly identifying these exceptions as controller-level failures. The changes are well-tested, with updates to existing tests and a new test case specifically for callback exceptions. I found one minor issue in a test case where an exception class was used instead of an instance to simulate a failure. Overall, this is a good change that improves error handling and robustness.

@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Oct 18, 2025
@matthewdeng matthewdeng enabled auto-merge (squash) October 21, 2025 22:56
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Oct 21, 2025
@matthewdeng matthewdeng merged commit f5abbb8 into ray-project:master Oct 22, 2025
8 checks passed
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Blaze-DSP pushed a commit to Blaze-DSP/ray that referenced this pull request Dec 18, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026