Skip to content

[train] Use FailurePolicy to handle resize failure#54257

Closed
xinyuangui2 wants to merge 24 commits intoray-project:masterfrom
xinyuangui2:xgui/handle-resize-failure
Closed

[train] Use FailurePolicy to handle resize failure#54257
xinyuangui2 wants to merge 24 commits intoray-project:masterfrom
xinyuangui2:xgui/handle-resize-failure

Conversation

@xinyuangui2
Copy link
Copy Markdown
Contributor

@xinyuangui2 xinyuangui2 commented Jul 1, 2025

Why are these changes needed?

This PR enables the FailurePolicy to handle worker group resize/startup failures, instead of retrying indefinitely. Previously, startup errors (WorkerGroupStartupTimeoutError, WorkerGroupStartupFailedError) would always retry without limit, ignoring the configured failure policy.

Changes:

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@xinyuangui2 xinyuangui2 force-pushed the xgui/handle-resize-failure branch from 04dd54e to bfb7e42 Compare July 1, 2025 21:05
@xinyuangui2 xinyuangui2 marked this pull request as ready for review July 2, 2025 00:45
@xinyuangui2 xinyuangui2 requested a review from a team as a code owner July 2, 2025 00:45
Copy link
Copy Markdown
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice so far! 😎

Copy link
Copy Markdown
Contributor

@matthewdeng matthewdeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty cool!

Copy link
Copy Markdown
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to change TrainingFailedError to also accept the controller-level worker group scheduling error, and I think it's better to actually decouple the 2 status classes. cc @matthewdeng

Also, note that this PR adds a new state controller transition from SCHEDULING -> ERRORED. (previously it only goes from RUNNING -> ERRORED.

@xinyuangui2
Copy link
Copy Markdown
Contributor Author

We need to change TrainingFailedError to also accept the controller-level worker group scheduling error, and I think it's better to actually decouple the 2 status classes. cc @matthewdeng

Also, note that this PR adds a new state controller transition from SCHEDULING -> ERRORED. (previously it only goes from RUNNING -> ERRORED.

Good call. For now I am hacking by setting it to {0: controller_error}. Added a TODO on it.

@cszhu cszhu added community-contribution Contributed by the community train Ray Train Related Issue labels Jul 10, 2025
Copy link
Copy Markdown
Contributor

@matthewdeng matthewdeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good!

Comment on lines 226 to 239
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current implementation there is a 1:1 mapping between TrainingFailedError:RESTART and ControllerError:RESCHEDULE, but I don't think we want to enforce this - generally the FailurePolicy should own the entirety of the decision making/validation logic. I can imagine that in the future there are cases where ControllerError can return RESTART as well.

@xinyuangui2 xinyuangui2 force-pushed the xgui/handle-resize-failure branch 3 times, most recently from 60c463c to c8b7472 Compare July 21, 2025 19:56
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 force-pushed the xgui/handle-resize-failure branch from c8b7472 to 73e1f5f Compare July 21, 2025 22:57
justinvyu pushed a commit that referenced this pull request Jul 22, 2025
…#54801)

The controller can raise a variety of errors. We distinguish these through two subclasses of `RayTrainError`:
* `TrainingFailedError` captures Train worker failures. If any of the workers failed, then this error is populated with a dict mapping worker rank to the error.
* `ControllerError` captures Train driver errors (the `TrainController`). For example, if there are too many worker group startup attempts that fail (see #54257), then the controller can error out.
---------

Signed-off-by: xgui <xgui@anyscale.com>
dshepelev15 pushed a commit to dshepelev15/ray that referenced this pull request Jul 22, 2025
…ray-project#54801)

The controller can raise a variety of errors. We distinguish these through two subclasses of `RayTrainError`:
* `TrainingFailedError` captures Train worker failures. If any of the workers failed, then this error is populated with a dict mapping worker rank to the error.
* `ControllerError` captures Train driver errors (the `TrainController`). For example, if there are too many worker group startup attempts that fail (see ray-project#54257), then the controller can error out.
---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: dshepelev15 <d-shepelev@list.ru>
@xinyuangui2
Copy link
Copy Markdown
Contributor Author

Moved to #54833

alimaazamat pushed a commit to alimaazamat/ray that referenced this pull request Jul 24, 2025
…ray-project#54801)

The controller can raise a variety of errors. We distinguish these through two subclasses of `RayTrainError`:
* `TrainingFailedError` captures Train worker failures. If any of the workers failed, then this error is populated with a dict mapping worker rank to the error.
* `ControllerError` captures Train driver errors (the `TrainController`). For example, if there are too many worker group startup attempts that fail (see ray-project#54257), then the controller can error out.
---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: alimaazamat <alima.azamat2003@gmail.com>
krishnakalyan3 pushed a commit to krishnakalyan3/ray that referenced this pull request Jul 30, 2025
…ray-project#54801)

The controller can raise a variety of errors. We distinguish these through two subclasses of `RayTrainError`:
* `TrainingFailedError` captures Train worker failures. If any of the workers failed, then this error is populated with a dict mapping worker rank to the error.
* `ControllerError` captures Train driver errors (the `TrainController`). For example, if there are too many worker group startup attempts that fail (see ray-project#54257), then the controller can error out.
---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Krishna Kalyan <krishnakalyan3@gmail.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
…ray-project#54801)

The controller can raise a variety of errors. We distinguish these through two subclasses of `RayTrainError`:
* `TrainingFailedError` captures Train worker failures. If any of the workers failed, then this error is populated with a dict mapping worker rank to the error.
* `ControllerError` captures Train driver errors (the `TrainController`). For example, if there are too many worker group startup attempts that fail (see ray-project#54257), then the controller can error out.
---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
…ray-project#54801)

The controller can raise a variety of errors. We distinguish these through two subclasses of `RayTrainError`:
* `TrainingFailedError` captures Train worker failures. If any of the workers failed, then this error is populated with a dict mapping worker rank to the error.
* `ControllerError` captures Train driver errors (the `TrainController`). For example, if there are too many worker group startup attempts that fail (see ray-project#54257), then the controller can error out.
---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community train Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants