[train] Use FailurePolicy to handle resize failure by xinyuangui2 · Pull Request #54257 · ray-project/ray

xinyuangui2 · 2025-07-01T20:47:04Z

Why are these changes needed?

This PR enables the FailurePolicy to handle worker group resize/startup failures, instead of retrying indefinitely. Previously, startup errors (WorkerGroupStartupTimeoutError, WorkerGroupStartupFailedError) would always retry without limit, ignoring the configured failure policy.

Changes:

- Use the new ControllerError defined in "[Train] Add ControllerError for the errors thrown from the controller" (#54633).
- FailurePolicy and _execute_failure_decision now receive either a training_failed_error or a controller_failed_error, and use whichever one is provided to decide the next state.
- FailurePolicy handles both training_failed_error and controller_failed_error.
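
As a rough illustration of the idea in this list, here is a minimal, self-contained sketch of a policy that accepts either kind of error and stops retrying after a limit. It is not the code in this PR: `BoundedFailurePolicy`, `make_decision`, `max_failures`, the `RAISE` decision, and the constructor arguments of the two error classes are assumptions made for the example.

```python
from enum import Enum
from typing import Dict, Union


class FailureDecision(Enum):
    RESTART = "RESTART"        # restart after a worker failure
    RESCHEDULE = "RESCHEDULE"  # try to start/resize the worker group again
    RAISE = "RAISE"            # assumption: give up and surface the error


class TrainingFailedError(Exception):
    """Stand-in for a worker-level failure (illustrative only)."""

    def __init__(self, worker_failures: Dict[int, Exception]):
        super().__init__(f"{len(worker_failures)} worker(s) failed")
        self.worker_failures = worker_failures


class ControllerError(Exception):
    """Stand-in for a controller-level failure (illustrative only)."""

    def __init__(self, controller_failure: Exception):
        super().__init__(str(controller_failure))
        self.controller_failure = controller_failure


class BoundedFailurePolicy:
    """Hypothetical policy: retry each kind of failure at most max_failures times."""

    def __init__(self, max_failures: int):
        self.max_failures = max_failures
        self._worker_failures = 0
        self._controller_failures = 0

    def make_decision(
        self, error: Union[TrainingFailedError, ControllerError]
    ) -> FailureDecision:
        if isinstance(error, ControllerError):
            self._controller_failures += 1
            if self._controller_failures > self.max_failures:
                return FailureDecision.RAISE
            return FailureDecision.RESCHEDULE
        if isinstance(error, TrainingFailedError):
            self._worker_failures += 1
            if self._worker_failures > self.max_failures:
                return FailureDecision.RAISE
            return FailureDecision.RESTART
        raise TypeError(f"Unsupported error type: {type(error).__name__}")


# Three consecutive startup failures with a limit of two retries.
policy = BoundedFailurePolicy(max_failures=2)
startup_timeout = ControllerError(TimeoutError("worker group startup timed out"))
decisions = [policy.make_decision(startup_timeout) for _ in range(3)]
```

With a limit of two, the first two startup failures are rescheduled and the third is raised, which is the "no longer retry indefinitely" behavior described above.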

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

justinvyu

nice so far! 😎

python/ray/train/v2/api/config.py

python/ray/train/v2/_internal/execution/controller/controller.py

python/ray/train/v2/_internal/execution/failure_handling/default.py

python/ray/train/v2/_internal/execution/worker_group/state.py

python/ray/train/v2/_internal/execution/worker_group/protocol.py

python/ray/train/v2/_internal/execution/worker_group/state.py

python/ray/train/v2/tests/test_failure_policy.py

python/ray/train/v2/_internal/execution/failure_handling/default.py

matthewdeng

pretty cool!

python/ray/train/v2/_internal/execution/failure_handling/default.py

python/ray/train/v2/_internal/execution/worker_group/protocol.py

python/ray/train/v2/_internal/execution/worker_group/state.py

python/ray/train/v2/api/config.py

justinvyu

We need to change TrainingFailedError to also accept the controller-level worker group scheduling error, and I think it's better to actually decouple the two status classes. cc @matthewdeng

Also, note that this PR adds a new controller state transition from SCHEDULING -> ERRORED (previously the controller only went from RUNNING -> ERRORED).
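
To make the new transition concrete, here is a small sketch of a transition table that allows SCHEDULING -> ERRORED alongside RUNNING -> ERRORED. The state names come from the comment above; the `ControllerState` enum, the `transition` helper, and the SCHEDULING -> RUNNING entry are assumptions for illustration, and the real controller has more states than are shown here.

```python
from enum import Enum, auto


class ControllerState(Enum):
    SCHEDULING = auto()
    RUNNING = auto()
    ERRORED = auto()


# Illustrative subset of allowed (current, next) transitions.
ALLOWED_TRANSITIONS = {
    (ControllerState.SCHEDULING, ControllerState.RUNNING),  # assumption
    (ControllerState.RUNNING, ControllerState.ERRORED),     # existing path
    (ControllerState.SCHEDULING, ControllerState.ERRORED),  # new in this PR
}


def transition(current: ControllerState, nxt: ControllerState) -> ControllerState:
    """Return the next state, or raise if the transition is not allowed."""
    if (current, nxt) not in ALLOWED_TRANSITIONS:
        raise ValueError(f"Invalid transition: {current.name} -> {nxt.name}")
    return nxt


# A startup failure can now end the run directly from SCHEDULING.
state = transition(ControllerState.SCHEDULING, ControllerState.ERRORED)
```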

python/ray/train/v2/_internal/execution/worker_group/protocol.py

python/ray/train/v2/_internal/execution/controller/controller.py

python/ray/train/v2/_internal/execution/failure_handling/default.py

xinyuangui2 · 2025-07-09T19:51:13Z

> We need to change TrainingFailedError to also accept the controller-level worker group scheduling error, and I think it's better to actually decouple the two status classes. cc @matthewdeng
>
> Also, note that this PR adds a new controller state transition from SCHEDULING -> ERRORED (previously the controller only went from RUNNING -> ERRORED).

Good call. For now I am hacking by setting it to {0: controller_error}. Added a TODO on it.
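
A minimal sketch of what this temporary workaround might look like, assuming TrainingFailedError carries a dict mapping worker rank to error. The class body and the `as_training_failed_error` helper are hypothetical and only illustrate the `{0: controller_error}` shape mentioned above.

```python
from typing import Dict


class TrainingFailedError(Exception):
    """Stand-in for the worker-failure error (illustrative only)."""

    def __init__(self, worker_failures: Dict[int, Exception]):
        super().__init__(f"{len(worker_failures)} worker(s) failed")
        self.worker_failures = worker_failures


def as_training_failed_error(controller_error: Exception) -> TrainingFailedError:
    # TODO: replace with a dedicated controller-level error type.
    # Until then, the controller error is filed under rank 0 as a placeholder.
    return TrainingFailedError({0: controller_error})


startup_error = TimeoutError("worker group startup timed out")
wrapped = as_training_failed_error(startup_error)
```

The drawback is that rank 0 did not actually fail, which is one reason to decouple the two error types as suggested in the quoted comment.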

python/ray/train/v2/_internal/execution/failure_handling/failure_policy.py

python/ray/train/v2/_internal/execution/controller/controller.py

python/ray/train/v2/_internal/execution/failure_handling/default.py

matthewdeng

Looking pretty good!

python/ray/train/v2/_internal/execution/failure_handling/failure_policy.py

python/ray/train/v2/api/config.py

python/ray/train/v2/_internal/execution/failure_handling/default.py

python/ray/train/v2/_internal/execution/worker_group/poll.py

python/ray/train/v2/_internal/execution/controller/controller.py

python/ray/train/v2/_internal/execution/worker_group/state.py

matthewdeng · 2025-07-20T01:08:46Z

python/ray/train/v2/_internal/execution/controller/controller.py

In the current implementation there is a 1:1 mapping between TrainingFailedError:RESTART and ControllerError:RESCHEDULE, but I don't think we want to enforce this - generally the FailurePolicy should own the entirety of the decision making/validation logic. I can imagine that in the future there are cases where ControllerError can return RESTART as well.
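
One way to read this suggestion is sketched below: the controller maps a decision to a next state without checking which error type produced it, so a policy is free to return RESTART for a ControllerError. Every name here is hypothetical for the sketch (`execute_failure_decision`, the `RAISE` decision, and the `RESTARTING` / `RESCHEDULING` state names); only RESTART, RESCHEDULE, and the two error class names appear in the discussion.

```python
from enum import Enum


class FailureDecision(Enum):
    RESTART = "RESTART"
    RESCHEDULE = "RESCHEDULE"
    RAISE = "RAISE"  # assumption: not named in the comment above


class TrainingFailedError(Exception):
    pass


class ControllerError(Exception):
    pass


# Assumed next-state names, keyed only by the decision.
NEXT_STATE = {
    FailureDecision.RESTART: "RESTARTING",
    FailureDecision.RESCHEDULE: "RESCHEDULING",
    FailureDecision.RAISE: "ERRORED",
}


def execute_failure_decision(decision: FailureDecision, error: Exception) -> str:
    """Pick the next state from the decision alone.

    There is deliberately no check tying the decision to type(error):
    validating that pairing is left to the failure policy.
    """
    return NEXT_STATE[decision]


# A policy could legitimately answer a controller error with RESTART.
next_state = execute_failure_decision(
    FailureDecision.RESTART, ControllerError("startup failed")
)
```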

Signed-off-by: xgui <xgui@anyscale.com>

…#54801)

The controller can raise a variety of errors. We distinguish these through two subclasses of `RayTrainError`:

* `TrainingFailedError` captures Train worker failures. If any of the workers failed, then this error is populated with a dict mapping worker rank to the error.
* `ControllerError` captures Train driver errors (the `TrainController`). For example, if there are too many worker group startup attempts that fail (see #54257), then the controller can error out.

---------

Signed-off-by: xgui <xgui@anyscale.com>
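
The hierarchy described in this commit message can be sketched roughly as follows. The class names are taken from the message; the constructor signatures, the attribute names `worker_failures` and `controller_failure`, and the `summarize` helper are assumptions for illustration rather than the merged implementation.

```python
from typing import Dict


class RayTrainError(Exception):
    """Base class for errors surfaced by a Train run (illustrative body)."""


class TrainingFailedError(RayTrainError):
    """Worker failures, keyed by worker rank."""

    def __init__(self, worker_failures: Dict[int, Exception]):
        super().__init__(f"training failed on ranks {sorted(worker_failures)}")
        self.worker_failures = worker_failures


class ControllerError(RayTrainError):
    """A failure raised by the controller itself."""

    def __init__(self, controller_failure: Exception):
        super().__init__(f"controller failed: {controller_failure}")
        self.controller_failure = controller_failure


def summarize(error: RayTrainError) -> str:
    """Describe an error according to which subclass it is."""
    if isinstance(error, TrainingFailedError):
        ranks = ", ".join(str(rank) for rank in sorted(error.worker_failures))
        return f"worker failure on rank(s) {ranks}"
    if isinstance(error, ControllerError):
        return f"controller failure: {error.controller_failure}"
    return f"unknown Train error: {error}"


worker_summary = summarize(
    TrainingFailedError({1: RuntimeError("CUDA OOM"), 0: ValueError("bad batch")})
)
controller_summary = summarize(
    ControllerError(TimeoutError("worker group startup timed out"))
)
```

Because both classes share a base, calling code can catch `RayTrainError` once and branch on the subclass only where the distinction matters.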

xinyuangui2 · 2025-07-22T18:21:36Z

Moved to #54833

xinyuangui2 force-pushed the xgui/handle-resize-failure branch from 04dd54e to bfb7e42 Compare July 1, 2025 21:05

xinyuangui2 marked this pull request as ready for review July 2, 2025 00:45

xinyuangui2 requested a review from a team as a code owner July 2, 2025 00:45

justinvyu reviewed Jul 2, 2025

matthewdeng reviewed Jul 2, 2025

xinyuangui2 requested review from justinvyu and matthewdeng July 7, 2025 18:50

justinvyu reviewed Jul 8, 2025

python/ray/train/v2/_internal/execution/worker_group/protocol.py (outdated)

python/ray/train/v2/_internal/execution/controller/controller.py (outdated)

matthewdeng reviewed Jul 8, 2025

python/ray/train/v2/_internal/execution/failure_handling/default.py (outdated)

xinyuangui2 requested review from justinvyu and matthewdeng July 9, 2025 19:54

cszhu added the community-contribution (Contributed by the community) and train (Ray Train Related Issue) labels Jul 10, 2025

xinyuangui2 mentioned this pull request Jul 10, 2025

[Train] fail fast if pg can never be met #54402

Merged

8 tasks

TimothySeah reviewed Jul 14, 2025

xinyuangui2 removed request for justinvyu and matthewdeng July 15, 2025 00:32

xinyuangui2 mentioned this pull request Jul 18, 2025

[Train] Add ControllerError for the errors thrown from the controller #54633

Closed

8 tasks

xinyuangui2 force-pushed the xgui/handle-resize-failure branch from 08048e6 to 44c25d6 Compare July 18, 2025 19:47

matthewdeng reviewed Jul 20, 2025

xinyuangui2 force-pushed the xgui/handle-resize-failure branch 3 times, most recently from 60c463c to c8b7472 Compare July 21, 2025 19:56

xinyuangui2 added 6 commits July 21, 2025 22:56

add WorkerTrainingFailedError and SchedulingTrainingFailedError

d46d39d

Signed-off-by: xgui <xgui@anyscale.com>

define a new status that can be handled by policy

1c956fd

Signed-off-by: xgui <xgui@anyscale.com>

begin testing

b50b079

Signed-off-by: xgui <xgui@anyscale.com>

lint

7235977

Signed-off-by: xgui <xgui@anyscale.com>

lint

b765b72

Signed-off-by: xgui <xgui@anyscale.com>

sign off

b6c6c38

Signed-off-by: xgui <xgui@anyscale.com>

xinyuangui2 added 8 commits July 21, 2025 22:56

clean the api

1b8c412

Signed-off-by: xgui <xgui@anyscale.com>

sanity check

883525a

Signed-off-by: xgui <xgui@anyscale.com>

fix assert

b13dced

Signed-off-by: xgui <xgui@anyscale.com>

use controller failure

8c0f33a

Signed-off-by: xgui <xgui@anyscale.com>

fix get_error_string

19ec6fb

Signed-off-by: xgui <xgui@anyscale.com>

refactor using the trainingfailederror and controllerfailederror

d8cfe05

Signed-off-by: xgui <xgui@anyscale.com>

use union instead of 2 optionals

78121ed

Signed-off-by: xgui <xgui@anyscale.com>

fix unittests

73e1f5f

Signed-off-by: xgui <xgui@anyscale.com>

xinyuangui2 force-pushed the xgui/handle-resize-failure branch from c8b7472 to 73e1f5f Compare July 21, 2025 22:57

xinyuangui2 requested review from a team, SongGuyang, WangTaoTheTonic, kfstorm and raulchen as code owners July 21, 2025 22:57

xinyuangui2 closed this Jul 21, 2025

justinvyu mentioned this pull request Jul 21, 2025

[Train] Add ControllerError for the errors thrown from the controller #54801

Merged

8 tasks
