[Train] Use failure policy to handle ControllerError by xinyuangui2 · Pull Request #54833 · ray-project/ray

xinyuangui2 · 2025-07-22T18:20:35Z

Why are these changes needed?

This PR includes several changes:

Exceptions
- Introduce a TrainingFailedError type, which is the union of WorkerGroupError and ControllerError. The FailurePolicy handles this new type.
controller.py
- _start_worker_group catches all exceptions and wraps them in a ControllerError. The ControllerError is returned if it is not None.
- FailurePolicy receives a TrainingFailedError and returns one FailureDecision.
- _execute_failure_decision receives the FailureDecision and the TrainingFailedError and decides the next State.
DefaultFailurePolicy
- For ControllerError, we check whether it is one of the RETRYABLE_CONTROLLER_ERRORS. If not, we raise it directly.
- For TrainingFailedError, we always retry if there is still budget.
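The flow described above can be sketched roughly as follows. This is a minimal illustration, not Ray Train's implementation: the class bodies, the contents of RETRYABLE_CONTROLLER_ERRORS (shown here as a tuple of exception types), the RETRY/RAISE decision names, and the `decide` function with its `retries_used` and `max_failures` parameters are all assumptions made for the example.

```python
from enum import Enum
from typing import Tuple, Type, Union


class WorkerGroupError(Exception):
    """Stand-in for a failure reported by the training workers."""


class ControllerError(Exception):
    """Stand-in that wraps an exception raised inside the controller."""

    def __init__(self, controller_failure: Exception):
        super().__init__(str(controller_failure))
        self.controller_failure = controller_failure


# The union described in the PR: either kind of failure reaches the policy.
TrainingFailedError = Union[WorkerGroupError, ControllerError]

# Assumption: which wrapped exception types are treated as retryable.
RETRYABLE_CONTROLLER_ERRORS: Tuple[Type[Exception], ...] = (TimeoutError,)


class FailureDecision(Enum):
    RETRY = "RETRY"
    RAISE = "RAISE"


def decide(
    error: TrainingFailedError, retries_used: int, max_failures: int
) -> FailureDecision:
    """Hypothetical policy: raise non-retryable controller errors, else retry within budget."""
    if isinstance(error, ControllerError) and not isinstance(
        error.controller_failure, RETRYABLE_CONTROLLER_ERRORS
    ):
        return FailureDecision.RAISE
    if retries_used < max_failures:
        return FailureDecision.RETRY
    return FailureDecision.RAISE
```

With these stand-ins, `decide(ControllerError(ValueError("bad")), 0, 3)` yields `FailureDecision.RAISE` regardless of the remaining budget, because `ValueError` is not in the assumed retryable set.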

What are things to improve

Should we create a new error class, RetryableControllerErrors, instead of using a dict to control which errors are retryable?

Related issue number

Examples

Retryable Controller Error

Nonretryable Controller Error

Worker Group Error

The ray log order is messed up here (not related to this PR).

Checks

Signed-off-by: xgui <xgui@anyscale.com>

gemini-code-assist

Summary of Changes

This pull request significantly enhances the fault tolerance capabilities of Ray Train by extending its failure policy to explicitly handle controller-level errors, in addition to existing worker-level training failures. It introduces a unified error type, WorkerOrControllerError, and refactors the core controller logic and failure handling mechanisms to leverage this new type, allowing for more granular and configurable recovery strategies for different types of failures.

Highlights

- Unified Error Handling: Introduced WorkerOrControllerError as a union type for TrainingFailedError (worker failures) and ControllerError (controller-level failures), streamlining error propagation and handling across the Ray Train controller.
- Expanded Failure Policy Scope: The FailurePolicy and its default implementation (DefaultFailurePolicy) are now capable of making decisions based on both worker-related and controller-related errors, providing a more comprehensive fault tolerance system.
- Controller Error Capture: The _start_worker_group method in the controller now catches all exceptions during worker group startup and wraps them as ControllerError, ensuring that controller-level issues are properly integrated into the failure handling pipeline.
- Configurable Controller Failure Limits: A new controller_failure_limit parameter has been added to FailureConfig, allowing users to specify the maximum number of retries for controller-specific errors, separate from the max_failures for training errors.
- State Management Updates: Various controller states (ReschedulingState, RestartingState, ErroredState) and the TrainControllerLoopIterationResult now consistently use the new WorkerOrControllerError type to store and propagate error information.
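To illustrate the idea of separate retry budgets for worker and controller failures, here is a small sketch. The `FailureConfigSketch` and `FailureBudget` classes, their defaults, and the convention that a negative limit means unlimited retries are assumptions for this example only; they are not the actual FailureConfig.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FailureConfigSketch:
    """Hypothetical stand-in holding the two limits named in the summary."""

    max_failures: int = 0
    controller_failure_limit: int = 0


@dataclass
class FailureBudget:
    """Tracks worker and controller failures against independent limits."""

    config: FailureConfigSketch
    counts: Dict[str, int] = field(
        default_factory=lambda: {"worker": 0, "controller": 0}
    )

    def record_and_check(self, kind: str) -> bool:
        """Record one failure of `kind`; return True if a retry is still allowed."""
        self.counts[kind] += 1
        if kind == "worker":
            limit = self.config.max_failures
        else:
            limit = self.config.controller_failure_limit
        # Assumption: a negative limit means retries are unlimited.
        return limit < 0 or self.counts[kind] <= limit
```

Keeping the counters separate means that exhausting the worker budget does not consume any of the controller budget, and vice versa.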

gemini-code-assist

Code Review

This PR refactors the failure handling logic to unify the handling of worker errors and controller errors under a new WorkerOrControllerError type. This is a good architectural improvement that makes the FailurePolicy more general. The changes are well-contained and include corresponding updates to tests. I've identified a few areas for improvement, including a potentially risky broad exception catch, a minor inconsistency in logging, and some opportunities for code simplification and improved design.
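The broad exception catch mentioned in the review can be pictured roughly like this. It is a sketch under stated assumptions: `start_worker_group` and its `start_fn` callback are hypothetical, and the `ControllerError` class here is a stand-in rather than the one in python/ray/train/v2/api/exceptions.py.

```python
from typing import Callable, Optional


class ControllerError(Exception):
    """Stand-in that wraps an exception raised inside the controller."""

    def __init__(self, controller_failure: Exception):
        super().__init__(str(controller_failure))
        self.controller_failure = controller_failure


def start_worker_group(start_fn: Callable[[], None]) -> Optional[ControllerError]:
    """Run the startup callback; return a wrapped error instead of raising."""
    try:
        start_fn()
    except Exception as exc:
        # Broad by design: any startup failure is handed to the failure policy.
        # `except Exception` still lets KeyboardInterrupt and SystemExit propagate.
        return ControllerError(exc)
    return None
```

Returning the wrapped error rather than raising it lets the caller pass it to a failure policy, at the cost of possibly hiding programming errors that would otherwise surface immediately.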

python/ray/train/v2/_internal/execution/controller/controller.py

python/ray/train/v2/_internal/execution/failure_handling/default.py

python/ray/train/v2/api/exceptions.py

justinvyu · 2025-07-23T20:46:36Z

python/ray/train/v2/_internal/execution/failure_handling/failure_policy.py


 class FailureDecision(Enum):
    RESTART = "RESTART"
+    RESCHEDULE = "RESCHEDULE"


I still think we should keep FailureDecision agnostic to the controller decisions that happen afterwards. RETRY, RAISE, NOOP seem like the right level of abstraction.
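One way to read this suggestion is that the policy returns only an abstract decision and the controller alone maps it to a concrete state. The sketch below assumes a hypothetical `ControllerState` enum and `next_state` helper; the mapping of controller errors to rescheduling and worker errors to restarting is an illustrative guess, not confirmed by the thread.

```python
from enum import Enum


class FailureDecision(Enum):
    """Policy-level decisions, independent of what the controller does next."""

    RETRY = "RETRY"
    RAISE = "RAISE"
    NOOP = "NOOP"


class ControllerState(Enum):
    """Hypothetical controller states used only for this example."""

    RUNNING = "RUNNING"
    RESTARTING = "RESTARTING"
    RESCHEDULING = "RESCHEDULING"
    ERRORED = "ERRORED"


def next_state(
    decision: FailureDecision,
    is_controller_error: bool,
    current: ControllerState,
) -> ControllerState:
    """Translate an abstract decision into a controller-specific state."""
    if decision is FailureDecision.RAISE:
        return ControllerState.ERRORED
    if decision is FailureDecision.RETRY:
        # Assumption: controller errors reschedule, worker errors restart.
        if is_controller_error:
            return ControllerState.RESCHEDULING
        return ControllerState.RESTARTING
    # NOOP leaves the controller where it is.
    return current
```

With this split, adding a new controller state would not require changing the policy interface.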

python/ray/train/v2/_internal/execution/failure_handling/default.py

justinvyu

😍 I love it! Thanks for working with me to iterate on this -- I think this is in a much better state than before this PR. Should be good to merge after this round!

python/ray/train/v2/_internal/execution/failure_handling/default.py

python/ray/train/v2/api/config.py

python/ray/train/v2/_internal/execution/failure_handling/default.py

python/ray/train/v2/tests/test_failure_policy.py

python/ray/train/v2/_internal/execution/controller/controller.py

python/ray/train/v2/api/config.py

python/ray/train/v2/tests/test_failure_policy.py

justinvyu

A few small nits to fix, then I can merge 🚀

python/ray/train/v2/tests/test_failure_policy.py

justinvyu

🙏 some final nits!

python/ray/train/v2/_internal/execution/failure_handling/default.py

xinyuangui2 added 2 commits July 22, 2025 07:25

controller error handler

63c651d

Signed-off-by: xgui <xgui@anyscale.com>

update

fd92b4c

Signed-off-by: xgui <xgui@anyscale.com>

xinyuangui2 requested a review from a team as a code owner July 22, 2025 18:20

xinyuangui2 requested review from justinvyu and matthewdeng July 22, 2025 18:20

gemini-code-assist bot reviewed Jul 22, 2025

xinyuangui2 mentioned this pull request Jul 22, 2025

[train] Use FailurePolicy to handle resize failure #54257

Closed

gemini-code-assist bot reviewed Jul 22, 2025

python/ray/train/v2/_internal/execution/controller/controller.py (resolved)

python/ray/train/v2/_internal/execution/failure_handling/default.py (outdated, resolved)

xinyuangui2 and others added 2 commits July 22, 2025 11:24

Update python/ray/train/v2/_internal/execution/failure_handling/defau…

544e0ac

…lt.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>

fix unittest

4a91244

Signed-off-by: xgui <xgui@anyscale.com>

justinvyu reviewed Jul 23, 2025

xinyuangui2 added 3 commits July 24, 2025 01:30

pending resolving comments

4f3c3d9

Signed-off-by: xgui <xgui@anyscale.com>

resolve comments

207ddef

Signed-off-by: xgui <xgui@anyscale.com>

fix unittest

2ef9fcd

Signed-off-by: xgui <xgui@anyscale.com>

xinyuangui2 requested a review from justinvyu July 24, 2025 18:07

xinyuangui2 added 2 commits July 24, 2025 19:07

fix build

e39e5e9

Signed-off-by: xgui <xgui@anyscale.com>

update doc

d50a96c

Signed-off-by: xgui <xgui@anyscale.com>

xinyuangui2 requested a review from a team as a code owner July 24, 2025 22:44

justinvyu reviewed Jul 25, 2025

xinyuangui2 and others added 8 commits July 25, 2025 12:38

Update python/ray/train/v2/tests/test_failure_policy.py

81e0e24

Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>

Update python/ray/train/v2/tests/test_failure_policy.py

cad2f35

Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>

Update python/ray/train/v2/tests/test_failure_policy.py

eb9037c

Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>

Update python/ray/train/v2/api/config.py

618a2b2

Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>

resolve comments

a40e6ff

Signed-off-by: xgui <xgui@anyscale.com>

fix unittests

0859c51

Signed-off-by: xgui <xgui@anyscale.com>

Update python/ray/train/v2/_internal/execution/failure_handling/defau…

f620d3f

…lt.py Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>

add error type

3f1bd74

Signed-off-by: xgui <xgui@anyscale.com>

xinyuangui2 requested a review from justinvyu July 26, 2025 03:00

justinvyu approved these changes Jul 28, 2025

python/ray/train/v2/tests/test_failure_policy.py (outdated, resolved)

python/ray/train/v2/tests/test_failure_policy.py (outdated, resolved)

Update python/ray/train/v2/tests/test_failure_policy.py

e828f5c

Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>

fix comments

980ebb3

Signed-off-by: xgui <xgui@anyscale.com>

xinyuangui2 requested a review from justinvyu July 28, 2025 06:44

justinvyu enabled auto-merge (squash) July 28, 2025 17:25

github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Jul 28, 2025

justinvyu reviewed Jul 28, 2025

python/ray/train/v2/_internal/execution/failure_handling/default.py (outdated, resolved)

python/ray/train/v2/_internal/execution/failure_handling/default.py (outdated, resolved)

justinvyu disabled auto-merge July 28, 2025 17:28

xinyuangui2 and others added 3 commits July 28, 2025 10:34

Update python/ray/train/v2/_internal/execution/failure_handling/defau…

e02828c

…lt.py Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>

resolve comments

df9b586

Signed-off-by: xgui <xgui@anyscale.com>

Merge branch 'master' into policy-to-handle-resize-failure

5a2639d

xinyuangui2 requested a review from justinvyu July 28, 2025 18:17

justinvyu merged commit 127a5a5 into ray-project:master Jul 28, 2025
5 checks passed
