[Train] Use failure policy to handle ControllerError#54833

Merged
justinvyu merged 22 commits into ray-project:master from xinyuangui2:policy-to-handle-resize-failure
Jul 28, 2025

@xinyuangui2 xinyuangui2 commented Jul 22, 2025

Why are these changes needed?

This PR includes several changes:

  • Exceptions

    • Introduce a TrainingFailedError type which is the union of WorkerGroupError and ControllerError. This new type will be handled by the FailurePolicy.
  • controller.py

    • _start_worker_group catches all exceptions and wraps them in a ControllerError. It returns the ControllerError if one occurred, otherwise None.
    • FailurePolicy receives a TrainingFailedError and returns one FailureDecision.
    • _execute_failure_decision receives the FailureDecision and the TrainingFailedError and decides the next State.
  • DefaultFailurePolicy

    • For ControllerError, we check whether it is one of the RETRYABLE_CONTROLLER_ERRORS. If not, we raise it directly.
    • For TrainingFailedError, we always retry if there is still budget.
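
The decision flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the class bodies, the `SketchFailurePolicy` name, its `max_failures` budget counter, the `make_decision` method, the `RETRY`/`RAISE` decision values, and the contents and shape of `RETRYABLE_CONTROLLER_ERRORS` are all assumptions made for the example.

```python
from enum import Enum
from typing import Tuple, Type, Union


class WorkerGroupError(Exception):
    """Stand-in for a failure raised by the training workers."""


class ControllerError(Exception):
    """Stand-in that wraps an exception raised inside the controller."""

    def __init__(self, cause: Exception):
        super().__init__(str(cause))
        self.cause = cause


# The PR describes TrainingFailedError as the union of the two error kinds.
TrainingFailedError = Union[WorkerGroupError, ControllerError]


class FailureDecision(Enum):
    RETRY = "RETRY"
    RAISE = "RAISE"


# Hypothetical contents: which wrapped exceptions count as retryable.
RETRYABLE_CONTROLLER_ERRORS: Tuple[Type[Exception], ...] = (TimeoutError,)


class SketchFailurePolicy:
    """Illustrative policy only; not Ray's DefaultFailurePolicy."""

    def __init__(self, max_failures: int):
        self.max_failures = max_failures
        self.failures_seen = 0

    def make_decision(self, error: TrainingFailedError) -> FailureDecision:
        # Controller errors that are not retryable are raised immediately
        # and do not consume any retry budget.
        if isinstance(error, ControllerError) and not isinstance(
            error.cause, RETRYABLE_CONTROLLER_ERRORS
        ):
            return FailureDecision.RAISE
        # Everything else is retried while budget remains.
        self.failures_seen += 1
        if self.failures_seen > self.max_failures:
            return FailureDecision.RAISE
        return FailureDecision.RETRY
```

In this sketch a single counter is shared by worker and retryable controller failures; whether the real policy keeps separate budgets is not something this example tries to settle.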

What are things to improve

  • Should we create one new error class RetryableControllerErrors instead of using a dict to control it?
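
To make the question concrete, the two options might look something like this. Both are hypothetical sketches: the shape and contents of the lookup, the helper names `is_retryable_by_lookup` and `is_retryable_by_type`, and the body of `RetryableControllerErrors` are illustrative and not taken from this PR.

```python
class ControllerError(Exception):
    """Stand-in for the controller error type discussed in this PR."""

    def __init__(self, cause: Exception):
        super().__init__(str(cause))
        self.cause = cause


# Option A: a lookup keyed by exception type (assumed shape and contents).
RETRYABLE_CONTROLLER_ERRORS = {TimeoutError: True}


def is_retryable_by_lookup(error: ControllerError) -> bool:
    # Retryable if the wrapped cause matches any type flagged True.
    return any(
        retryable and isinstance(error.cause, exc_type)
        for exc_type, retryable in RETRYABLE_CONTROLLER_ERRORS.items()
    )


# Option B: encode retryability in the class hierarchy instead.
class RetryableControllerErrors(ControllerError):
    """Hypothetical subclass raised for errors that are safe to retry."""


def is_retryable_by_type(error: ControllerError) -> bool:
    return isinstance(error, RetryableControllerErrors)
```

Option A keeps the retry rules in one place, while Option B moves the decision to the site where the error is wrapped.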

Related issue number

Examples

Retryable Controller Error

image

Nonretryable Controller Error

image

Worker Group Error

The Ray log lines appear out of order here (not related to this PR).

image

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
      • Running failures
      • retryable controller failure
      • non-retryable controller failure
    • This PR is not tested :(

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from a team as a code owner July 22, 2025 18:20

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @xinyuangui2, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the fault tolerance capabilities of Ray Train by extending its failure policy to explicitly handle controller-level errors, in addition to existing worker-level training failures. It introduces a unified error type, WorkerOrControllerError, and refactors the core controller logic and failure handling mechanisms to leverage this new type, allowing for more granular and configurable recovery strategies for different types of failures.

Highlights

  • Unified Error Handling: Introduced WorkerOrControllerError as a union type for TrainingFailedError (worker failures) and ControllerError (controller-level failures), streamlining error propagation and handling across the Ray Train controller.
  • Expanded Failure Policy Scope: The FailurePolicy and its default implementation (DefaultFailurePolicy) are now capable of making decisions based on both worker-related and controller-related errors, providing a more comprehensive fault tolerance system.
  • Controller Error Capture: The _start_worker_group method in the controller now catches all exceptions during worker group startup and wraps them as ControllerError, ensuring that controller-level issues are properly integrated into the failure handling pipeline.
  • Configurable Controller Failure Limits: A new controller_failure_limit parameter has been added to FailureConfig, allowing users to specify the maximum number of retries for controller-specific errors, separate from the max_failures for training errors.
  • State Management Updates: Various controller states (ReschedulingState, RestartingState, ErroredState) and the TrainControllerLoopIterationResult now consistently use the new WorkerOrControllerError type to store and propagate error information.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR refactors the failure handling logic to unify the handling of worker errors and controller errors under a new WorkerOrControllerError type. This is a good architectural improvement that makes the FailurePolicy more general. The changes are well-contained and include corresponding updates to tests. I've identified a few areas for improvement, including a potentially risky broad exception catch, a minor inconsistency in logging, and some opportunities for code simplification and improved design.

xinyuangui2 and others added 2 commits July 22, 2025 11:24
…lt.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>

class FailureDecision(Enum):
    RESTART = "RESTART"
    RESCHEDULE = "RESCHEDULE"

I still think we should keep FailureDecision agnostic to the controller decisions that happen afterwards. RETRY, RAISE, NOOP seem like the right level of abstraction.
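
A rough sketch of the separation suggested here: the policy returns only an abstract decision, and the controller translates it into a concrete next state. The state names are the ones mentioned in the summary earlier in this thread; the `next_state_for` helper, the use of plain strings for states, and the rule for choosing between restarting and rescheduling are assumptions for illustration only.

```python
from enum import Enum


class FailureDecision(Enum):
    RETRY = "RETRY"
    RAISE = "RAISE"
    NOOP = "NOOP"


class WorkerGroupError(Exception):
    """Stand-in for a worker-side failure."""


class ControllerError(Exception):
    """Stand-in for a controller-side failure."""


def next_state_for(decision: FailureDecision, error: Exception, current: str) -> str:
    """Map an abstract decision to a controller state name (illustrative)."""
    if decision is FailureDecision.RAISE:
        return "ErroredState"
    if decision is FailureDecision.RETRY:
        # Assumed rule: controller-side failures reschedule,
        # worker-side failures restart.
        if isinstance(error, ControllerError):
            return "ReschedulingState"
        return "RestartingState"
    # NOOP: leave the controller in whatever state it was already in.
    return current
```

With this split, the policy never needs to know which states exist; only the controller-side mapping does.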

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from justinvyu July 24, 2025 18:07
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from a team as a code owner July 24, 2025 22:44
@justinvyu justinvyu left a comment

😍 I love it! Thanks for working with me to iterate on this -- I think this is in a much better state than before this PR. Should be good to merge after this round!

xinyuangui2 and others added 8 commits July 25, 2025 12:38
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
…lt.py

Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from justinvyu July 26, 2025 03:00
@justinvyu justinvyu left a comment

A few small nits to fix, then I can merge 🚀

Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from justinvyu July 28, 2025 06:44
@justinvyu justinvyu enabled auto-merge (squash) July 28, 2025 17:25
@github-actions github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Jul 28, 2025
@justinvyu justinvyu left a comment

🙏 some final nits!

@justinvyu justinvyu disabled auto-merge July 28, 2025 17:28
xinyuangui2 and others added 3 commits July 28, 2025 10:34
…lt.py

Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from justinvyu July 28, 2025 18:17
@justinvyu justinvyu merged commit 127a5a5 into ray-project:master Jul 28, 2025
5 checks passed
krishnakalyan3 pushed a commit to krishnakalyan3/ray that referenced this pull request Jul 30, 2025
This PR includes several changes:

- Exceptions
  - Introduce a `TrainingFailedError` type which is the union of
    `WorkerGroupError` and `ControllerError`. This new type will be handled
    by the FailurePolicy.

- controller.py
  - `_start_worker_group` will catch all exceptions and wrap inside the
    `ControllerError`. The `ControllerError` is returned if not None.
  - `FailurePolicy` receives `TrainingFailedError` and return one
    `FailureDecision`.
  - `_execute_failure_decision` receives `FailureDecision` and
    `TrainingFailedError` and decide the next State.
- DefaultFailurePolicy
  - For `ControllerError`, we check if that is one of the
    `RETRYABLE_CONTROLLER_ERRORS`. If not, directly raise
  - For `TrainingFailedError`, we always retry if there is still budget.

---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Krishna Kalyan <krishnakalyan3@gmail.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Labels

go (add ONLY when ready to merge, run all tests)

2 participants