[Train] Use failure policy to handle ControllerError#54833
[Train] Use failure policy to handle ControllerError#54833justinvyu merged 22 commits intoray-project:masterfrom
Conversation
Signed-off-by: xgui <xgui@anyscale.com>
There was a problem hiding this comment.
Summary of Changes
Hello @xinyuangui2, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the fault tolerance capabilities of Ray Train by extending its failure policy to explicitly handle controller-level errors, in addition to existing worker-level training failures. It introduces a unified error type, WorkerOrControllerError, and refactors the core controller logic and failure handling mechanisms to leverage this new type, allowing for more granular and configurable recovery strategies for different types of failures.
Highlights
- Unified Error Handling: Introduced
WorkerOrControllerErroras a union type forTrainingFailedError(worker failures) andControllerError(controller-level failures), streamlining error propagation and handling across the Ray Train controller. - Expanded Failure Policy Scope: The
FailurePolicyand its default implementation (DefaultFailurePolicy) are now capable of making decisions based on both worker-related and controller-related errors, providing a more comprehensive fault tolerance system. - Controller Error Capture: The
_start_worker_groupmethod in the controller now catches all exceptions during worker group startup and wraps them asControllerError, ensuring that controller-level issues are properly integrated into the failure handling pipeline. - Configurable Controller Failure Limits: A new
controller_failure_limitparameter has been added toFailureConfig, allowing users to specify the maximum number of retries for controller-specific errors, separate from themax_failuresfor training errors. - State Management Updates: Various controller states (
ReschedulingState,RestartingState,ErroredState) and theTrainControllerLoopIterationResultnow consistently use the newWorkerOrControllerErrortype to store and propagate error information.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Code Review
This PR refactors the failure handling logic to unify the handling of worker errors and controller errors under a new WorkerOrControllerError type. This is a good architectural improvement that makes the FailurePolicy more general. The changes are well-contained and include corresponding updates to tests. I've identified a few areas for improvement, including a potentially risky broad exception catch, a minor inconsistency in logging, and some opportunities for code simplification and improved design.
python/ray/train/v2/_internal/execution/failure_handling/default.py
Outdated
Show resolved
Hide resolved
…lt.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
|
|
||
| class FailureDecision(Enum): | ||
| RESTART = "RESTART" | ||
| RESCHEDULE = "RESCHEDULE" |
There was a problem hiding this comment.
I still think we should keep FailureDecision agnostic to the controller decisions that happen afterwards. RETRY, RAISE, NOOP seem like the right level of abstraction.
python/ray/train/v2/_internal/execution/failure_handling/default.py
Outdated
Show resolved
Hide resolved
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
justinvyu
left a comment
There was a problem hiding this comment.
😍 I love it! Thanks for working with me to iterate on this -- I think this is in a much better state than before this PR. Should be good to merge after this round!
python/ray/train/v2/_internal/execution/failure_handling/default.py
Outdated
Show resolved
Hide resolved
python/ray/train/v2/_internal/execution/failure_handling/default.py
Outdated
Show resolved
Hide resolved
python/ray/train/v2/_internal/execution/controller/controller.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
…lt.py Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
justinvyu
left a comment
There was a problem hiding this comment.
A few small nits to fix, then I can merge 🚀
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
python/ray/train/v2/_internal/execution/failure_handling/default.py
Outdated
Show resolved
Hide resolved
python/ray/train/v2/_internal/execution/failure_handling/default.py
Outdated
Show resolved
Hide resolved
…lt.py Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
This PR includes several changes: - Exceptions - Introduce a `TrainingFailedError` type which is the union of `WorkerGroupError` and `ControllerError`. This new type will be handled by the FailurePolicy. - controller.py - `_start_worker_group` will catch all exceptions and wrap inside the `ControllerError`. The `ControllerError` is returned if not None. - `FailurePolicy` receives `TrainingFailedError` and return one `FailureDecision`. - `_execute_failure_decision` receives `FailureDecision` and `TrainingFailedError` and decide the next State. - DefaultFailurePolicy - For `ControllerError`, we check if that is one of the `RETRYABLE_CONTROLLER_ERRORS`. If not, directly raise - For `TrainingFailedError`, we always retry if there is still budget. --------- Signed-off-by: xgui <xgui@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Krishna Kalyan <krishnakalyan3@gmail.com>
This PR includes several changes: - Exceptions - Introduce a `TrainingFailedError` type which is the union of `WorkerGroupError` and `ControllerError`. This new type will be handled by the FailurePolicy. - controller.py - `_start_worker_group` will catch all exceptions and wrap inside the `ControllerError`. The `ControllerError` is returned if not None. - `FailurePolicy` receives `TrainingFailedError` and return one `FailureDecision`. - `_execute_failure_decision` receives `FailureDecision` and `TrainingFailedError` and decide the next State. - DefaultFailurePolicy - For `ControllerError`, we check if that is one of the `RETRYABLE_CONTROLLER_ERRORS`. If not, directly raise - For `TrainingFailedError`, we always retry if there is still budget. --------- Signed-off-by: xgui <xgui@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
This PR includes several changes: - Exceptions - Introduce a `TrainingFailedError` type which is the union of `WorkerGroupError` and `ControllerError`. This new type will be handled by the FailurePolicy. - controller.py - `_start_worker_group` will catch all exceptions and wrap inside the `ControllerError`. The `ControllerError` is returned if not None. - `FailurePolicy` receives `TrainingFailedError` and return one `FailureDecision`. - `_execute_failure_decision` receives `FailureDecision` and `TrainingFailedError` and decide the next State. - DefaultFailurePolicy - For `ControllerError`, we check if that is one of the `RETRYABLE_CONTROLLER_ERRORS`. If not, directly raise - For `TrainingFailedError`, we always retry if there is still budget. --------- Signed-off-by: xgui <xgui@anyscale.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Why are these changes needed?
This PR includes several changes:
Exceptions
TrainingFailedErrortype which is the union ofWorkerGroupErrorandControllerError. This new type will be handled by the FailurePolicy.controller.py
_start_worker_groupwill catch all exceptions and wrap inside theControllerError. TheControllerErroris returned if not None.FailurePolicyreceivesTrainingFailedErrorand return oneFailureDecision._execute_failure_decisionreceivesFailureDecisionandTrainingFailedErrorand decide the next State.DefaultFailurePolicy
ControllerError, we check if that is one of theRETRYABLE_CONTROLLER_ERRORS. If not, directly raiseTrainingFailedError, we always retry if there is still budget.What are things to improve
RetryableControllerErrorsinstead of using a dict to control it?Related issue number
Examples
Retryable Controller Error
Nonretryable Controller Error
Worker Group Error
The ray log order is messed up here (not related to this PR).
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.