
[Train] Add ControllerError for the errors thrown from the controller #54801

Merged
justinvyu merged 2 commits into ray-project:master from xinyuangui2:add-controller-exception
Jul 22, 2025

Conversation


@xinyuangui2 xinyuangui2 commented Jul 21, 2025

Why are these changes needed?

The controller can raise a variety of errors. We distinguish these through two subclasses of RayTrainError:

  • TrainingFailedError captures Train worker failures. If any of the workers failed, then this error is populated with a dict mapping worker rank to the error.
  • ControllerError captures Train driver errors (the TrainController). For example, if there are too many worker group startup attempts that fail (see [train] Use FailurePolicy to handle resize failure #54257), then the controller can error out.
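
As a rough illustration of the distinction described above, the two subclasses could be modeled as follows. This is a self-contained sketch, not the code in this PR: the constructor signatures, the `controller_failure` attribute, and the `describe` helper are assumptions made for illustration. Only the class names and the rank-to-error dict come from the description.

```python
from typing import Dict


class RayTrainError(Exception):
    """Stand-in base class for errors surfaced by Ray Train (sketch only)."""


class TrainingFailedError(RayTrainError):
    """Raised when one or more Train workers fail."""

    def __init__(self, error_message: str, worker_failures: Dict[int, Exception]):
        super().__init__(error_message)
        # Maps worker rank -> the exception raised on that worker.
        self.worker_failures = worker_failures


class ControllerError(RayTrainError):
    """Raised when the controller itself fails (assumed signature)."""

    def __init__(self, controller_failure: Exception):
        super().__init__(f"Controller failed: {controller_failure!r}")
        self.controller_failure = controller_failure


def describe(error: RayTrainError) -> str:
    """Hypothetical helper: report which side of the system failed."""
    if isinstance(error, TrainingFailedError):
        ranks = sorted(error.worker_failures)
        return f"worker failure on ranks {ranks}"
    if isinstance(error, ControllerError):
        return f"controller failure: {type(error.controller_failure).__name__}"
    return "unknown Ray Train error"
```

Because both classes share a base, a caller can catch `RayTrainError` once and branch on the subclass, or catch only the subclass it knows how to handle.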

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from justinvyu July 21, 2025 23:00
@xinyuangui2 xinyuangui2 requested a review from a team as a code owner July 21, 2025 23:00

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @xinyuangui2, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances Ray Train's error handling by introducing a dedicated exception type for controller-specific failures and refining the documentation for worker-related failures. These changes improve the granularity of error reporting and are foundational for future FailurePolicy implementations, while also ensuring that these custom exceptions are properly serializable for distributed operations.

Highlights

  • New Error Type for Controller Failures: I've introduced a new exception class, ControllerError, which inherits from RayTrainError. This class is specifically designed to be raised when training fails due to an error originating from the Ray Train controller, providing a clearer distinction from worker-related failures.
  • Refined TrainingFailedError Documentation: I've updated the docstring for the existing TrainingFailedError to explicitly state that it's raised for exceptions from training workers. The docstring now also clearly documents its error_message and worker_failures arguments, improving clarity for users.
  • Ensured Exception Picklability: I've added a new unit test (test_exceptions_are_picklable) to verify that both TrainingFailedError and the new ControllerError can be successfully serialized (pickled) and deserialized (unpickled). This is crucial for these custom exceptions to be reliably passed and handled across different processes in a distributed Ray environment.
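
The picklability check matters because an exception whose `__init__` takes more arguments than it forwards to `Exception.__init__` does not survive a pickle round trip by default: unpickling re-invokes the class with only `self.args`. The sketch below shows one common way to handle this with `__reduce__`, plus a round-trip check in the spirit of `test_exceptions_are_picklable`. The two classes and the `round_trip` helper are hypothetical stand-ins written for this example, not the code or the test from this PR.

```python
import pickle
from typing import Dict


class NaiveWorkerError(Exception):
    """Stand-in that forwards only the message to Exception.__init__."""

    def __init__(self, error_message: str, worker_failures: Dict[int, Exception]):
        super().__init__(error_message)  # self.args == (error_message,)
        self.worker_failures = worker_failures


class PicklableWorkerError(Exception):
    """Stand-in that tells pickle how to rebuild it with both arguments."""

    def __init__(self, error_message: str, worker_failures: Dict[int, Exception]):
        super().__init__(error_message)
        self.error_message = error_message
        self.worker_failures = worker_failures

    def __reduce__(self):
        # Return (callable, args) so unpickling calls __init__ with both values.
        return (self.__class__, (self.error_message, self.worker_failures))


def round_trip(error: Exception) -> Exception:
    """Serialize and deserialize an exception, as a cross-process hop would."""
    return pickle.loads(pickle.dumps(error))
```

With these definitions, `round_trip(PicklableWorkerError(...))` returns an equivalent error with its `worker_failures` intact, while `round_trip(NaiveWorkerError(...))` raises a `TypeError` during unpickling because the second constructor argument is missing.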


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new ControllerError exception to distinguish controller-level failures from worker failures, which are handled by TrainingFailedError. The docstring for TrainingFailedError is also improved for clarity. A comprehensive test is added to ensure both exception types are picklable, which is crucial in a distributed environment. The changes are logical and well-tested. I have one suggestion to improve code style consistency.

Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested a review from justinvyu July 21, 2025 23:31

@justinvyu justinvyu left a comment


thanks!

@justinvyu justinvyu enabled auto-merge (squash) July 21, 2025 23:41
@github-actions github-actions bot added the `go` label (add ONLY when ready to merge, run all tests) Jul 21, 2025
@justinvyu justinvyu merged commit bda0829 into ray-project:master Jul 22, 2025
6 of 7 checks passed
dshepelev15 pushed a commit to dshepelev15/ray that referenced this pull request Jul 22, 2025
…ray-project#54801)

The controller can raise a variety of errors. We distinguish these through two subclasses of `RayTrainError`:
* `TrainingFailedError` captures Train worker failures. If any of the workers failed, then this error is populated with a dict mapping worker rank to the error.
* `ControllerError` captures Train driver errors (the `TrainController`). For example, if there are too many worker group startup attempts that fail (see ray-project#54257), then the controller can error out.
---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: dshepelev15 <d-shepelev@list.ru>
alimaazamat pushed a commit to alimaazamat/ray that referenced this pull request Jul 24, 2025
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: alimaazamat <alima.azamat2003@gmail.com>
krishnakalyan3 pushed a commit to krishnakalyan3/ray that referenced this pull request Jul 30, 2025
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Krishna Kalyan <krishnakalyan3@gmail.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>