[Train] Add ControllerError for the errors thrown from the controller by xinyuangui2 · Pull Request #54801 · ray-project/ray

xinyuangui2 · 2025-07-21T23:00:52Z

Why are these changes needed?

The controller can raise a variety of errors. We distinguish these through two subclasses of RayTrainError:

- `TrainingFailedError` captures Train worker failures. If any of the workers failed, then this error is populated with a dict mapping worker rank to the error.
- `ControllerError` captures Train driver errors (the `TrainController`). For example, if there are too many worker group startup attempts that fail (see [train] Use FailurePolicy to handle resize failure #54257), then the controller can error out.
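The split described above can be sketched roughly as follows. This is a minimal illustration, not the code in `python/ray/train/v2/api/exceptions.py`: the classes are local stand-ins, the `ControllerError` constructor argument (`controller_failure`) and the `describe_failure` helper are assumptions made for the example, and only the `error_message` and `worker_failures` arguments of `TrainingFailedError` are mentioned in this PR.

```python
from typing import Dict


class RayTrainError(Exception):
    """Local stand-in for the Ray Train base error (illustrative only)."""


class TrainingFailedError(RayTrainError):
    """Stand-in for the error raised when training workers fail."""

    def __init__(self, error_message: str, worker_failures: Dict[int, Exception]):
        super().__init__(error_message)
        # Maps worker rank -> the exception raised on that worker.
        self.worker_failures = worker_failures


class ControllerError(RayTrainError):
    """Stand-in for the error raised when the controller itself fails."""

    # Assumed signature: the PR text does not spell out the constructor.
    def __init__(self, controller_failure: Exception):
        super().__init__(f"Controller failed: {controller_failure}")
        self.controller_failure = controller_failure


def describe_failure(exc: RayTrainError) -> str:
    """Hypothetical helper: tell worker failures from controller failures."""
    if isinstance(exc, TrainingFailedError):
        ranks = sorted(exc.worker_failures)
        return f"worker failure on ranks {ranks}"
    if isinstance(exc, ControllerError):
        return f"controller failure: {exc.controller_failure!r}"
    return "unknown train error"
```

Because both classes share a base, a caller that does not care about the distinction can still catch `RayTrainError` alone, while a caller that does can branch on the subclass.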

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: xgui <xgui@anyscale.com>

gemini-code-assist

Summary of Changes

Hello @xinyuangui2, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances Ray Train's error handling by introducing a dedicated exception type for controller-specific failures and refining the documentation for worker-related failures. These changes improve the granularity of error reporting and are foundational for future FailurePolicy implementations, while also ensuring that these custom exceptions are properly serializable for distributed operations.

Highlights

- New Error Type for Controller Failures: I've introduced a new exception class, `ControllerError`, which inherits from `RayTrainError`. This class is specifically designed to be raised when training fails due to an error originating from the Ray Train controller, providing a clearer distinction from worker-related failures.
- Refined TrainingFailedError Documentation: I've updated the docstring for the existing `TrainingFailedError` to explicitly state that it's raised for exceptions from training workers. The docstring now also clearly documents its `error_message` and `worker_failures` arguments, improving clarity for users.
- Ensured Exception Picklability: I've added a new unit test (`test_exceptions_are_picklable`) to verify that both `TrainingFailedError` and the new `ControllerError` can be successfully serialized (pickled) and deserialized (unpickled). This is crucial for these custom exceptions to be reliably passed and handled across different processes in a distributed Ray environment.
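To illustrate why a picklability test is worth having: a Python exception whose constructor takes more arguments than it forwards to `Exception.__init__` does not survive a pickle round trip by default, because unpickling calls the class with `self.args` only. The sketch below shows one common remedy, overriding `__reduce__`. The class `WorkerFailuresError` and the `roundtrip` helper are hypothetical names for this example, and nothing here is taken from the PR's diff or from `test_exceptions_are_picklable`.

```python
import pickle
from typing import Dict


class WorkerFailuresError(Exception):
    """Hypothetical exception with a two-argument constructor."""

    def __init__(self, error_message: str, worker_failures: Dict[int, Exception]):
        # Only the message is forwarded, so self.args == (error_message,).
        super().__init__(error_message)
        self.error_message = error_message
        self.worker_failures = worker_failures

    def __reduce__(self):
        # Tell pickle to rebuild the object with both constructor arguments,
        # instead of the default (cls, self.args), which would drop one.
        return (self.__class__, (self.error_message, self.worker_failures))


def roundtrip(exc: Exception) -> Exception:
    """Serialize and deserialize an exception, as a cross-process hop would."""
    return pickle.loads(pickle.dumps(exc))
```

A round-trip check then only needs to compare the type, the message, and the per-rank errors of the copy against the original.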

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a new ControllerError exception to distinguish controller-level failures from worker failures, which are handled by TrainingFailedError. The docstring for TrainingFailedError is also improved for clarity. A comprehensive test is added to ensure both exception types are picklable, which is crucial in a distributed environment. The changes are logical and well-tested. I have one suggestion to improve code style consistency.

python/ray/train/v2/api/exceptions.py

python/ray/train/v2/tests/test_v2_api.py

Signed-off-by: xgui <xgui@anyscale.com>

justinvyu

thanks!

…ray-project#54801)

The controller can raise a variety of errors. We distinguish these through two subclasses of `RayTrainError`:

* `TrainingFailedError` captures Train worker failures. If any of the workers failed, then this error is populated with a dict mapping worker rank to the error.
* `ControllerError` captures Train driver errors (the `TrainController`). For example, if there are too many worker group startup attempts that fail (see ray-project#54257), then the controller can error out.

---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: dshepelev15 <d-shepelev@list.ru>

Add controller exception

d58b571

Signed-off-by: xgui <xgui@anyscale.com>

xinyuangui2 requested a review from justinvyu July 21, 2025 23:00

xinyuangui2 requested a review from a team as a code owner July 21, 2025 23:00

gemini-code-assist bot reviewed Jul 21, 2025

python/ray/train/v2/api/exceptions.py (resolved)

xinyuangui2 mentioned this pull request Jul 21, 2025

[Train] Add ControllerError for the errors thrown from the controller #54633 (Closed, 8 tasks)

justinvyu reviewed Jul 21, 2025

python/ray/train/v2/tests/test_v2_api.py (outdated, resolved)

python/ray/train/v2/tests/test_v2_api.py (outdated, resolved)

python/ray/train/v2/tests/test_v2_api.py (outdated, resolved)

fix comments

19a24a7

Signed-off-by: xgui <xgui@anyscale.com>

xinyuangui2 requested a review from justinvyu July 21, 2025 23:31

justinvyu approved these changes Jul 21, 2025

justinvyu enabled auto-merge (squash) July 21, 2025 23:41

github-actions bot added the `go` label (add ONLY when ready to merge, run all tests) Jul 21, 2025

justinvyu merged commit bda0829 into ray-project:master Jul 22, 2025
6 of 7 checks passed

justinvyu merged 2 commits into ray-project:master from xinyuangui2:add-controller-exception
