[Train] Add ControllerError for the errors thrown from the controller #54801
justinvyu merged 2 commits into ray-project:master
Signed-off-by: xgui <xgui@anyscale.com>
Summary of Changes
Hello @xinyuangui2, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request enhances Ray Train's error handling by introducing a dedicated exception type for controller-specific failures and refining the documentation for worker-related failures. These changes improve the granularity of error reporting and are foundational for future FailurePolicy implementations, while also ensuring that these custom exceptions are properly serializable for distributed operations.
Highlights
- New Error Type for Controller Failures: I've introduced a new exception class, `ControllerError`, which inherits from `RayTrainError`. This class is specifically designed to be raised when training fails due to an error originating from the Ray Train controller, providing a clearer distinction from worker-related failures.
- Refined `TrainingFailedError` Documentation: I've updated the docstring for the existing `TrainingFailedError` to explicitly state that it's raised for exceptions from training workers. The docstring now also clearly documents its `error_message` and `worker_failures` arguments, improving clarity for users.
- Ensured Exception Picklability: I've added a new unit test (`test_exceptions_are_picklable`) to verify that both `TrainingFailedError` and the new `ControllerError` can be successfully serialized (pickled) and deserialized (unpickled). This is crucial for these custom exceptions to be reliably passed and handled across different processes in a distributed Ray environment.
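For readers who want to see why the picklability check matters, here is a minimal sketch of a pickle round trip. It does not import Ray: the three classes are stand-ins written for illustration, and the `controller_failure` argument, the message prefixes, the `__reduce__` overrides, and the `roundtrip` helper are assumptions rather than the code in this PR. Only the `error_message` and `worker_failures` argument names come from the summary above.

```python
import pickle
from typing import Dict


class RayTrainError(Exception):
    """Stand-in base class for illustration (not imported from Ray)."""


class TrainingFailedError(RayTrainError):
    """Stand-in for the worker-failure error; constructor shape is assumed."""

    def __init__(self, error_message: str, worker_failures: Dict[int, Exception]):
        super().__init__("Training failed due to worker errors:\n" + error_message)
        self._error_message = error_message
        self.worker_failures = worker_failures

    def __reduce__(self):
        # By default, an exception is rebuilt as cls(*self.args). Here args
        # holds only the formatted message, which does not match the
        # two-argument constructor, so the constructor arguments are
        # returned explicitly.
        return (self.__class__, (self._error_message, self.worker_failures))


class ControllerError(RayTrainError):
    """Stand-in for the controller error; `controller_failure` is hypothetical."""

    def __init__(self, controller_failure: Exception):
        super().__init__(f"Training failed due to controller error:\n{controller_failure}")
        self.controller_failure = controller_failure

    def __reduce__(self):
        return (self.__class__, (self.controller_failure,))


def roundtrip(exc: Exception) -> Exception:
    """Serialize and deserialize an exception, as a cross-process hop would."""
    return pickle.loads(pickle.dumps(exc))


worker_err = roundtrip(TrainingFailedError("rank 0 failed", {0: ValueError("boom")}))
controller_err = roundtrip(ControllerError(RuntimeError("startup failed")))
```

Without the `__reduce__` override, unpickling the stand-in `TrainingFailedError` would raise a `TypeError`, because only the single formatted message would be passed back to the constructor. Whether the real classes need the same override depends on how their constructors are written.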
Code Review
This pull request introduces a new ControllerError exception to distinguish controller-level failures from worker failures, which are handled by TrainingFailedError. The docstring for TrainingFailedError is also improved for clarity. A comprehensive test is added to ensure both exception types are picklable, which is crucial in a distributed environment. The changes are logical and well-tested. I have one suggestion to improve code style consistency.
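To illustrate the distinction the review describes, the sketch below shows how calling code might branch on the two error types. The classes are simplified stand-ins defined locally, not Ray's implementation, and the `classify_failure` helper and the `controller_failure` attribute are hypothetical names chosen for this example.

```python
from typing import Dict


class RayTrainError(Exception):
    """Stand-in base class for illustration (not imported from Ray)."""


class TrainingFailedError(RayTrainError):
    """Stand-in: carries a dict mapping worker rank to the error."""

    def __init__(self, error_message: str, worker_failures: Dict[int, Exception]):
        super().__init__(error_message)
        self.worker_failures = worker_failures


class ControllerError(RayTrainError):
    """Stand-in: wraps an error raised by the controller (attribute name assumed)."""

    def __init__(self, controller_failure: Exception):
        super().__init__(str(controller_failure))
        self.controller_failure = controller_failure


def classify_failure(exc: RayTrainError) -> str:
    """Hypothetical helper: summarize where a training failure originated."""
    if isinstance(exc, TrainingFailedError):
        # Worker-side failure: report which ranks failed.
        return f"worker failure on ranks {sorted(exc.worker_failures)}"
    if isinstance(exc, ControllerError):
        # Controller-side failure: report the wrapped exception.
        return f"controller failure: {exc.controller_failure!r}"
    return "unclassified Ray Train error"


worker_summary = classify_failure(
    TrainingFailedError("2 workers failed", {1: ValueError("a"), 0: KeyError("b")})
)
controller_summary = classify_failure(
    ControllerError(RuntimeError("too many startup attempts"))
)
```

Because both types share the `RayTrainError` base, a caller that does not care about the origin can still catch the base class, while a failure policy can inspect the subclass to decide how to react.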
Signed-off-by: xgui <xgui@anyscale.com>
…ray-project#54801)

The controller can raise a variety of errors. We distinguish these through two subclasses of `RayTrainError`:

* `TrainingFailedError` captures Train worker failures. If any of the workers failed, then this error is populated with a dict mapping worker rank to the error.
* `ControllerError` captures Train driver errors (the `TrainController`). For example, if there are too many worker group startup attempts that fail (see ray-project#54257), then the controller can error out.

---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: dshepelev15 <d-shepelev@list.ru>
Why are these changes needed?
The controller can raise a variety of errors. We distinguish these through two subclasses of `RayTrainError`:
* `TrainingFailedError` captures Train worker failures. If any of the workers failed, then this error is populated with a dict mapping worker rank to the error.
* `ControllerError` captures Train driver errors (the `TrainController`). For example, if there are too many worker group startup attempts that fail (see [train] Use FailurePolicy to handle resize failure #54257), then the controller can error out.
Related issue number
Checks
- … `git commit -s`) in this PR.
- … `scripts/format.sh` to lint the changes in this PR.
- … method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.