[train] Fix the xgboost v2 callback #54787
Conversation
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Summary of Changes
Hello @liulehui, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request refines the XGBoost v2 callback within Ray Train to ensure its compatibility and correct functionality when used exclusively with Ray Tune. The changes aim to decouple ray.train and ray.tune dependencies, allowing the callback to adapt its reporting mechanism based on the detected execution environment.
Highlights
- **Decoupling Train/Tune:** This pull request is part of a broader initiative to decouple `ray.train` and `ray.tune` dependencies, specifically addressing the XGBoost callback's behavior when used in a Tune-only context.
- **Conditional Reporting Logic:** The XGBoost callbacks now intelligently route metric and checkpoint reporting. Depending on whether the code is executing within a `ray.tune` session, they conditionally call `ray.tune.report` or `ray.train.report`.
- **Tune Session Detection:** To facilitate the conditional reporting, the `_in_tune_session` utility function from `ray.tune.trainable.trainable_fn_utils` has been imported and integrated to reliably determine the current execution context.
- **Improved Checkpoint Handling:** The `_get_checkpoint` method's logic has been enhanced to correctly manage checkpointing when running in a `ray.tune`-only environment, complementing the existing `ray.train` world-rank checks.
Code Review
This pull request aims to adapt the XGBoost callback for "Tune-only" usage by conditionally selecting between ray.tune.report and ray.train.report. The changes look good, but there's a critical bug in the after_iteration method where a function is not being called correctly, which would lead to incorrect behavior. I've also included a suggestion to refactor a similar block of code to improve maintainability.
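The review doesn't quote the offending line, but a common instance of "a function is not being called correctly" with a predicate like `_in_tune_session` is a missing pair of call parentheses, which makes the check always truthy. A minimal illustration of the pitfall (stand-in predicate, not the actual Ray code):

```python
def _in_tune_session() -> bool:
    # Stand-in predicate: pretend we are NOT inside a Ray Tune session.
    return False


# Buggy check: a function object is always truthy, so this would treat
# every run as a Tune session regardless of the actual context.
buggy_in_tune = bool(_in_tune_session)

# Correct check: call the function to get the actual boolean.
actual_in_tune = _in_tune_session()
```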
1. Make `ray.train.xgboost.RayTrainReportCallback` and `ray.tune.integration.xgboost.TuneReportCheckpointCallback` implement it.
2. Each subclass defines its own `report_fn` and `get_checkpoint` method.
3. Common logic stays in the abstract class.

Signed-off-by: Lehui Liu <lehui@anyscale.com>
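The class structure described in this commit message can be sketched as follows. This is a hypothetical outline, not the actual implementation: the base-class name and the `_report_fn`/`_get_checkpoint` method signatures are illustrative, and the `after_iteration` signature loosely follows xgboost's `TrainingCallback` convention; the `reported` lists exist only to make the sketch observable.

```python
from abc import ABC, abstractmethod


class _BaseReportCallback(ABC):
    """Hypothetical abstract base: shared after_iteration logic lives here;
    each subclass supplies its own report function and checkpoint retrieval."""

    @abstractmethod
    def _report_fn(self, metrics, checkpoint=None):
        ...

    @abstractmethod
    def _get_checkpoint(self, model):
        ...

    def after_iteration(self, model, epoch, evals_log):
        # Common logic: collect metrics, then delegate reporting.
        metrics = {"epoch": epoch, **evals_log}
        self._report_fn(metrics, checkpoint=self._get_checkpoint(model))
        return False  # xgboost convention: False means "do not stop training"


class TuneReportCheckpointCallback(_BaseReportCallback):
    def __init__(self):
        self.reported = []

    def _report_fn(self, metrics, checkpoint=None):
        self.reported.append(("tune", metrics, checkpoint))

    def _get_checkpoint(self, model):
        return None  # e.g. could checkpoint only every N iterations


class RayTrainReportCallback(_BaseReportCallback):
    def __init__(self):
        self.reported = []

    def _report_fn(self, metrics, checkpoint=None):
        self.reported.append(("train", metrics, checkpoint))

    def _get_checkpoint(self, model):
        return None
```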
1. In the Ray Train [revamp REP](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage), we decouple the `ray.train`/`ray.tune` dependency.
2. Hence, when `RayTrainReportCallback` reports metrics or a checkpoint (e.g. at this [line](https://github.com/ray-project/ray/blob/master/python/ray/train/xgboost/_xgboost_utils.py#L170)), the v2 context API throws a [RuntimeError](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/context.py#L279-L283).
3. In V1 this issue is mitigated by [switching to the Tune context](https://github.com/ray-project/ray/blob/master/python/ray/train/context.py#L126-L128) when `train.get_context()` is called.
4. To keep the XGBoost callback working in Tune-only usage, this PR uses `_in_tune_session()` to detect the Tune-only context explicitly before fetching the Train context in the V2 manner, and resolves reporting to `ray.tune.report` when running under Tune only, following the migration example [here](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage).

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Flip the flag for Tune doctest CI in preparation for turning on Train V2 by default. This doesn't have any behavior change, but it asserts that the ray.train -> ray.tune updates have all happened. Note that a few tests have been left behind because the Tune lightgbm and Keras callbacks have not been updated yet; we need to do the equivalent of this PR (#54787) for:

* `lightgbm_example`
* `lightgbm_example_cv`
* `tune_mnist_keras`

Deletes the `horovod_simple.ipynb` example because we don't support `HorovodTrainer` anymore.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Why are these changes needed?
Uses `_in_tune_session()` to get the context for this callback explicitly when it is used under Tune only, rather than fetching the Train context in the V2 manner, and resolves reporting to `ray.tune.report` for Tune-only usage based on the migration example here.

Related issue number
NA
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.