[train] Fix the xgboost v2 callback #54787
Conversation
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Summary of Changes
Hello @liulehui, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request refines the XGBoost v2 callback within Ray Train to ensure its compatibility and correct functionality when used exclusively with Ray Tune. The changes aim to decouple ray.train and ray.tune dependencies, allowing the callback to adapt its reporting mechanism based on the detected execution environment.
Highlights
- **Decoupling Train/Tune:** This pull request is part of a broader initiative to decouple `ray.train` and `ray.tune` dependencies, specifically addressing the XGBoost callback's behavior when used in a Tune-only context.
- **Conditional Reporting Logic:** The XGBoost callbacks now intelligently route metric and checkpoint reporting. Depending on whether the code is executing within a `ray.tune` session, they conditionally call `ray.tune.report` or `ray.train.report`.
- **Tune Session Detection:** To facilitate the conditional reporting, the `_in_tune_session` utility function from `ray.tune.trainable.trainable_fn_utils` has been imported and integrated to reliably determine the current execution context.
- **Improved Checkpoint Handling:** The `_get_checkpoint` method's logic has been enhanced to correctly manage checkpointing when running in a `ray.tune`-only environment, complementing the existing `ray.train` world-rank checks.
Code Review
This pull request aims to adapt the XGBoost callback for "Tune-only" usage by conditionally selecting between ray.tune.report and ray.train.report. The changes look good, but there's a critical bug in the after_iteration method where a function is not being called correctly, which would lead to incorrect behavior. I've also included a suggestion to refactor a similar block of code to improve maintainability.
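The review doesn't quote the offending line, but a common instance of "a function is not being called correctly" with a predicate like `_in_tune_session` is a missing pair of call parentheses, which makes the check always truthy. A minimal illustration of the pitfall (stand-in predicate, not the actual Ray code):

```python
def _in_tune_session() -> bool:
    # Stand-in predicate: pretend we are NOT inside a Ray Tune session.
    return False


# Buggy check: a function object is always truthy, so this would treat
# every run as a Tune session regardless of the actual context.
buggy_in_tune = bool(_in_tune_session)

# Correct check: call the function to get the actual boolean.
actual_in_tune = _in_tune_session()
```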
1. Make `ray.train.xgboost.RayTrainReportCallback` and `ray.tune.integration.xgboost.TuneReportCheckpointCallback` implement it.
2. Each subclass defines its own `report_fn` and `get_checkpoint` method.
3. Common logic stays in the abstract class.

Signed-off-by: Lehui Liu <lehui@anyscale.com>
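The class structure described in this commit message can be sketched as follows. This is a hypothetical outline, not the actual implementation: the base-class name and the `_report_fn`/`_get_checkpoint` method signatures are illustrative, and the `after_iteration` signature loosely follows xgboost's `TrainingCallback` convention; the `reported` lists exist only to make the sketch observable.

```python
from abc import ABC, abstractmethod


class _BaseReportCallback(ABC):
    """Hypothetical abstract base: shared after_iteration logic lives here;
    each subclass supplies its own report function and checkpoint retrieval."""

    @abstractmethod
    def _report_fn(self, metrics, checkpoint=None):
        ...

    @abstractmethod
    def _get_checkpoint(self, model):
        ...

    def after_iteration(self, model, epoch, evals_log):
        # Common logic: collect metrics, then delegate reporting.
        metrics = {"epoch": epoch, **evals_log}
        self._report_fn(metrics, checkpoint=self._get_checkpoint(model))
        return False  # xgboost convention: False means "do not stop training"


class TuneReportCheckpointCallback(_BaseReportCallback):
    def __init__(self):
        self.reported = []

    def _report_fn(self, metrics, checkpoint=None):
        self.reported.append(("tune", metrics, checkpoint))

    def _get_checkpoint(self, model):
        return None  # e.g. could checkpoint only every N iterations


class RayTrainReportCallback(_BaseReportCallback):
    def __init__(self):
        self.reported = []

    def _report_fn(self, metrics, checkpoint=None):
        self.reported.append(("train", metrics, checkpoint))

    def _get_checkpoint(self, model):
        return None
```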
1. In the Ray Train [revamp REP](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage), we decouple the `ray.train`/`ray.tune` dependency.
2. Hence, when `RayTrainReportCallback` reports metrics or a checkpoint (e.g. at this [line](https://github.com/ray-project/ray/blob/master/python/ray/train/xgboost/_xgboost_utils.py#L170)), the v2 context API throws a [RuntimeError](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/context.py#L279-L283).
3. In V1 this issue is mitigated by [switching to the Tune context](https://github.com/ray-project/ray/blob/master/python/ray/train/context.py#L126-L128) when `train.get_context()` is called.
4. To keep the XGBoost callback working in Tune-only usage, this PR uses `_in_tune_session()` to detect the Tune-only context explicitly before fetching the Train context in the V2 manner, and resolves reporting to `ray.tune.report` when running under Tune only, following the migration example [here](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage).

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Flip the flag for Tune doctest CI in preparation for turning on Train V2 by default. This doesn't have any behavior change, but it asserts that the ray.train -> ray.tune updates have all happened. Note that a few tests have been left behind because the Tune lightgbm and Keras callbacks have not been updated yet; we need to do the equivalent of this PR (#54787) for:

* `lightgbm_example`
* `lightgbm_example_cv`
* `tune_mnist_keras`

Deletes the `horovod_simple.ipynb` example because we don't support `HorovodTrainer` anymore.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Why are these changes needed?
Uses `_in_tune_session()` to get the context for this callback explicitly when it is used under Tune only, rather than fetching the Train context in the V2 manner, and resolves reporting to `ray.tune.report` for Tune-only usage based on the migration example here.

Related issue number
NA
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.