Skip to content

[train] Fix the xgboost v2 callback#54787

Merged
matthewdeng merged 19 commits intoray-project:masterfrom
liulehui:fix-v2-callback
Aug 4, 2025
Merged

[train] Fix the xgboost v2 callback#54787
matthewdeng merged 19 commits intoray-project:masterfrom
liulehui:fix-v2-callback

Conversation

@liulehui
Copy link
Copy Markdown
Contributor

@liulehui liulehui commented Jul 21, 2025

Why are these changes needed?

  1. in the ray train revamp REP, we decouple the ray train/ray tune dependency.
  2. Hence, when using RayTrainReportCallback when reporting metrics or checkpoint: e.g. in this line, the v2 context api will throw RuntimeError.
  3. In V1 this issue is mitigated by switch to Tune Context when train.get_context() is called.
  4. In order to make the xgboost tune only usage callback continue working, hence the bypass the use _is_tune_session() to get context for this callback explicitly if this is used in tune only when we trying to get train context in V2 manner and resolve ray.tune.report is tune only based on migration example here.

Related issue number

NA

Checks

  • [x ] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • tested locally
    • Release tests
    • This PR is not tested :(

Signed-off-by: Lehui Liu <lehui@anyscale.com>
@liulehui liulehui requested a review from a team as a code owner July 21, 2025 17:46
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @liulehui, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the XGBoost v2 callback within Ray Train to ensure its compatibility and correct functionality when used exclusively with Ray Tune. The changes aim to decouple ray.train and ray.tune dependencies, allowing the callback to adapt its reporting mechanism based on the detected execution environment.

Highlights

  • Decoupling Train/Tune: This pull request is part of a broader initiative to decouple ray.train and ray.tune dependencies, specifically addressing the XGBoost callback's behavior when used in a Tune-only context.
  • Conditional Reporting Logic: The XGBoostCallbacks now intelligently route metric and checkpoint reporting. Depending on whether the code is executing within a ray.tune session, it will conditionally call ray.tune.report or ray.train.report.
  • Tune Session Detection: To facilitate the conditional reporting, the _in_tune_session utility function from ray.tune.trainable.trainable_fn_utils has been imported and integrated to reliably determine the current execution context.
  • Improved Checkpoint Handling: The _get_checkpoint method's logic has been enhanced to correctly manage checkpointing when running in a ray.tune only environment, complementing the existing ray.train world rank checks.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to adapt the XGBoost callback for "Tune-only" usage by conditionally selecting between ray.tune.report and ray.train.report. The changes look good, but there's a critical bug in the after_iteration method where a function is not being called correctly, which would lead to incorrect behavior. I've also included a suggestion to refactor a similar block of code to improve maintainability.

Signed-off-by: Lehui Liu <lehui@anyscale.com>
@liulehui liulehui changed the title fix the xgboost v2 callback [train] Fix the xgboost v2 callback Jul 21, 2025
@liulehui liulehui self-assigned this Jul 21, 2025
liulehui added 2 commits July 22, 2025 13:42
1. make `ray.train.xgboost.RayTrainReportCallback` and `ray.tune.integration.xgboost.TuneReportCheckpointCall` implement it
2. each subclass define its own report_fn and get_checkpoint method
3. common logic stays in the abstract class

Signed-off-by: Lehui Liu <lehui@anyscale.com>
@liulehui liulehui requested a review from a team as a code owner July 23, 2025 00:06
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

1 similar comment
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

liulehui added 3 commits July 25, 2025 15:05
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
liulehui and others added 4 commits July 29, 2025 13:41
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Copy link
Copy Markdown
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! 🚀

@justinvyu justinvyu enabled auto-merge (squash) July 31, 2025 20:42
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jul 31, 2025
@github-actions github-actions bot disabled auto-merge August 1, 2025 16:23
Signed-off-by: Lehui Liu <lehui@anyscale.com>
@matthewdeng matthewdeng enabled auto-merge (squash) August 4, 2025 16:22
@matthewdeng matthewdeng merged commit c41d157 into ray-project:master Aug 4, 2025
6 checks passed
@liulehui liulehui deleted the fix-v2-callback branch August 4, 2025 16:53
elliot-barn pushed a commit that referenced this pull request Aug 4, 2025
1. in the ray train [revamp
REP](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage),
we decouple the ray train/ray tune dependency.
2. Hence, when using RayTrainReportCallback when reporting metrics or
checkpoint: e.g. in this
[line](https://github.com/ray-project/ray/blob/master/python/ray/train/xgboost/_xgboost_utils.py#L170),
the v2 context api will throw
[RuntimeError](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/context.py#L279-L283).
3. In V1 this issue is mitigated by [switch to Tune
Context](https://github.com/ray-project/ray/blob/master/python/ray/train/context.py#L126-L128)
when train.get_context() is called.
4. In order to make the xgboost tune only usage callback continue
working, hence the bypass the use `_is_tune_session()` to get context
for this callback explicitly if this is used in tune only when we trying
to get train context in V2 manner and resolve ray.tune.report is tune
only based on migration example
[here](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage).


Signed-off-by: Lehui Liu <lehui@anyscale.com>
kamil-kaczmarek pushed a commit that referenced this pull request Aug 4, 2025
1. in the ray train [revamp
REP](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage),
we decouple the ray train/ray tune dependency.
2. Hence, when using RayTrainReportCallback when reporting metrics or
checkpoint: e.g. in this
[line](https://github.com/ray-project/ray/blob/master/python/ray/train/xgboost/_xgboost_utils.py#L170),
the v2 context api will throw
[RuntimeError](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/context.py#L279-L283).
3. In V1 this issue is mitigated by [switch to Tune
Context](https://github.com/ray-project/ray/blob/master/python/ray/train/context.py#L126-L128)
when train.get_context() is called.
4. In order to make the xgboost tune only usage callback continue
working, hence the bypass the use `_is_tune_session()` to get context
for this callback explicitly if this is used in tune only when we trying
to get train context in V2 manner and resolve ray.tune.report is tune
only based on migration example
[here](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage).

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
mjacar pushed a commit to mjacar/ray that referenced this pull request Aug 5, 2025
1. in the ray train [revamp
REP](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage),
we decouple the ray train/ray tune dependency.
2. Hence, when using RayTrainReportCallback when reporting metrics or
checkpoint: e.g. in this
[line](https://github.com/ray-project/ray/blob/master/python/ray/train/xgboost/_xgboost_utils.py#L170),
the v2 context api will throw
[RuntimeError](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/context.py#L279-L283).
3. In V1 this issue is mitigated by [switch to Tune
Context](https://github.com/ray-project/ray/blob/master/python/ray/train/context.py#L126-L128)
when train.get_context() is called.
4. In order to make the xgboost tune only usage callback continue
working, hence the bypass the use `_is_tune_session()` to get context
for this callback explicitly if this is used in tune only when we trying
to get train context in V2 manner and resolve ray.tune.report is tune
only based on migration example
[here](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage).

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
sampan-s-nayak pushed a commit that referenced this pull request Aug 12, 2025
1. in the ray train [revamp
REP](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage),
we decouple the ray train/ray tune dependency.
2. Hence, when using RayTrainReportCallback when reporting metrics or
checkpoint: e.g. in this
[line](https://github.com/ray-project/ray/blob/master/python/ray/train/xgboost/_xgboost_utils.py#L170),
the v2 context api will throw
[RuntimeError](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/context.py#L279-L283).
3. In V1 this issue is mitigated by [switch to Tune
Context](https://github.com/ray-project/ray/blob/master/python/ray/train/context.py#L126-L128)
when train.get_context() is called.
4. In order to make the xgboost tune only usage callback continue
working, hence the bypass the use `_is_tune_session()` to get context
for this callback explicitly if this is used in tune only when we trying
to get train context in V2 manner and resolve ray.tune.report is tune
only based on migration example
[here](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage).


Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
1. in the ray train [revamp
REP](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage),
we decouple the ray train/ray tune dependency.
2. Hence, when using RayTrainReportCallback when reporting metrics or
checkpoint: e.g. in this
[line](https://github.com/ray-project/ray/blob/master/python/ray/train/xgboost/_xgboost_utils.py#L170),
the v2 context api will throw
[RuntimeError](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/context.py#L279-L283).
3. In V1 this issue is mitigated by [switch to Tune
Context](https://github.com/ray-project/ray/blob/master/python/ray/train/context.py#L126-L128)
when train.get_context() is called.
4. In order to make the xgboost tune only usage callback continue
working, hence the bypass the use `_is_tune_session()` to get context
for this callback explicitly if this is used in tune only when we trying
to get train context in V2 manner and resolve ray.tune.report is tune
only based on migration example
[here](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage).

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
justinvyu added a commit that referenced this pull request Sep 24, 2025
Flip the flag for Tune doctest CI in preparation for turning on Train V2
by default. This doesn't have any behavior change, but this asserts that
ray.train -> ray.tune updates have all happened.

Note that a few tests have been left behind due to Tune lightgbm and
Keras callbacks not being updated yet. We need to do the equivalent of
this PR: #54787
* `lightgbm_example`
* `lightgbm_example_cv`
* `tune_mnist_keras`

Deletes `horovod_simple.ipynb` example because we don't support
`HorovodTrainer` anymore.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
marcostephan pushed a commit to marcostephan/ray that referenced this pull request Sep 24, 2025
Flip the flag for Tune doctest CI in preparation for turning on Train V2
by default. This doesn't have any behavior change, but this asserts that
ray.train -> ray.tune updates have all happened.

Note that a few tests have been left behind due to Tune lightgbm and
Keras callbacks not being updated yet. We need to do the equivalent of
this PR: ray-project#54787
* `lightgbm_example`
* `lightgbm_example_cv`
* `tune_mnist_keras`

Deletes `horovod_simple.ipynb` example because we don't support
`HorovodTrainer` anymore.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Marco Stephan <marco@magic.dev>
elliot-barn pushed a commit that referenced this pull request Sep 27, 2025
Flip the flag for Tune doctest CI in preparation for turning on Train V2
by default. This doesn't have any behavior change, but this asserts that
ray.train -> ray.tune updates have all happened.

Note that a few tests have been left behind due to Tune lightgbm and
Keras callbacks not being updated yet. We need to do the equivalent of
this PR: #54787
* `lightgbm_example`
* `lightgbm_example_cv`
* `tune_mnist_keras`

Deletes `horovod_simple.ipynb` example because we don't support
`HorovodTrainer` anymore.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
1. in the ray train [revamp
REP](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage),
we decouple the ray train/ray tune dependency.
2. Hence, when using RayTrainReportCallback when reporting metrics or
checkpoint: e.g. in this
[line](https://github.com/ray-project/ray/blob/master/python/ray/train/xgboost/_xgboost_utils.py#L170),
the v2 context api will throw
[RuntimeError](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/context.py#L279-L283).
3. In V1 this issue is mitigated by [switch to Tune
Context](https://github.com/ray-project/ray/blob/master/python/ray/train/context.py#L126-L128)
when train.get_context() is called.
4. In order to make the xgboost tune only usage callback continue
working, hence the bypass the use `_is_tune_session()` to get context
for this callback explicitly if this is used in tune only when we trying
to get train context in V2 manner and resolve ray.tune.report is tune
only based on migration example
[here](https://github.com/ray-project/enhancements/blob/main/reps/2024-10-18-train-tune-api-revamp/2024-10-18-train-tune-api-revamp.md#tune-only-usage).

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
Flip the flag for Tune doctest CI in preparation for turning on Train V2
by default. This doesn't have any behavior change, but this asserts that
ray.train -> ray.tune updates have all happened.

Note that a few tests have been left behind due to Tune lightgbm and
Keras callbacks not being updated yet. We need to do the equivalent of
this PR: ray-project#54787
* `lightgbm_example`
* `lightgbm_example_cv`
* `tune_mnist_keras`

Deletes `horovod_simple.ipynb` example because we don't support
`HorovodTrainer` anymore.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
Flip the flag for Tune doctest CI in preparation for turning on Train V2
by default. This doesn't have any behavior change, but this asserts that
ray.train -> ray.tune updates have all happened.

Note that a few tests have been left behind due to Tune lightgbm and
Keras callbacks not being updated yet. We need to do the equivalent of
this PR: ray-project#54787
* `lightgbm_example`
* `lightgbm_example_cv`
* `tune_mnist_keras`

Deletes `horovod_simple.ipynb` example because we don't support
`HorovodTrainer` anymore.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Flip the flag for Tune doctest CI in preparation for turning on Train V2
by default. This doesn't have any behavior change, but this asserts that
ray.train -> ray.tune updates have all happened.

Note that a few tests have been left behind due to Tune lightgbm and
Keras callbacks not being updated yet. We need to do the equivalent of
this PR: ray-project#54787
* `lightgbm_example`
* `lightgbm_example_cv`
* `tune_mnist_keras`

Deletes `horovod_simple.ipynb` example because we don't support
`HorovodTrainer` anymore.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Flip the flag for Tune doctest CI in preparation for turning on Train V2
by default. This doesn't have any behavior change, but this asserts that
ray.train -> ray.tune updates have all happened.

Note that a few tests have been left behind due to Tune lightgbm and
Keras callbacks not being updated yet. We need to do the equivalent of
this PR: ray-project#54787
* `lightgbm_example`
* `lightgbm_example_cv`
* `tune_mnist_keras`

Deletes `horovod_simple.ipynb` example because we don't support
`HorovodTrainer` anymore.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants