[train] Deflake test_xgboost_trainer::test_fit by increasing CPU allocation#61640

Merged
justinvyu merged 1 commit into ray-project:master from JasonLi1909:deflake_xgboost_test
Mar 11, 2026

Conversation

@JasonLi1909
Contributor

@JasonLi1909 JasonLi1909 commented Mar 11, 2026

Summary

test_fit in test_xgboost_trainer.py is failing in CI due to a regression from the V2 cluster autoscaler becoming the default (#60474). The V2 autoscaler divides cluster resources equally among registered data executors via the AutoscalingCoordinator. With 2 datasets (train + valid), each executor gets 4 // 2 = 2 CPUs, but exclude_resources subtracts 3 CPUs (1 trainer + 2 workers), resulting in a negative resource limit (-1 CPU) and an assertion failure. Increasing from 4 to 8 CPUs avoids this: 8 // 2 = 4, and 4 - 3 = 1 CPU, which is non-negative.
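The arithmetic above can be sketched as follows; the helper is hypothetical, written only to illustrate the numbers from the description:

```python
def per_executor_cpu_limit(cluster_cpus: int, num_executors: int, excluded_cpus: int) -> int:
    """Hypothetical sketch: under the V2 autoscaler each data executor is
    allocated an equal share of the cluster, but exclude_resources is still
    subtracted in full from that share."""
    share = cluster_cpus // num_executors
    return share - excluded_cpus

# Original 4-CPU fixture: 4 // 2 - 3 == -1, a negative limit -> assertion failure.
assert per_executor_cpu_limit(4, 2, 3) == -1
# New 8-CPU fixture: 8 // 2 - 3 == 1, non-negative -> the test passes.
assert per_executor_cpu_limit(8, 2, 3) == 1
```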

Changes

  • Update test_fit in test_xgboost_trainer.py to use ray_start_8_cpus fixture instead of ray_start_4_cpus

Why this is a temporary mitigation

The root cause is a conflict between the V2 autoscaler's resource partitioning and exclude_resources subtraction in ResourceManager.get_global_limits(). The exclude_resources mechanism assumes total_resources reflects the full cluster, but V2 returns only each executor's allocated share. A proper fix should reconcile these two resource-scoping mechanisms. This change unblocks CI in the meantime.

Tests

No new tests. This modifies the existing test_fit to run with sufficient resources to avoid the V2 autoscaler regression.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@JasonLi1909 JasonLi1909 requested a review from a team as a code owner March 11, 2026 00:03
@justinvyu justinvyu enabled auto-merge (squash) March 11, 2026 00:14
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Mar 11, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a flaky test, test_fit, by increasing its CPU allocation from 4 to 8. This is a temporary mitigation for an issue with the V2 cluster autoscaler. The change is straightforward and well-explained in the description. My only suggestion is to add a TODO comment in the code to capture the context that this is a temporary fix, which will improve long-term maintainability.

@justinvyu justinvyu merged commit 3f72ae3 into ray-project:master Mar 11, 2026
7 checks passed
abrarsheikh pushed a commit that referenced this pull request Mar 11, 2026
…cation (#61640)

`test_fit` in `test_xgboost_trainer.py` is failing in CI due to a
regression from the V2 cluster autoscaler becoming the default
(`#60474`). The V2 autoscaler divides cluster resources equally among
registered data executors via the `AutoscalingCoordinator`. With 2
datasets (train + valid), each executor gets `4 // 2 = 2` CPUs, but
`exclude_resources` subtracts 3 CPUs (1 trainer + 2 workers), resulting
in a negative resource limit (`-1 CPU`) and an assertion failure.
Increasing from 4 to 8 CPUs avoids this: `8 // 2 = 4`, and `4 - 3 = 1
CPU`, which is non-negative.

The root cause is a conflict between the V2 autoscaler's resource
partitioning and `exclude_resources` subtraction in
`ResourceManager.get_global_limits()`. The `exclude_resources` mechanism
assumes `total_resources` reflects the full cluster, but V2 returns only
each executor's allocated share. A proper fix should reconcile these two
resource-scoping mechanisms. This change unblocks CI in the meantime.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
justinvyu pushed a commit that referenced this pull request Mar 12, 2026
Extension of #61640. This PR fixes `test_fit_with_advanced_scaling_config`,
`test_resume_from_checkpoint`, and `test_checkpoint_freq` in
`test_xgboost_trainer`, which were failing in CI, by increasing their CPU
resource allocations.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
matthewdeng added a commit that referenced this pull request Mar 19, 2026
…xedScalingPolicy (#61703)

## Summary

The V2 cluster autoscaler divides cluster resources equally among
registered data executors via the `AutoscalingCoordinator`. Previously,
`DataConfig.configure()` added training resources to each dataset's
`exclude_resources`, which `ResourceManager.get_global_limits()`
subtracted from total resources. Under V2, this returns only the
executor's allocated share (cluster / num_executors), so subtracting
training resources could yield negative limits, causing assertion
failures. This was the root cause of the test regression fixed in
#61640, and should also help deflake the `pytorch:torch_detection` tests
([buildkite](https://buildkite.com/ray-project/postmerge/builds/16449#019ce823-a291-474d-823d-adaf6c623ebb)).
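The scoping mismatch can be sketched as a V1-versus-V2 comparison; the constants below are illustrative values taken from the #61640 description, not code from the repository:

```python
# Hypothetical sketch of the mismatch: the exclude_resources subtraction
# assumes total_resources is the full cluster, but under V2 each executor
# sees only its allocated share (cluster / num_executors).
CLUSTER_CPUS = 4
NUM_EXECUTORS = 2   # train + valid datasets
TRAIN_CPUS = 3      # 1 trainer + 2 workers, subtracted via exclude_resources

# V1 view: total_resources == full cluster -> the limit stays non-negative.
v1_limit = CLUSTER_CPUS - TRAIN_CPUS
assert v1_limit == 1

# V2 view: total_resources == per-executor share -> the limit goes negative.
v2_limit = CLUSTER_CPUS // NUM_EXECUTORS - TRAIN_CPUS
assert v2_limit == -1
```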

## Changes

- **Pull up `AutoscalingCoordinator` integration into base
`ScalingPolicy`**: `ElasticScalingPolicy` already registered training
resources with the coordinator. This PR moves that logic into the base
`ScalingPolicy` class so `FixedScalingPolicy` gets the same behavior:
- `_maybe_send_resource_request()`, `_send_resource_request()`,
`_cancel_resource_request()` defined once in the base class
- Shared constants: `AUTOSCALING_REQUESTS_EXPIRE_TIME_S=180`,
`AUTOSCALING_REQUESTS_INTERVAL_S=20`,
`AUTOSCALING_REQUESTS_GET_TIMEOUT_S=5`
- Default `ControllerCallback` lifecycle: `after_controller_start`
registers resources,
`before_controller_shutdown`/`before_controller_abort` cancel requests
- Subclasses implement `_get_num_workers_for_resource_request()` to
return their worker count (`num_workers` for Fixed, `max_workers` for
Elastic)
- **Gate `exclude_resources` on V1 autoscaler**:
`DataConfig.configure()` no longer adds training resources to
`exclude_resources` under V2 (default). A new `_is_v2_autoscaler()`
helper controls this. Under V1 (`RAY_DATA_CLUSTER_AUTOSCALER=V1`), the
old path is preserved.
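The pull-up described above can be sketched as follows; the class and method names come from the description, but the bodies are illustrative stand-ins for the real `AutoscalingCoordinator` calls:

```python
from abc import ABC, abstractmethod

# Shared constants pulled into the base module (values from the description).
AUTOSCALING_REQUESTS_EXPIRE_TIME_S = 180
AUTOSCALING_REQUESTS_INTERVAL_S = 20
AUTOSCALING_REQUESTS_GET_TIMEOUT_S = 5


class ScalingPolicy(ABC):
    """Base class: resource-request plumbing defined once for all policies."""

    @abstractmethod
    def _get_num_workers_for_resource_request(self) -> int:
        """Subclasses return the worker count to register with the coordinator."""

    def _send_resource_request(self) -> None:
        # The real implementation talks to the AutoscalingCoordinator; this
        # stand-in just records what would be requested.
        self.requested_workers = self._get_num_workers_for_resource_request()

    def _cancel_resource_request(self) -> None:
        self.requested_workers = None


class FixedScalingPolicy(ScalingPolicy):
    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers

    def _get_num_workers_for_resource_request(self) -> int:
        return self.num_workers


class ElasticScalingPolicy(ScalingPolicy):
    def __init__(self, max_workers: int) -> None:
        self.max_workers = max_workers

    def _get_num_workers_for_resource_request(self) -> int:
        return self.max_workers


fixed = FixedScalingPolicy(num_workers=4)
fixed._send_resource_request()
assert fixed.requested_workers == 4

elastic = ElasticScalingPolicy(max_workers=8)
elastic._send_resource_request()
assert elastic.requested_workers == 8
```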

### Considerations when adding new scaling policies

When adding new scaling policies, revisit the shared base class defaults
if:

1. AutoscalingCoordinator integration is not necessary or a different
interface becomes available
2. Timeout and expiry constants need to diverge between policies
3. `_get_num_workers_for_resource_request()` needs to accommodate
variable worker counts beyond a simple fixed/max value
4. Controller lifecycle behavior diverges 

## Tests

- `test_datasets_callback` — updated to assert `exclude_resources ==
zero()` under V2
- `test_data_config_exclude_resources` — updated to expect only user-set
`exclude_resources`
- `test_exclude_train_resources_applies_to_each_dataset` — updated to
only assert user-defined per-dataset values
- `test_datasets_callback_v1_uses_exclude_resources` — **new**: verifies
`exclude_resources` is still set under V1
- `test_v2_no_negative_exclude_resources` — **new**: regression test
verifying the scenario that previously caused negative CPU limits no
longer fails
- `test_fixed_scaling_policy_coordinator_lifecycle` — **new**: verifies
`FixedScalingPolicy` registers resources on start, re-requests
periodically, and cancels on shutdown/abort
- `test_data_config_default_resource_limits` /
`_run_data_config_resource_test` — updated to reflect V2 behavior
- `test_elastic_scaling_policy.py` — updated imports to use shared
constants from base module

---------

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
pedrojeronim0 pushed a commit to pedrojeronim0/ray that referenced this pull request Mar 23, 2026
…xedScalingPolicy (ray-project#61703)

ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Mar 25, 2026
…xedScalingPolicy (ray-project#61703)
