[train] Deflake test_xgboost_trainer::test_fit by increasing CPU allocation #61640
Merged
justinvyu merged 1 commit into ray-project:master on Mar 11, 2026
Conversation
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
justinvyu
approved these changes
Mar 11, 2026
Code Review
This pull request addresses a flaky test, test_fit, by increasing its CPU allocation from 4 to 8. This is a temporary mitigation for an issue with the V2 cluster autoscaler. The change is straightforward and well-explained in the description. My only suggestion is to add a TODO comment in the code to capture the context that this is a temporary fix, which will improve long-term maintainability.
abrarsheikh
pushed a commit
that referenced
this pull request
Mar 11, 2026
…cation (#61640)

`test_fit` in `test_xgboost_trainer.py` is failing in CI due to a regression from the V2 cluster autoscaler becoming the default (#60474). The V2 autoscaler divides cluster resources equally among registered data executors via the `AutoscalingCoordinator`. With 2 datasets (train + valid), each executor gets `4 // 2 = 2` CPUs, but `exclude_resources` subtracts 3 CPUs (1 trainer + 2 workers), resulting in a negative resource limit (`-1 CPU`) and an assertion failure. Increasing from 4 to 8 CPUs avoids this: `8 // 2 = 4`, and `4 - 3 = 1 CPU`, which is non-negative.

The root cause is a conflict between the V2 autoscaler's resource partitioning and the `exclude_resources` subtraction in `ResourceManager.get_global_limits()`. The `exclude_resources` mechanism assumes `total_resources` reflects the full cluster, but V2 returns only each executor's allocated share. A proper fix should reconcile these two resource-scoping mechanisms. This change unblocks CI in the meantime.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
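The arithmetic in the commit message above can be sketched in a few lines. This is an illustrative model only; the function name `executor_cpu_limit` is hypothetical and not part of Ray's actual internals.

```python
def executor_cpu_limit(cluster_cpus: int, num_executors: int, excluded_cpus: int) -> int:
    """Model of the V2 resource math described in the commit message."""
    # V2 autoscaler: each registered data executor receives an equal
    # share of the cluster via the AutoscalingCoordinator.
    share = cluster_cpus // num_executors
    # exclude_resources then subtracts the training resources
    # (trainer + workers) from that per-executor share.
    return share - excluded_cpus

# Failing case: 4 CPUs, 2 datasets (train + valid), 3 CPUs excluded
# (1 trainer + 2 workers) -> 4 // 2 - 3 = -1, tripping the assertion.
assert executor_cpu_limit(4, 2, 3) == -1

# Mitigation in this PR: 8 CPUs -> 8 // 2 - 3 = 1, which is non-negative.
assert executor_cpu_limit(8, 2, 3) == 1
```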
justinvyu
pushed a commit
that referenced
this pull request
Mar 12, 2026
Extension of #61640. This PR fixes `test_fit_with_advanced_scaling_config`, `test_resume_from_checkpoint`, and `test_checkpoint_freq` in `test_xgboost_trainer`, which were failing in CI, by increasing their CPU resource allocations.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
matthewdeng
added a commit
that referenced
this pull request
Mar 19, 2026
…xedScalingPolicy (#61703)

## Summary

The V2 cluster autoscaler divides cluster resources equally among registered data executors via the `AutoscalingCoordinator`. Previously, `DataConfig.configure()` added training resources to each dataset's `exclude_resources`, which `ResourceManager.get_global_limits()` subtracted from total resources. Under V2, this returns only the executor's allocated share (cluster / num_executors), so subtracting training resources could yield negative limits, causing assertion failures. This was the root cause of the test regression fixed in #61640, and should also help deflake the `pytorch:torch_detection` tests ([buildkite](https://buildkite.com/ray-project/postmerge/builds/16449#019ce823-a291-474d-823d-adaf6c623ebb)).

## Changes

- **Pull up `AutoscalingCoordinator` integration into base `ScalingPolicy`**: `ElasticScalingPolicy` already registered training resources with the coordinator. This PR moves that logic into the base `ScalingPolicy` class so `FixedScalingPolicy` gets the same behavior:
  - `_maybe_send_resource_request()`, `_send_resource_request()`, `_cancel_resource_request()` defined once in the base class
  - Shared constants: `AUTOSCALING_REQUESTS_EXPIRE_TIME_S=180`, `AUTOSCALING_REQUESTS_INTERVAL_S=20`, `AUTOSCALING_REQUESTS_GET_TIMEOUT_S=5`
  - Default `ControllerCallback` lifecycle: `after_controller_start` registers resources, `before_controller_shutdown`/`before_controller_abort` cancel requests
  - Subclasses implement `_get_num_workers_for_resource_request()` to return their worker count (`num_workers` for Fixed, `max_workers` for Elastic)
- **Gate `exclude_resources` on V1 autoscaler**: `DataConfig.configure()` no longer adds training resources to `exclude_resources` under V2 (default). A new `_is_v2_autoscaler()` helper controls this. Under V1 (`RAY_DATA_CLUSTER_AUTOSCALER=V1`), the old path is preserved.

### Considerations when adding new scaling policies

When adding new scaling policies, revisit the shared base class defaults if:

1. `AutoscalingCoordinator` integration is not necessary or a different interface becomes available
2. Timeout and expiry constants need to diverge between policies
3. `_get_num_workers_for_resource_request()` needs to accommodate variable worker counts beyond a simple fixed/max value
4. Controller lifecycle behavior diverges

## Tests

- `test_datasets_callback` — updated to assert `exclude_resources == zero()` under V2
- `test_data_config_exclude_resources` — updated to expect only user-set `exclude_resources`
- `test_exclude_train_resources_applies_to_each_dataset` — updated to only assert user-defined per-dataset values
- `test_datasets_callback_v1_uses_exclude_resources` — **new**: verifies `exclude_resources` is still set under V1
- `test_v2_no_negative_exclude_resources` — **new**: regression test verifying the scenario that previously caused negative CPU limits no longer fails
- `test_fixed_scaling_policy_coordinator_lifecycle` — **new**: verifies `FixedScalingPolicy` registers resources on start, re-requests periodically, and cancels on shutdown/abort
- `test_data_config_default_resource_limits` / `_run_data_config_resource_test` — updated to reflect V2 behavior
- `test_elastic_scaling_policy.py` — updated imports to use shared constants from base module

---

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
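The "pull up into the base class" refactor described above can be sketched as follows. The class names and `_get_num_workers_for_resource_request()` hook come from the PR description; the method bodies, constructors, and everything else here are assumptions, not Ray's actual implementation.

```python
class ScalingPolicy:
    """Base class: AutoscalingCoordinator integration lives here once."""

    def _get_num_workers_for_resource_request(self) -> int:
        # Subclasses report how many workers to request resources for.
        raise NotImplementedError

    def after_controller_start(self) -> None:
        # Default ControllerCallback lifecycle: register training
        # resources with the coordinator when the controller starts.
        self._send_resource_request(self._get_num_workers_for_resource_request())

    def before_controller_shutdown(self) -> None:
        # ...and cancel the request on shutdown (likewise on abort).
        self._cancel_resource_request()

    def _send_resource_request(self, num_workers: int) -> None:
        self._last_request = num_workers  # stand-in for the real request logic

    def _cancel_resource_request(self) -> None:
        self._last_request = None


class FixedScalingPolicy(ScalingPolicy):
    def __init__(self, num_workers: int):
        self.num_workers = num_workers

    def _get_num_workers_for_resource_request(self) -> int:
        return self.num_workers  # fixed worker count


class ElasticScalingPolicy(ScalingPolicy):
    def __init__(self, min_workers: int, max_workers: int):
        self.min_workers, self.max_workers = min_workers, max_workers

    def _get_num_workers_for_resource_request(self) -> int:
        return self.max_workers  # request up to the elastic maximum
```

The design choice mirrored here is that subclasses only supply a worker count, so both policies share the same registration and cancellation lifecycle by default.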
pedrojeronim0
pushed a commit
to pedrojeronim0/ray
that referenced
this pull request
Mar 23, 2026
…xedScalingPolicy (ray-project#61703)

Signed-off-by: Pedro Jeronimo <pedro.jeronimo@tecnico.ulisboa.pt>
ryanaoleary
pushed a commit
to ryanaoleary/ray
that referenced
this pull request
Mar 25, 2026
…xedScalingPolicy (ray-project#61703)
Summary
`test_fit` in `test_xgboost_trainer.py` is failing in CI due to a regression from the V2 cluster autoscaler becoming the default (#60474). The V2 autoscaler divides cluster resources equally among registered data executors via the `AutoscalingCoordinator`. With 2 datasets (train + valid), each executor gets `4 // 2 = 2` CPUs, but `exclude_resources` subtracts 3 CPUs (1 trainer + 2 workers), resulting in a negative resource limit (`-1 CPU`) and an assertion failure. Increasing from 4 to 8 CPUs avoids this: `8 // 2 = 4`, and `4 - 3 = 1 CPU`, which is non-negative.

Changes
Updated `test_fit` in `test_xgboost_trainer.py` to use the `ray_start_8_cpus` fixture instead of `ray_start_4_cpus`.

Why this is a temporary mitigation
The root cause is a conflict between the V2 autoscaler's resource partitioning and the `exclude_resources` subtraction in `ResourceManager.get_global_limits()`. The `exclude_resources` mechanism assumes `total_resources` reflects the full cluster, but V2 returns only each executor's allocated share. A proper fix should reconcile these two resource-scoping mechanisms. This change unblocks CI in the meantime.

Tests
No new tests. This modifies the existing `test_fit` to run with sufficient resources to avoid the V2 autoscaler regression.