[train] Deflake test_xgboost_trainer::test_fit by increasing CPU allocation#61640

Merged
justinvyu merged 1 commit into ray-project:master from JasonLi1909:deflake_xgboost_test
Mar 11, 2026

Conversation

@JasonLi1909
Contributor

@JasonLi1909 JasonLi1909 commented Mar 11, 2026

Summary

test_fit in test_xgboost_trainer.py is failing in CI due to a regression from the V2 cluster autoscaler becoming the default (#60474). The V2 autoscaler divides cluster resources equally among registered data executors via the AutoscalingCoordinator. With 2 datasets (train + valid), each executor gets 4 // 2 = 2 CPUs, but exclude_resources subtracts 3 CPUs (1 trainer + 2 workers), resulting in a negative resource limit (-1 CPU) and an assertion failure. Increasing from 4 to 8 CPUs avoids this: 8 // 2 = 4, and 4 - 3 = 1 CPU, which is non-negative.
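The arithmetic above can be sketched as follows; the helper is hypothetical, written only to illustrate the numbers from the description:

```python
def per_executor_cpu_limit(cluster_cpus: int, num_executors: int, excluded_cpus: int) -> int:
    """Hypothetical sketch: under the V2 autoscaler each data executor is
    allocated an equal share of the cluster, but exclude_resources is still
    subtracted in full from that share."""
    share = cluster_cpus // num_executors
    return share - excluded_cpus

# Original 4-CPU fixture: 4 // 2 - 3 == -1, a negative limit -> assertion failure.
assert per_executor_cpu_limit(4, 2, 3) == -1
# New 8-CPU fixture: 8 // 2 - 3 == 1, non-negative -> the test passes.
assert per_executor_cpu_limit(8, 2, 3) == 1
```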

Changes

  • Update test_fit in test_xgboost_trainer.py to use ray_start_8_cpus fixture instead of ray_start_4_cpus

Why this is a temporary mitigation

The root cause is a conflict between the V2 autoscaler's resource partitioning and exclude_resources subtraction in ResourceManager.get_global_limits(). The exclude_resources mechanism assumes total_resources reflects the full cluster, but V2 returns only each executor's allocated share. A proper fix should reconcile these two resource-scoping mechanisms. This change unblocks CI in the meantime.

Tests

No new tests. This modifies the existing test_fit to run with sufficient resources to avoid the V2 autoscaler regression.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@JasonLi1909 JasonLi1909 requested a review from a team as a code owner March 11, 2026 00:03
@justinvyu justinvyu enabled auto-merge (squash) March 11, 2026 00:14
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Mar 11, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a flaky test, test_fit, by increasing its CPU allocation from 4 to 8. This is a temporary mitigation for an issue with the V2 cluster autoscaler. The change is straightforward and well-explained in the description. My only suggestion is to add a TODO comment in the code to capture the context that this is a temporary fix, which will improve long-term maintainability.

@justinvyu justinvyu merged commit 3f72ae3 into ray-project:master Mar 11, 2026
7 checks passed
abrarsheikh pushed a commit that referenced this pull request Mar 11, 2026
…cation (#61640)

`test_fit` in `test_xgboost_trainer.py` is failing in CI due to a
regression from the V2 cluster autoscaler becoming the default
(`#60474`). The V2 autoscaler divides cluster resources equally among
registered data executors via the `AutoscalingCoordinator`. With 2
datasets (train + valid), each executor gets `4 // 2 = 2` CPUs, but
`exclude_resources` subtracts 3 CPUs (1 trainer + 2 workers), resulting
in a negative resource limit (`-1 CPU`) and an assertion failure.
Increasing from 4 to 8 CPUs avoids this: `8 // 2 = 4`, and `4 - 3 = 1
CPU`, which is non-negative.

The root cause is a conflict between the V2 autoscaler's resource
partitioning and `exclude_resources` subtraction in
`ResourceManager.get_global_limits()`. The `exclude_resources` mechanism
assumes `total_resources` reflects the full cluster, but V2 returns only
each executor's allocated share. A proper fix should reconcile these two
resource-scoping mechanisms. This change unblocks CI in the meantime.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
justinvyu pushed a commit that referenced this pull request Mar 12, 2026
Extension of #61640. This PR fixes `test_fit_with_advanced_scaling_config`,
`test_resume_from_checkpoint`, and `test_checkpoint_freq` in
`test_xgboost_trainer`, which were failing in CI, by increasing their CPU
resource allocations.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
matthewdeng added a commit that referenced this pull request Mar 19, 2026
…xedScalingPolicy (#61703)

## Summary

The V2 cluster autoscaler divides cluster resources equally among
registered data executors via the `AutoscalingCoordinator`. Previously,
`DataConfig.configure()` added training resources to each dataset's
`exclude_resources`, which `ResourceManager.get_global_limits()`
subtracted from total resources. Under V2, this returns only the
executor's allocated share (cluster / num_executors), so subtracting
training resources could yield negative limits, causing assertion
failures. This was the root cause of the test regression fixed in
#61640, and should also help deflake the `pytorch:torch_detection` tests
([buildkite](https://buildkite.com/ray-project/postmerge/builds/16449#019ce823-a291-474d-823d-adaf6c623ebb)).
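The scoping mismatch can be sketched as a V1-versus-V2 comparison; the constants below are illustrative values taken from the #61640 description, not code from the repository:

```python
# Hypothetical sketch of the mismatch: the exclude_resources subtraction
# assumes total_resources is the full cluster, but under V2 each executor
# sees only its allocated share (cluster / num_executors).
CLUSTER_CPUS = 4
NUM_EXECUTORS = 2   # train + valid datasets
TRAIN_CPUS = 3      # 1 trainer + 2 workers, subtracted via exclude_resources

# V1 view: total_resources == full cluster -> the limit stays non-negative.
v1_limit = CLUSTER_CPUS - TRAIN_CPUS
assert v1_limit == 1

# V2 view: total_resources == per-executor share -> the limit goes negative.
v2_limit = CLUSTER_CPUS // NUM_EXECUTORS - TRAIN_CPUS
assert v2_limit == -1
```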

## Changes

- **Pull up `AutoscalingCoordinator` integration into base
`ScalingPolicy`**: `ElasticScalingPolicy` already registered training
resources with the coordinator. This PR moves that logic into the base
`ScalingPolicy` class so `FixedScalingPolicy` gets the same behavior:
- `_maybe_send_resource_request()`, `_send_resource_request()`,
`_cancel_resource_request()` defined once in the base class
- Shared constants: `AUTOSCALING_REQUESTS_EXPIRE_TIME_S=180`,
`AUTOSCALING_REQUESTS_INTERVAL_S=20`,
`AUTOSCALING_REQUESTS_GET_TIMEOUT_S=5`
- Default `ControllerCallback` lifecycle: `after_controller_start`
registers resources,
`before_controller_shutdown`/`before_controller_abort` cancel requests
- Subclasses implement `_get_num_workers_for_resource_request()` to
return their worker count (`num_workers` for Fixed, `max_workers` for
Elastic)
- **Gate `exclude_resources` on V1 autoscaler**:
`DataConfig.configure()` no longer adds training resources to
`exclude_resources` under V2 (default). A new `_is_v2_autoscaler()`
helper controls this. Under V1 (`RAY_DATA_CLUSTER_AUTOSCALER=V1`), the
old path is preserved.
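The pull-up described above can be sketched as follows; the class and method names come from the description, but the bodies are illustrative stand-ins for the real `AutoscalingCoordinator` calls:

```python
from abc import ABC, abstractmethod

# Shared constants pulled into the base module (values from the description).
AUTOSCALING_REQUESTS_EXPIRE_TIME_S = 180
AUTOSCALING_REQUESTS_INTERVAL_S = 20
AUTOSCALING_REQUESTS_GET_TIMEOUT_S = 5


class ScalingPolicy(ABC):
    """Base class: resource-request plumbing defined once for all policies."""

    @abstractmethod
    def _get_num_workers_for_resource_request(self) -> int:
        """Subclasses return the worker count to register with the coordinator."""

    def _send_resource_request(self) -> None:
        # The real implementation talks to the AutoscalingCoordinator; this
        # stand-in just records what would be requested.
        self.requested_workers = self._get_num_workers_for_resource_request()

    def _cancel_resource_request(self) -> None:
        self.requested_workers = None


class FixedScalingPolicy(ScalingPolicy):
    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers

    def _get_num_workers_for_resource_request(self) -> int:
        return self.num_workers


class ElasticScalingPolicy(ScalingPolicy):
    def __init__(self, max_workers: int) -> None:
        self.max_workers = max_workers

    def _get_num_workers_for_resource_request(self) -> int:
        return self.max_workers


fixed = FixedScalingPolicy(num_workers=4)
fixed._send_resource_request()
assert fixed.requested_workers == 4

elastic = ElasticScalingPolicy(max_workers=8)
elastic._send_resource_request()
assert elastic.requested_workers == 8
```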

### Considerations when adding new scaling policies

When adding new scaling policies, revisit the shared base class defaults
if:

1. AutoscalingCoordinator integration is not necessary or a different
interface becomes available
2. Timeout and expiry constants need to diverge between policies
3. `_get_num_workers_for_resource_request()` needs to accommodate
variable worker counts beyond a simple fixed/max value
4. Controller lifecycle behavior diverges 

## Tests

- `test_datasets_callback` — updated to assert `exclude_resources ==
zero()` under V2
- `test_data_config_exclude_resources` — updated to expect only user-set
`exclude_resources`
- `test_exclude_train_resources_applies_to_each_dataset` — updated to
only assert user-defined per-dataset values
- `test_datasets_callback_v1_uses_exclude_resources` — **new**: verifies
`exclude_resources` is still set under V1
- `test_v2_no_negative_exclude_resources` — **new**: regression test
verifying the scenario that previously caused negative CPU limits no
longer fails
- `test_fixed_scaling_policy_coordinator_lifecycle` — **new**: verifies
`FixedScalingPolicy` registers resources on start, re-requests
periodically, and cancels on shutdown/abort
- `test_data_config_default_resource_limits` /
`_run_data_config_resource_test` — updated to reflect V2 behavior
- `test_elastic_scaling_policy.py` — updated imports to use shared
constants from base module

---------

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
pedrojeronim0 pushed a commit to pedrojeronim0/ray that referenced this pull request Mar 23, 2026
…xedScalingPolicy (ray-project#61703)

ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Mar 25, 2026
…xedScalingPolicy (ray-project#61703)
