
[train] Register training resources with AutoscalingCoordinator in FixedScalingPolicy #61703

Merged
matthewdeng merged 9 commits into ray-project:master from JasonLi1909:remove-exclude-resources-usage
Mar 19, 2026

Conversation

JasonLi1909 (Contributor) commented Mar 13, 2026

Summary

The V2 cluster autoscaler divides cluster resources equally among registered data executors via the AutoscalingCoordinator. Previously, DataConfig.configure() added training resources to each dataset's exclude_resources, which ResourceManager.get_global_limits() subtracted from total resources. Under V2, this returns only the executor's allocated share (cluster / num_executors), so subtracting training resources could yield negative limits, causing assertion failures. This was the root cause of the test regression fixed in #61640, and should also help deflake the pytorch:torch_detection tests (buildkite).

Changes

  • Pull up AutoscalingCoordinator integration into base ScalingPolicy: ElasticScalingPolicy already registered training resources with the coordinator. This PR moves that logic into the base ScalingPolicy class so FixedScalingPolicy gets the same behavior:
    • _maybe_send_resource_request(), _send_resource_request(), _cancel_resource_request() defined once in the base class
    • Shared constants: AUTOSCALING_REQUESTS_EXPIRE_TIME_S=180, AUTOSCALING_REQUESTS_INTERVAL_S=20, AUTOSCALING_REQUESTS_GET_TIMEOUT_S=5
    • Default ControllerCallback lifecycle: after_controller_start registers resources, before_controller_shutdown/before_controller_abort cancel requests
    • Subclasses implement _get_num_workers_for_resource_request() to return their worker count (num_workers for Fixed, max_workers for Elastic)
  • Gate exclude_resources on V1 autoscaler: DataConfig.configure() no longer adds training resources to exclude_resources under V2 (default). A new _is_v2_autoscaler() helper controls this. Under V1 (RAY_DATA_CLUSTER_AUTOSCALER=V1), the old path is preserved.
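
To make the pulled-up structure concrete, here is a minimal, self-contained sketch of the base-class resource-request lifecycle described above. The `StubCoordinator` class and all method bodies are hypothetical stand-ins (the real `AutoscalingCoordinator` is a Ray actor with a different interface); only the method names, hook names, and constant values come from the PR description.

```python
import abc
import time

# Shared constants pulled up into the base module (values from the PR).
AUTOSCALING_REQUESTS_EXPIRE_TIME_S = 180
AUTOSCALING_REQUESTS_INTERVAL_S = 20
AUTOSCALING_REQUESTS_GET_TIMEOUT_S = 5


class StubCoordinator:
    """Hypothetical stand-in for the real AutoscalingCoordinator actor."""

    def __init__(self):
        self.requests = {}

    def request_resources(self, requester_id, bundles, expire_s):
        self.requests[requester_id] = (bundles, expire_s)

    def cancel_request(self, requester_id):
        self.requests.pop(requester_id, None)


class ScalingPolicy(abc.ABC):
    """Base policy: owns the resource-request lifecycle shared by subclasses."""

    def __init__(self, coordinator, resources_per_worker):
        self._coordinator = coordinator
        self._resources_per_worker = resources_per_worker
        self._last_request_time = 0.0

    @abc.abstractmethod
    def _get_num_workers_for_resource_request(self) -> int:
        """Subclass hook: num_workers for Fixed, max_workers for Elastic."""

    def _send_resource_request(self):
        # One bundle per worker, sized by the per-worker resources.
        bundles = [
            dict(self._resources_per_worker)
            for _ in range(self._get_num_workers_for_resource_request())
        ]
        self._coordinator.request_resources(
            requester_id=id(self),
            bundles=bundles,
            expire_s=AUTOSCALING_REQUESTS_EXPIRE_TIME_S,
        )
        self._last_request_time = time.monotonic()

    def _maybe_send_resource_request(self):
        # Re-request periodically so the reservation never hits the 180s expiry.
        elapsed = time.monotonic() - self._last_request_time
        if elapsed >= AUTOSCALING_REQUESTS_INTERVAL_S:
            self._send_resource_request()

    def _cancel_resource_request(self):
        self._coordinator.cancel_request(id(self))

    # Default ControllerCallback-style lifecycle hooks.
    def after_controller_start(self):
        self._send_resource_request()

    def before_controller_shutdown(self):
        self._cancel_resource_request()


class FixedScalingPolicy(ScalingPolicy):
    def __init__(self, coordinator, resources_per_worker, num_workers):
        super().__init__(coordinator, resources_per_worker)
        self.num_workers = num_workers

    def _get_num_workers_for_resource_request(self) -> int:
        return self.num_workers
```

With this shape, `FixedScalingPolicy` only supplies a worker count; registration, periodic refresh, and cancellation all live in the base class.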

Considerations when adding new scaling policies

When adding new scaling policies, revisit the shared base class defaults if:

  1. AutoscalingCoordinator integration is not necessary or a different interface becomes available
  2. Timeout and expiry constants need to diverge between policies
  3. _get_num_workers_for_resource_request() needs to accommodate variable worker counts beyond a simple fixed/max value
  4. Controller lifecycle behavior diverges

Tests

  • test_datasets_callback — updated to assert exclude_resources == zero() under V2
  • test_data_config_exclude_resources — updated to expect only user-set exclude_resources
  • test_exclude_train_resources_applies_to_each_dataset — updated to only assert user-defined per-dataset values
  • test_datasets_callback_v1_uses_exclude_resources — new: verifies exclude_resources is still set under V1
  • test_v2_no_negative_exclude_resources — new: regression test verifying the scenario that previously caused negative CPU limits no longer fails
  • test_fixed_scaling_policy_coordinator_lifecycle — new: verifies FixedScalingPolicy registers resources on start, re-requests periodically, and cancels on shutdown/abort
  • test_data_config_default_resource_limits / _run_data_config_resource_test — updated to reflect V2 behavior
  • test_elastic_scaling_policy.py — updated imports to use shared constants from base module

…inator request in FixedScalingConfig

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@JasonLi1909 JasonLi1909 requested a review from a team as a code owner March 13, 2026 00:41
gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a bug in FixedScalingPolicy where resource calculations could result in negative values with the V2 autoscaler. The solution, which involves registering training resources with the AutoscalingCoordinator, aligns well with existing patterns in ElasticScalingPolicy and is a robust fix. The accompanying tests are thorough, covering the new logic, backward compatibility for the V1 autoscaler, and the specific regression case. My only suggestion is to refine the exception handling in fixed.py to be more specific, which will improve the code's robustness.

@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Mar 13, 2026
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Ensures the resource request to the AutoscalingCoordinator is kept alive
even while the worker group is not yet running, preventing the 180s
expiry from dropping the reservation.
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
matthewdeng (Contributor) commented Mar 18, 2026

Is this the exact same code as in ElasticScalingPolicy? Can you move shared logic into a common utility class and/or into the ScalingPolicy class?


Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
Merging the pull-up refactor between the fixed and elastic scaling policies. This change:

  • Integrates the Ray Data AutoscalingCoordinator as a default component of scaling policies
  • Sets default scaling policy lifecycles to request resources after controller start and to cancel requests on abort and shutdown
  • Creates a _get_num_workers_for_resource_request abstract method that standardizes the worker count for every request made
  • Pulls up common timeout and expiry constants

When adding new scaling policies in the future, we should revisit these changes, particularly if:

  1. AutoscalingCoordinator integration is not necessary or a different interface becomes available
  2. Timeout and expiry constants diverge
  3. _get_num_workers_for_resource_request needs to accommodate variable worker counts
  4. Controller lifecycle behavior diverges

Will add comments in code for future maintenance.
cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@matthewdeng matthewdeng enabled auto-merge (squash) March 18, 2026 21:45
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Mar 18, 2026
@matthewdeng matthewdeng merged commit a1c02a8 into ray-project:master Mar 19, 2026
7 of 8 checks passed
pedrojeronim0 pushed a commit to pedrojeronim0/ray that referenced this pull request Mar 23, 2026
…xedScalingPolicy (ray-project#61703)

Signed-off-by: Pedro Jeronimo <pedro.jeronimo@tecnico.ulisboa.pt>
justinvyu pushed a commit that referenced this pull request Mar 24, 2026
…rdinator changes (#61990)

After #61703, Ray Train no longer adds training resources to
`exclude_resources`. Instead, training resources are registered directly
with the `AutoscalingCoordinator`. The docs still said training
resources are "automatically excluded," which is misleading. This
updates the wording to reflect the new behavior.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Mar 25, 2026
…xedScalingPolicy (ray-project#61703)
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Mar 25, 2026
…rdinator changes (ray-project#61990)