
[Data] Prevent AutoscalingCoordinator from double-allocating resources when there are multiple datasets #59740

Merged
bveeramani merged 10 commits intoray-project:masterfrom
machichima:59685-double-allocate-resource
Dec 31, 2025

Conversation

@machichima
Contributor

@machichima machichima commented Dec 29, 2025

Description

Prevent double-allocating remaining resources when there are multiple datasets (multiple requests).

This PR divides the remaining resources by the number of pending resource requests to produce a fair allocation.
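The division scheme described above can be sketched as follows. This is a minimal standalone model, not the PR's actual code; `divide_remaining` and its arguments are illustrative names:

```python
# Sketch: divide each node's resources equally among the requesters that
# asked for remaining capacity. Integer division may under-allocate.
from typing import Dict, List


def divide_remaining(
    node_resources: List[Dict[str, int]], num_requesters: int
) -> List[Dict[str, int]]:
    """Return the per-node share that a single requester would receive."""
    return [
        {k: v // num_requesters for k, v in node.items()}
        for node in node_resources
    ]


shares = divide_remaining([{"CPU": 8, "GPU": 2}], num_requesters=2)
print(shares)  # [{'CPU': 4, 'GPU': 1}]
```

With two requesters, each receives half of every node's resources instead of both receiving the full amount, which is the double allocation the PR fixes.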

Related issues

Closes #59685


Signed-off-by: machichima <nary12321@gmail.com>
@machichima machichima requested a review from a team as a code owner December 29, 2025 13:36
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the issue of over-allocating remaining resources when multiple datasets have pending requests. The new logic correctly divides the available resources among the requesters, ensuring a fair distribution. The implementation is sound and is well-supported by a new unit test that verifies the corrected behavior. I have one minor suggestion to improve the readability and efficiency of the resource allocation logic.

@machichima
Contributor Author

@bveeramani I drafted a fix to perform fair allocation of the remaining resources. PTAL.
Thank you!

Signed-off-by: machichima <nary12321@gmail.com>
@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Dec 29, 2025
Member

@bveeramani bveeramani left a comment


Overall looks good. Just a few questions.

Comment on lines +404 to +417
# NOTE: to handle the case where multiple datasets are running concurrently,
# we divide remaining resources equally to all requesters with `request_remaining=True`.
num_remaining_requesters = len(remaining_resource_requesters)
if num_remaining_requesters > 0:
    for node_resource in cluster_node_resources:
        # Divide remaining resources equally among requesters.
        # NOTE: Integer division may leave some resources unallocated.
        divided_resource = {
            k: v // num_remaining_requesters
            for k, v in node_resource.items()
        }
        for ongoing_req in remaining_resource_requesters:
            if any(v > 0 for v in divided_resource.values()):
                ongoing_req.allocated_resources.append(divided_resource)
Member


What happens if we perform true division rather than integer division? Should we?

Contributor Author


I think we can only use integer division here, as the Autoscaler SDK only supports integer values:

# Round up the resource values to integers,
# because the Autoscaler SDK only accepts integer values.
for r in resources:
    for k in r:
        r[k] = math.ceil(r[k])
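To illustrate the trade-off being discussed: floor division leaves a remainder unallocated, while `math.ceil` on a true-division result would round requests up past the available capacity. A toy example (not Ray code):

```python
import math

# Three requesters share 10 CPUs. Floor division gives 3 each,
# leaving 1 CPU unallocated -- the "underallocating" mentioned above.
total, requesters = 10, 3
share = total // requesters
leftover = total - share * requesters
print(share, leftover)  # 3 1

# Rounding each requester's true-division share up instead would
# request 4 * 3 = 12 CPUs, overcommitting the 10 available.
print(math.ceil(total / requesters))  # 4
```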

Member


Ah, got it. Underallocating seems reasonable for now.

Comment on lines +428 to +432
with (
patch("ray.nodes", return_value=cluster_nodes),
patch("time.time", mock_time),
patch("ray.autoscaler.sdk.request_resources"),
):
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope for this PR, but I think this DefaultAutoscalingCoordinator would be more maintainable long-term if we used explicit seams rather than implicit dependencies:

class DefaultAutoscalingCoordinator:
    def __init__(
        self,
        ...,
        get_node_resources: Callable[[], List[ResourceDict]],
        get_time: Callable[[], float] = time.time,
        request_resources: Callable[[List[ResourceDict]], None] = ray.autoscaler.sdk.request_resources,
    ): ...

The problem with patching is that this test will break if we change some of these implementation details (e.g., use time.perf_counter instead of time.time), and it's also less clear what the dependencies of the component actually are.
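A sketch of how such a seam could be exercised in a test without `patch`. The class and method names below are hypothetical stand-ins following the suggestion above, not the actual Ray API:

```python
from typing import Callable, Dict, List

ResourceDict = Dict[str, float]


class Coordinator:
    """Toy stand-in for the proposed constructor-injected design."""

    def __init__(
        self,
        get_node_resources: Callable[[], List[ResourceDict]],
        get_time: Callable[[], float],
        request_resources: Callable[[List[ResourceDict]], None],
    ):
        self._get_node_resources = get_node_resources
        self._get_time = get_time
        self._request_resources = request_resources

    def tick(self) -> None:
        # A real implementation would compute allocations here; this
        # just forwards the observed resources to the request callback.
        self._request_resources(self._get_node_resources())


# In a test, fakes are passed in directly -- no patching of module
# attributes, and the component's dependencies are explicit.
requested: List[List[ResourceDict]] = []
coord = Coordinator(
    get_node_resources=lambda: [{"CPU": 4}],
    get_time=lambda: 0.0,
    request_resources=requested.append,
)
coord.tick()
print(requested)  # [[{'CPU': 4}]]
```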

Contributor Author


Makes sense! Do you think it's okay for me to do this in a follow-up PR? It seems like we need to do this for both DefaultAutoscalingCoordinator and _AutoscalingCoordinatorActor.

Member

@bveeramani bveeramani Dec 31, 2025


Yeah, I think that's totally fine as follow-up.

It seems like we need to do for both DefaultAutoscalingCoordinator and _AutoscalingCoordinatorActor?

I looked through the tests, and I think it's okay if we just add it to _AutoscalingCoordinatorActor for now. It seems like most of the core unit tests (test_basic and the newly added test_double_allocation_with_multiple_request_remaining) are against the _AutoscalingCoordinatorActor layer of abstraction, rather than DefaultAutoscalingCoordinator.

machichima and others added 4 commits December 30, 2025 20:55
        }
        for ongoing_req in remaining_resource_requesters:
            if any(v > 0 for v in divided_resource.values()):
                ongoing_req.allocated_resources.append(divided_resource)


Shared dictionary reference across multiple requesters

The divided_resource dictionary is created once per node resource (outside the inner loop) and then the same dictionary object is appended to all requesters' allocated_resources lists. This causes all requesters with request_remaining=True to share references to identical dictionary objects. While this works for read-only access and value comparisons, any modification to a shared dictionary would unintentionally affect all requesters. The dictionary creation should be moved inside the inner loop or a .copy() should be used when appending.
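One way to apply the suggested fix is to copy the dictionary per requester. A small sketch, not the PR's final code; `requesters` here stands in for each requester's `allocated_resources` list:

```python
# Sketch: give each requester its own dict, so a later mutation by one
# requester cannot silently affect the others.
divided_resource = {"CPU": 4, "GPU": 1}
requesters = [[], []]  # stand-ins for each requester's allocated_resources

for allocated in requesters:
    if any(v > 0 for v in divided_resource.values()):
        allocated.append(dict(divided_resource))  # shallow copy per requester

requesters[0][0]["CPU"] = 0           # mutating one requester's copy...
print(requesters[1][0]["CPU"])        # 4  ...leaves the other unchanged
```

A shallow copy suffices here because the values are plain numbers; nested structures would need `copy.deepcopy`.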


Member

@bveeramani bveeramani left a comment


LGTM!

@bveeramani bveeramani added the go add ONLY when ready to merge, run all tests label Dec 31, 2025
@bveeramani bveeramani merged commit 394cd2d into ray-project:master Dec 31, 2025
6 checks passed
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
… there are multiple datasets (ray-project#59740)

## Description

Prevent double-allocating remaining resources when there are multiple
datasets (multiple requests).

This PR divides the remaining resources by the number of remaining
resource requests to form the fair resource allocation.

## Related issues

Closes ray-project#59685


---------

Signed-off-by: machichima <nary12321@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
bveeramani added a commit that referenced this pull request Jan 13, 2026
…60037)

## Description

As mentioned in
#59740 (comment),
add explicit args in `_AutoscalingCoordinatorActor` constructor to
improve maintainability.

## Related issues

Follow-up: #59740

## Additional information
- Pass in mock function in testing as args rather than using `patch`

---------

Signed-off-by: machichima <nary12321@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
rushikeshadhav pushed a commit to rushikeshadhav/ray that referenced this pull request Jan 14, 2026
jeffery4011 pushed a commit to jeffery4011/ray that referenced this pull request Jan 20, 2026
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

community-contribution Contributed by the community data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] AutoscalingCoordinator double-allocates resources if there are multiple datasets

2 participants