[Data][Cherry-pick] Fix bug where AutoscalingCoordinator crashes if you request 0 GPUs on CPU-only cluster #59516

Merged
aslonnie merged 1 commit into releases/2.53.0 from cp-59514 on Dec 17, 2025

@bveeramani
Member

Cherry-pick of #59514

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
bveeramani requested a review from a team as a code owner on December 17, 2025 20:52
Contributor

gemini-code-assist (bot) left a comment

Code Review

The pull request fixes a KeyError in AutoscalingCoordinator that occurred when requesting a resource with a value of 0 on a cluster where that resource type is not present (e.g., requesting 0 GPUs on a CPU-only cluster). The fix in _maybe_subtract_resources correctly prevents this crash by checking for the key's existence before attempting subtraction. A regression test has been added to cover this scenario. My review includes a suggestion to make this new test more robust by explicitly mocking the cluster state to ensure it runs deterministically, regardless of the test environment's hardware.
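For readers following along, here is a minimal sketch of the failure mode and the guard described above. It assumes resource budgets are plain dicts keyed by resource name. The helper names `subtract_unguarded` and `subtract_guarded` are hypothetical and are not Ray's actual `_maybe_subtract_resources` implementation.

```python
from typing import Dict, Optional


def subtract_unguarded(
    available: Dict[str, float], requested: Dict[str, float]
) -> Dict[str, float]:
    # Hypothetical pre-fix behavior: index every requested key directly.
    remaining = dict(available)
    for name, amount in requested.items():
        # Raises KeyError when `name` (e.g. "GPU") is absent from the cluster,
        # even if the requested amount is 0.
        remaining[name] -= amount
    return remaining


def subtract_guarded(
    available: Dict[str, float], requested: Dict[str, float]
) -> Optional[Dict[str, float]]:
    # Hypothetical post-fix behavior: check that the key exists before subtracting.
    remaining = dict(available)
    for name, amount in requested.items():
        if name not in remaining:
            if amount > 0:
                # A positive request for a missing resource cannot be satisfied.
                return None
            # A zero request for a missing resource is trivially satisfied.
            continue
        if remaining[name] < amount:
            return None
        remaining[name] -= amount
    return remaining
```

With a CPU-only budget such as `{"CPU": 2}`, the guarded version accepts `{"CPU": 1, "GPU": 0}` and leaves one CPU remaining, while the unguarded version raises `KeyError` on the `"GPU"` key.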

Comment on lines +399 to +410
def test_coordinator_accepts_zero_resource_for_missing_resource_type(
    teardown_autoscaling_coordinator,
):
    # This is a regression test for a bug where the coordinator crashes when you request
    # a resource type (e.g., GPU: 0) that doesn't exist on the cluster.
    coordinator = DefaultAutoscalingCoordinator()

    coordinator.request_resources(
        requester_id="spam", resources=[{"CPU": 1, "GPU": 0}], expire_after_s=1
    )

    assert coordinator.get_allocated_resources("spam") == [{"CPU": 1, "GPU": 0}]

medium

This new regression test is a great addition. To make it more robust and not dependent on the hardware of the execution environment, I suggest mocking ray.nodes() to simulate a cluster without GPUs. This ensures the test deterministically covers the intended scenario where a requested resource type is absent from the cluster.

Suggested change
- def test_coordinator_accepts_zero_resource_for_missing_resource_type(
-     teardown_autoscaling_coordinator,
- ):
-     # This is a regression test for a bug where the coordinator crashes when you request
-     # a resource type (e.g., GPU: 0) that doesn't exist on the cluster.
-     coordinator = DefaultAutoscalingCoordinator()
-     coordinator.request_resources(
-         requester_id="spam", resources=[{"CPU": 1, "GPU": 0}], expire_after_s=1
-     )
-     assert coordinator.get_allocated_resources("spam") == [{"CPU": 1, "GPU": 0}]
+ def test_coordinator_accepts_zero_resource_for_missing_resource_type(
+     teardown_autoscaling_coordinator,
+ ):
+     # This is a regression test for a bug where the coordinator crashes when you request
+     # a resource type (e.g., GPU: 0) that doesn't exist on the cluster.
+     cluster_with_no_gpus = [
+         {
+             "Resources": {"CPU": 2},
+             "Alive": True,
+         }
+     ]
+     with patch("ray.nodes", return_value=cluster_with_no_gpus):
+         coordinator = DefaultAutoscalingCoordinator()
+         coordinator.request_resources(
+             requester_id="spam", resources=[{"CPU": 1, "GPU": 0}], expire_after_s=1
+         )
+         assert coordinator.get_allocated_resources("spam") == [{"CPU": 1, "GPU": 0}]
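
The mocking pattern in the suggestion can be tried without a Ray installation. The sketch below is only an illustration: `fake_ray` and `total_cluster_resources` are hypothetical stand-ins, and the node dicts use only the `"Resources"` and `"Alive"` fields that appear in the suggestion. It does not confirm that the coordinator reads `ray.nodes()` in the same process as the test.

```python
from types import SimpleNamespace
from unittest.mock import patch

# Hypothetical stand-in for the `ray` module, exposing only a `nodes()` callable.
fake_ray = SimpleNamespace(
    nodes=lambda: [{"Resources": {"CPU": 8, "GPU": 1}, "Alive": True}]
)


def total_cluster_resources(ray_module):
    # Hypothetical helper: sum the resources of all alive nodes.
    totals = {}
    for node in ray_module.nodes():
        if not node.get("Alive"):
            continue
        for name, amount in node.get("Resources", {}).items():
            totals[name] = totals.get(name, 0) + amount
    return totals


cluster_with_no_gpus = [{"Resources": {"CPU": 2}, "Alive": True}]

# Inside the `with` block, `nodes()` returns the simulated CPU-only cluster.
with patch.object(fake_ray, "nodes", return_value=cluster_with_no_gpus):
    patched_totals = total_cluster_resources(fake_ray)

# After the block exits, the original `nodes` callable is restored.
unpatched_totals = total_cluster_resources(fake_ray)
```

Inside the patch, the totals contain no `"GPU"` key at all, which is the condition the regression test is meant to exercise regardless of the hardware running the test.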

aslonnie merged commit 1736619 into releases/2.53.0 on Dec 17, 2025
4 of 5 checks passed
aslonnie deleted the cp-59514 branch on December 17, 2025 21:30
srinathk10 pushed a commit that referenced this pull request Dec 17, 2025
… you request 0 GPUs on CPU-only cluster (#59516)

Cherry-pick of #59514

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
weiquanlee pushed a commit to antgroup/ant-ray that referenced this pull request Jan 5, 2026
… you request 0 GPUs on CPU-only cluster (ray-project#59516)

Cherry-pick of ray-project#59514

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>