[Data] Fix bug where AutoscalingCoordinator crashes if you request 0 GPUs on CPU-only cluster#59514

Merged
bveeramani merged 1 commit into master from fix-autoscaling-coordinator on Dec 23, 2025

Conversation

@bveeramani
Member

If you request zero GPUs from the autoscaling coordinator but GPUs don't exist on the cluster, the autoscaling coordinator crashes.

This PR fixes that bug.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
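
The failure mode and the fix can be sketched as follows (the helper name, dict shapes, and values here are illustrative assumptions, not the actual Ray source):

```python
def maybe_subtract_resources(res1, res2):
    """Subtract requested resources (res2) from available resources (res1).

    The buggy version subtracted unconditionally, so a zero-valued request
    for a resource the cluster doesn't report (e.g. {"GPU": 0} on a
    CPU-only cluster) raised KeyError. The guard below skips such keys.
    """
    for key in res2:
        if key in res1:  # skip resource types the cluster doesn't have
            res1[key] -= res2[key]
    return res1

# CPU-only cluster: the available-resources dict has no "GPU" key.
available = {"CPU": 8.0}
requested = {"CPU": 1.0, "GPU": 0}

# Without the `key in res1` guard this raised KeyError: 'GPU'.
print(maybe_subtract_resources(available, requested))  # {'CPU': 7.0}
```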
@bveeramani bveeramani requested a review from a team as a code owner December 17, 2025 20:33
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes a crash in the AutoscalingCoordinator that occurs when requesting a resource with a value of 0 (like GPU: 0) on a cluster that doesn't have that resource type. The fix in _maybe_subtract_resources correctly prevents a KeyError by checking if the resource key exists before attempting subtraction. I've suggested a small simplification to this logic that leverages an existing invariant to make the code slightly cleaner and more efficient. The added regression test is well-written and correctly covers the bug scenario.

Comment on lines +355 to +356
if key in res1:
    res1[key] -= res2[key]
Contributor


Severity: medium

While this change correctly fixes the KeyError, the logic can be simplified. The any() check on line 352 establishes an invariant: for any resource key with a requested value res2[key] > 0, we know that res1.get(key, 0) >= res2[key], which implies key must be present in res1. For resources where res2[key] == 0, no subtraction is needed. Therefore, we can simplify the condition inside the loop to only act on positive resource requests.

Suggested change:

-if key in res1:
-    res1[key] -= res2[key]
+if res2[key] > 0:
+    res1[key] -= res2[key]
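
The equivalence the bot is arguing for can be checked with a small sketch, assuming a prior feasibility gate like the `any()` check it mentions (the function names and the gate's exact shape are assumptions, not Ray's actual code):

```python
def is_feasible(res1, res2):
    # Stand-in for the feasibility gate the bot refers to: every
    # positively requested resource must be available in sufficient
    # quantity before any subtraction happens.
    return all(res1.get(key, 0) >= val for key, val in res2.items() if val > 0)

def subtract_with_key_guard(res1, res2):
    # The fix as merged: skip keys missing from res1.
    out = dict(res1)
    for key in res2:
        if key in out:
            out[key] -= res2[key]
    return out

def subtract_positive_only(res1, res2):
    # The bot's suggested simplification: skip zero-valued requests.
    out = dict(res1)
    for key in res2:
        if res2[key] > 0:
            out[key] -= res2[key]
    return out

# On any request that passes the gate, the two variants agree:
available = {"CPU": 8.0}
requested = {"CPU": 1.0, "GPU": 0}
assert is_feasible(available, requested)
assert subtract_with_key_guard(available, requested) == \
    subtract_positive_only(available, requested) == {"CPU": 7.0}
```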

Contributor

@iamjustinhsu iamjustinhsu left a comment


Do we need this check anywhere else?

for key in res2:
-    res1[key] -= res2[key]
+    if key in res1:
+        res1[key] -= res2[key]
Contributor


res1 and res2 lol

@bveeramani
Member Author

Do we need this check anywhere else?

Not that I know of.

aslonnie pushed a commit that referenced this pull request Dec 17, 2025
… you request 0 GPUs on CPU-only cluster (#59516)

Cherry-pick of #59514

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
srinathk10 pushed a commit that referenced this pull request Dec 17, 2025
… you request 0 GPUs on CPU-only cluster (#59516)

Cherry-pick of #59514

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
@ray-gardener ray-gardener bot added the data Ray Data-related issues label Dec 18, 2025
@bveeramani bveeramani enabled auto-merge (squash) December 22, 2025 22:57
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Dec 22, 2025
@bveeramani bveeramani merged commit f906c39 into master Dec 23, 2025
7 checks passed
@bveeramani bveeramani deleted the fix-autoscaling-coordinator branch December 23, 2025 00:06
weiquanlee pushed a commit to antgroup/ant-ray that referenced this pull request Jan 5, 2026
… you request 0 GPUs on CPU-only cluster (ray-project#59516)

Cherry-pick of ray-project#59514

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
…0 GPUs on CPU-only cluster (ray-project#59514)

If you request zero GPUs from the autoscaling coordinator but GPUs don't
exist on the cluster, the autoscaling coordinator crashes.

This PR fixes that bug.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
…0 GPUs on CPU-only cluster (ray-project#59514)

If you request zero GPUs from the autoscaling coordinator but GPUs don't
exist on the cluster, the autoscaling coordinator crashes.

This PR fixes that bug.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…0 GPUs on CPU-only cluster (ray-project#59514)

If you request zero GPUs from the autoscaling coordinator but GPUs don't
exist on the cluster, the autoscaling coordinator crashes.

This PR fixes that bug.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: peterxcli <peterxcli@gmail.com>


3 participants