[Data] Fix bug where AutoscalingCoordinator crashes if you request 0 GPUs on CPU-only cluster#59514
Conversation
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
There was a problem hiding this comment.
Code Review
This pull request fixes a crash in the AutoscalingCoordinator that occurs when requesting a resource with a value of 0 (like GPU: 0) on a cluster that doesn't have that resource type. The fix in _maybe_subtract_resources correctly prevents a KeyError by checking if the resource key exists before attempting subtraction. I've suggested a small simplification to this logic that leverages an existing invariant to make the code slightly cleaner and more efficient. The added regression test is well-written and correctly covers the bug scenario.
| if key in res1: | ||
| res1[key] -= res2[key] |
There was a problem hiding this comment.
While this change correctly fixes the KeyError, the logic can be simplified. The any() check on line 352 establishes an invariant: for any resource key with a requested value res2[key] > 0, we know that res1.get(key, 0) >= res2[key], which implies key must be present in res1. For resources where res2[key] == 0, no subtraction is needed. Therefore, we can simplify the condition inside the loop to only act on positive resource requests.
| if key in res1: | |
| res1[key] -= res2[key] | |
| if res2[key] > 0: | |
| res1[key] -= res2[key] |
iamjustinhsu
left a comment
There was a problem hiding this comment.
Do we need this check anywhere else?
| for key in res2: | ||
| res1[key] -= res2[key] | ||
| if key in res1: | ||
| res1[key] -= res2[key] |
Not that I know of. |
… you request 0 GPUs on CPU-only cluster (ray-project#59516) Cherry-pick of ray-project#59514 Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…0 GPUs on CPU-only cluster (ray-project#59514) If you request zero GPUs from the autoscaling coordinator but GPUs don't exist on the cluster, the autoscaling coordinator crashes. This PR fixes that bug. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
…0 GPUs on CPU-only cluster (ray-project#59514) If you request zero GPUs from the autoscaling coordinator but GPUs don't exist on the cluster, the autoscaling coordinator crashes. This PR fixes that bug. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…0 GPUs on CPU-only cluster (ray-project#59514) If you request zero GPUs from the autoscaling coordinator but GPUs don't exist on the cluster, the autoscaling coordinator crashes. This PR fixes that bug. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: peterxcli <peterxcli@gmail.com>
If you request zero GPUs from the autoscaling coordinator but GPUs don't exist on the cluster, the autoscaling coordinator crashes.
This PR fixes that bug.