[Data][Cherry-pick] Fix bug where AutoscalingCoordinator crashes if you request 0 GPUs on CPU-only cluster #59516
Conversation
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Code Review
This pull request fixes a `KeyError` in `AutoscalingCoordinator` that occurred when requesting a resource with a value of 0 on a cluster where that resource type is not present (e.g., requesting 0 GPUs on a CPU-only cluster). The fix in `_maybe_subtract_resources` correctly prevents this crash by checking for the key's existence before attempting subtraction. A regression test has been added to cover this scenario. My review includes a suggestion to make the new test more robust by explicitly mocking the cluster state, so it runs deterministically regardless of the test environment's hardware.
```python
def test_coordinator_accepts_zero_resource_for_missing_resource_type(
    teardown_autoscaling_coordinator,
):
    # This is a regression test for a bug where the coordinator crashes when you request
    # a resource type (e.g., GPU: 0) that doesn't exist on the cluster.
    coordinator = DefaultAutoscalingCoordinator()

    coordinator.request_resources(
        requester_id="spam", resources=[{"CPU": 1, "GPU": 0}], expire_after_s=1
    )

    assert coordinator.get_allocated_resources("spam") == [{"CPU": 1, "GPU": 0}]
```
This new regression test is a great addition. To make it more robust and independent of the hardware of the execution environment, I suggest mocking `ray.nodes()` to simulate a cluster without GPUs. This ensures the test deterministically covers the intended scenario, where a requested resource type is absent from the cluster.
Suggested change:

```python
def test_coordinator_accepts_zero_resource_for_missing_resource_type(
    teardown_autoscaling_coordinator,
):
    # This is a regression test for a bug where the coordinator crashes when you request
    # a resource type (e.g., GPU: 0) that doesn't exist on the cluster.
    cluster_with_no_gpus = [
        {
            "Resources": {"CPU": 2},
            "Alive": True,
        }
    ]
    with patch("ray.nodes", return_value=cluster_with_no_gpus):
        coordinator = DefaultAutoscalingCoordinator()
        coordinator.request_resources(
            requester_id="spam", resources=[{"CPU": 1, "GPU": 0}], expire_after_s=1
        )
        assert coordinator.get_allocated_resources("spam") == [{"CPU": 1, "GPU": 0}]
```
… you request 0 GPUs on CPU-only cluster (ray-project#59516) Cherry-pick of ray-project#59514 Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>