[Data] DefaultClusterAutoscalerV2 raises KeyError: 'CPU' on nodes with 0 logical CPU resources #60166
Description
What happened + What you expected to happen
I ran into an issue when enabling DefaultClusterAutoscalerV2 in my KubeRay cluster.
We run a mixed cluster with CPU-only and GPU-only nodes so that generic CPU workloads do not trigger the provisioning of expensive GPU nodes. We are adopting DefaultClusterAutoscalerV2 to stop Ray Data from aggressively scaling up excess CPU nodes while a GPU task is running and later CPU tasks are still pending.
However, I noticed that DefaultClusterAutoscalerV2 assumes the "CPU" key is always present in the node resources. Since our GPU nodes are configured with 0 logical CPUs for isolation, the lookup raises KeyError: 'CPU' and the autoscaler crashes.
Related code:
ray/python/ray/data/_internal/cluster_autoscaler/default_cluster_autoscaler_v2.py
Line 77 in f182131
    cpu=r["CPU"], gpu=r.get("GPU", 0), mem=r["memory"]
I'm wondering if this is a bug or a misconfiguration on our end. If it's a bug, please let me know, and I'll be happy to open a PR for it.
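To illustrate the failure mode without a cluster, here is a minimal sketch in plain Python. It assumes that a node configured with 0 logical CPUs reports a resource dict with no "CPU" key at all, which is what the KeyError suggests. `NodeSpec`, `spec_strict`, `spec_tolerant`, and the sample resource values are hypothetical stand-ins for illustration, not Ray APIs; only the lookup expression mirrors the line quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical stand-in for the per-node resource record (not a Ray class)."""
    cpu: float
    gpu: float
    mem: float


def spec_strict(r: dict) -> NodeSpec:
    # Same lookup pattern as the quoted line: "CPU" is assumed to be present.
    return NodeSpec(cpu=r["CPU"], gpu=r.get("GPU", 0), mem=r["memory"])


def spec_tolerant(r: dict) -> NodeSpec:
    # One possible fix (an assumption, not the maintainers' patch):
    # treat a missing "CPU" key as 0, the same way "GPU" is already handled.
    return NodeSpec(cpu=r.get("CPU", 0), gpu=r.get("GPU", 0), mem=r["memory"])


# Assumed shape of a GPU-only node's resources: no "CPU" entry at all.
gpu_only_node = {"GPU": 8, "memory": 64 * 1024**3}

try:
    spec_strict(gpu_only_node)
except KeyError as e:
    print(f"strict lookup failed: KeyError {e}")

print(spec_tolerant(gpu_only_node))
```

If the missing key is indeed the cause, defaulting "CPU" to 0 would keep GPU-only nodes visible to the autoscaler while contributing no CPU capacity, which matches the isolation intent described above.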
Versions / Dependencies
ray=2.53.0
Reproduction script
None
Issue Severity
Medium: It is a significant difficulty but I can work around it.