[Data] DefaultClusterAutoscalerV2 raises KeyError: 'CPU' on nodes with 0 logical CPU resources #60166

@Wodswos

What happened + What you expected to happen

I encountered an issue when enabling the DefaultClusterAutoscalerV2 feature in my KubeRay cluster.

We run a mixed cluster with CPU-only and GPU-only nodes so that generic CPU workloads do not trigger the provisioning of expensive GPU nodes. We are adopting DefaultClusterAutoscalerV2 to prevent Ray Data from aggressively scaling up excess CPU nodes while a GPU task is running and later CPU tasks are still pending.

However, I noticed that DefaultClusterAutoscalerV2 assumes the "CPU" key is always present in a node's resources. Since our GPU nodes are configured with 0 logical CPUs for isolation, the key is missing on those nodes and the lookup raises KeyError: 'CPU'.

Related code:

cpu=r["CPU"], gpu=r.get("GPU", 0), mem=r["memory"]

I'm wondering if this is a bug or a misconfiguration on our end. If it's a bug, please let me know, and I'll be happy to open a PR for it.
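
To illustrate the failure and one possible defensive change, here is a minimal sketch. It assumes the node resources arrive as a plain dict that simply omits "CPU" when a node has 0 logical CPUs. `NodeSpec`, `to_node_spec_strict`, and `to_node_spec_lenient` are hypothetical names used only for illustration, not Ray's actual code, and the resource values are made up.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical stand-in for the autoscaler's per-node resource record."""
    cpu: float
    gpu: float
    mem: float


def to_node_spec_strict(r: Mapping[str, float]) -> NodeSpec:
    # Mirrors the lookup pattern quoted above: "CPU" is indexed directly,
    # so a node without that key raises KeyError.
    return NodeSpec(cpu=r["CPU"], gpu=r.get("GPU", 0), mem=r["memory"])


def to_node_spec_lenient(r: Mapping[str, float]) -> NodeSpec:
    # Treats a missing "CPU" key as zero logical CPUs, the same way a
    # missing "GPU" key is already treated as zero GPUs.
    return NodeSpec(cpu=r.get("CPU", 0), gpu=r.get("GPU", 0), mem=r["memory"])


# Assumed shape of a GPU-only node's resources (illustrative values).
gpu_only_node = {"GPU": 8, "memory": 64 * 1024**3}

try:
    to_node_spec_strict(gpu_only_node)
except KeyError as exc:
    print(f"strict lookup failed with KeyError: {exc}")

print(to_node_spec_lenient(gpu_only_node))
```

Whether "memory" should get the same treatment is left open here; the sketch only changes the "CPU" lookup, since that is the key our GPU nodes lack.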

Versions / Dependencies

ray=2.53.0

Reproduction script

None

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Metadata

Labels

bug (Something that is supposed to be working; but isn't), community-backlog, data (Ray Data-related issues), stability, triage (Needs triage, e.g. priority, bug/not-bug, and owning component)
