
[Data] DefaultAutoscalerV2 doesn't scale nodes from zero #59682

@bveeramani

Description

What happened + What you expected to happen

The new cluster autoscaler works by checking if the global logical utilization is high, and if so, adding one node of each type.
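The decision described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual DefaultAutoscalerV2 code; the function name and threshold are made up for clarity:

```python
# Illustrative sketch of the scale-up decision; names and the threshold
# are assumptions, not the real DefaultAutoscalerV2 implementation.
UTILIZATION_THRESHOLD = 0.8  # assumed value

def maybe_scale_up(logical_utilization: float, node_type_counts: dict) -> dict:
    """Return how many nodes to add per known node type."""
    if logical_utilization < UTILIZATION_THRESHOLD:
        return {}
    # Add one node of each *known* type. A type with zero alive nodes never
    # appears in node_type_counts, so it can never be scaled up -- which is
    # exactly the bug this issue describes.
    return {node_type: 1 for node_type in node_type_counts}
```

Note that `maybe_scale_up(0.9, {})` returns `{}`: with no alive workers there are no known types, so nothing is added.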

To determine the different types of nodes, the autoscaler calls ray.nodes():

"""Get the unique node resource specs and their count in the cluster."""
# Filter out the head node.
node_resources = [
node["Resources"]
for node in ray.nodes()
if node["Alive"] and "node:__internal_head__" not in node["Resources"]
]
nodes_resource_spec_count = defaultdict(int)
for r in node_resources:
node_resource_spec = _NodeResourceSpec.of(
cpu=r["CPU"], gpu=r.get("GPU", 0), mem=r["memory"]
)
nodes_resource_spec_count[node_resource_spec] += 1
return nodes_resource_spec_count
.

The problem with this approach is that it can only detect nodes that are currently alive. If the cluster starts with zero workers of a given type, that type never appears in ray.nodes(), so the autoscaler is never aware of it and can never scale it up. To address this, we can use the Ray Core API introduced in #49568.

Outcome:

DefaultAutoscalerV2 should try to scale up node types even if there are currently zero of them.

Constraints:

  • Use ray._private.state.state.get_cluster_config() (you will likely still need ray.nodes() as well, since get_cluster_config() doesn't give you node counts)
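The fix could look roughly like the sketch below: seed the counts with every configured node type so that types with zero alive nodes still appear (with count 0) and remain eligible for scale-up. The exact return shape of get_cluster_config() isn't shown in this issue, so the sketch assumes the node-type specs have already been extracted from it; the helper name is hypothetical:

```python
from collections import defaultdict

def merged_node_type_counts(configured_specs, alive_node_specs):
    """Count alive nodes per resource spec, but keep zero-count entries.

    configured_specs: node resource specs known from the cluster config
        (assumed to be pre-extracted from get_cluster_config()).
    alive_node_specs: one resource spec per alive worker node, as derived
        from ray.nodes().
    """
    counts = defaultdict(int)
    # Seed every configured type with 0 so the autoscaler can still
    # propose a scale-up from zero.
    for spec in configured_specs:
        counts[spec] = 0
    for spec in alive_node_specs:
        counts[spec] += 1
    return dict(counts)
```

With this merge, a GPU worker type that currently has zero alive nodes would show up with a count of 0 instead of being invisible to the autoscaler.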

Versions / Dependencies

6dc66d4

Reproduction script

TODO

Issue Severity

None

Metadata

Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working, but isn't
data: Ray Data-related issues
