Closed
Labels
- P1: Issue that should be fixed within a few weeks
- bug: Something that is supposed to be working; but isn't
- data: Ray Data-related issues
Description
What happened + What you expected to happen
The new cluster autoscaler works by checking if the global logical utilization is high, and if so, adding one node of each type.
To determine the different types of nodes, the autoscaler calls ray.nodes():
ray/python/ray/data/_internal/cluster_autoscaler/default_cluster_autoscaler_v2.py
Lines 124 to 140 in 6dc66d4
    """Get the unique node resource specs and their count in the cluster."""
    # Filter out the head node.
    node_resources = [
        node["Resources"]
        for node in ray.nodes()
        if node["Alive"] and "node:__internal_head__" not in node["Resources"]
    ]
    nodes_resource_spec_count = defaultdict(int)
    for r in node_resources:
        node_resource_spec = _NodeResourceSpec.of(
            cpu=r["CPU"], gpu=r.get("GPU", 0), mem=r["memory"]
        )
        nodes_resource_spec_count[node_resource_spec] += 1
    return nodes_resource_spec_count
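The problem with this approach is that ray.nodes() only reports nodes that already exist, so the autoscaler can only detect node types that have at least one alive node. If the cluster starts with zero worker nodes of a given type, the autoscaler is never aware of that type and can't scale it up. To address this issue, we can use the Ray Core API introduced in #49568.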
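To make the failure mode concrete, here is a minimal sketch that mimics the detection logic above without a running cluster. The node dictionaries imitate the fields the snippet reads from ray.nodes(); `detect_spec_counts` and `scale_up_request` are hypothetical helper names, and "add one node of each type" is modeled here as "request count + 1", which is an assumption about the real behavior.

```python
from collections import defaultdict


def detect_spec_counts(nodes):
    """Count alive, non-head nodes per (cpu, gpu, mem) spec, as the snippet does."""
    counts = defaultdict(int)
    for node in nodes:
        res = node["Resources"]
        if node["Alive"] and "node:__internal_head__" not in res:
            counts[(res["CPU"], res.get("GPU", 0), res["memory"])] += 1
    return counts


def scale_up_request(counts):
    """Hypothetical scale-up step: ask for one more node of every detected type."""
    return {spec: n + 1 for spec, n in counts.items()}


# A cluster that has only a head node and zero workers (made-up values).
head_only = [
    {
        "Alive": True,
        "Resources": {"CPU": 4, "memory": 16 * 1024**3, "node:__internal_head__": 1.0},
    }
]

request = scale_up_request(detect_spec_counts(head_only))
print(request)  # {} -> no worker type is known, so nothing can be requested
```

Because the set of node types is derived purely from alive nodes, an empty worker pool yields an empty request no matter how high utilization is.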
Outcome:
DefaultAutoscalerV2 should try to scale up node types even if there are currently zero of them.
Constraints:
- Use ray._private.state.state.get_cluster_config() (you might also need to still use ray.nodes(), since get_cluster_config() doesn't give you counts)
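One possible shape for the fix is sketched below: seed the count map with every configured node type at zero, then add counts from the alive nodes. This issue does not show what get_cluster_config() returns, so the sketch takes the configured types as plain resource dictionaries and leaves the extraction step out. `NodeSpec`, `spec_from_resources`, and `merge_configured_and_alive` are hypothetical names, and the sketch assumes that a configured type and a live node of that type report identical CPU, GPU, and memory values, which may not hold in practice.

```python
from typing import Dict, Iterable, Mapping, NamedTuple


class NodeSpec(NamedTuple):
    """Hypothetical stand-in for the autoscaler's _NodeResourceSpec."""

    cpu: float
    gpu: float
    mem: float


def spec_from_resources(resources: Mapping[str, float]) -> NodeSpec:
    """Build a spec from a resource mapping, treating missing keys as zero."""
    return NodeSpec(
        cpu=resources.get("CPU", 0),
        gpu=resources.get("GPU", 0),
        mem=resources.get("memory", 0),
    )


def merge_configured_and_alive(
    configured_types: Iterable[Mapping[str, float]],
    alive_nodes: Iterable[Mapping],
) -> Dict[NodeSpec, int]:
    """Return a count per node type, including configured types with zero nodes."""
    # Start every configured type at zero so it is visible to the autoscaler.
    counts = {spec_from_resources(r): 0 for r in configured_types}
    for node in alive_nodes:
        res = node["Resources"]
        # Same filter as the existing code: skip dead nodes and the head node.
        if not node["Alive"] or "node:__internal_head__" in res:
            continue
        spec = spec_from_resources(res)
        counts[spec] = counts.get(spec, 0) + 1
    return counts


# Made-up inputs: two configured worker types, one alive CPU worker.
configured = [
    {"CPU": 8, "memory": 32 * 1024**3},
    {"CPU": 4, "GPU": 1, "memory": 16 * 1024**3},
]
alive = [
    {"Alive": True, "Resources": {"CPU": 4, "memory": 8 * 1024**3, "node:__internal_head__": 1.0}},
    {"Alive": True, "Resources": {"CPU": 8, "memory": 32 * 1024**3}},
]
merged = merge_configured_and_alive(configured, alive)
```

With these inputs the GPU type appears in the result with a count of zero, so a scale-up step that iterates over the map can still request it.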
Versions / Dependencies
Reproduction script
TODO
Issue Severity
None