
[Autoscaler][V2] Autoscaler fails to delete idle KubeRay Pod #52264

@ryanaoleary


What happened + What you expected to happen

We've seen an issue where the v2 autoscaler loops and fails to delete a Pod with:

```
ERROR autoscaler.py:200 --
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/autoscaler.py", line 185, in update_autoscaling_state
    return Reconciler.reconcile(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 119, in reconcile
    Reconciler._step_next(
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 286, in _step_next
    Reconciler._terminate_instances(instance_manager=instance_manager)
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1203, in _terminate_instances
    Reconciler._update_instance_manager(instance_manager, version, updates)
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 615, in _update_instance_manager
    reply = instance_manager.update_instance_manager_state(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 126, in update_instance_manager_state
    subscriber.notify(request.updates)
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/subscribers/cloud_instance_updater.py", line 44, in notify
    self._terminate_instances(new_terminations)
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/subscribers/cloud_instance_updater.py", line 61, in _terminate_instances
    self._cloud_provider.terminate(
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py", line 123, in terminate
    scale_request = self._initialize_scale_request(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py", line 253, in _initialize_scale_request
    assert num_workers_dict[to_delete_instance.node_type] >= 0
```

I believe this is a result of #48909, which changed how the value of `num_workers_dict` is calculated, from:

```python
num_workers_dict[cur_instance.node_type] += 1
```

to:

```python
num_workers_dict[node_type] = max(
    worker_group["replicas"], worker_group["minReplicas"]
)
```

There appears to be a race condition: if KubeRay has already decremented `replicas` but the instance is still included in the set of workers to delete, then from the Ray autoscaler's side it appears as though we're trying to delete more Pods of a node type than currently exist, and the assertion fails.
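The failure mode above can be sketched as follows. This is an illustrative reconstruction, not the actual autoscaler code: the function and field names (`initialize_scale_request`, `worker_groups`, `to_delete`) are hypothetical stand-ins for the logic in `_initialize_scale_request`.

```python
# Sketch of the race (illustrative, not the real KubeRay cloud provider code).
# Post-#48909, the count is seeded from the CR spec instead of counting
# live instances, so it can go negative when `replicas` was already shrunk.

def initialize_scale_request(worker_groups, to_delete):
    num_workers_dict = {}
    for group in worker_groups:
        # Seed from the worker group spec (the post-#48909 behavior).
        num_workers_dict[group["name"]] = max(
            group["replicas"], group["minReplicas"]
        )
    for instance in to_delete:
        num_workers_dict[instance["node_type"]] -= 1
        # Fails if KubeRay already decremented `replicas` while this
        # instance is still pending deletion.
        assert num_workers_dict[instance["node_type"]] >= 0
    return num_workers_dict

# KubeRay has already scaled replicas 3 -> 1, but two Pods are still in
# the autoscaler's delete set:
worker_groups = [{"name": "small-group", "replicas": 1, "minReplicas": 0}]
to_delete = [{"node_type": "small-group"}, {"node_type": "small-group"}]

try:
    initialize_scale_request(worker_groups, to_delete)
except AssertionError:
    print("AssertionError: num_workers_dict went negative")
```

With `replicas` still at its pre-scale-down value of 3, the same two deletions would leave the count at 1 and the assertion would pass; the window where KubeRay and the autoscaler disagree is what triggers the crash loop.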

Versions / Dependencies

Ray Version: 2.44.0
KubeRay version: 1.3.0

Reproduction script

N/A since this is a race condition. One way to reproduce is to create an autoscaling RayCluster and repeatedly let the idle timeout trigger scale-downs until the error appears.

Issue Severity

High: It blocks me from completing my task.

Labels

bug (Something that is supposed to be working; but isn't), community-backlog, core-autoscaler (autoscaler related issues)
