
[Autoscaler][V2] Autoscaler fails to delete idle KubeRay Pod #52264

@ryanaoleary


What happened + What you expected to happen

We've seen an issue where the v2 autoscaler loops and fails to delete a Pod with:

```
ERROR autoscaler.py:200 --
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/autoscaler.py", line 185, in update_autoscaling_state
    return Reconciler.reconcile(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 119, in reconcile
    Reconciler._step_next(
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 286, in _step_next
    Reconciler._terminate_instances(instance_manager=instance_manager)
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1203, in _terminate_instances
    Reconciler._update_instance_manager(instance_manager, version, updates)
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 615, in _update_instance_manager
    reply = instance_manager.update_instance_manager_state(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 126, in update_instance_manager_state
    subscriber.notify(request.updates)
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/subscribers/cloud_instance_updater.py", line 44, in notify
    self._terminate_instances(new_terminations)
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/subscribers/cloud_instance_updater.py", line 61, in _terminate_instances
    self._cloud_provider.terminate(
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py", line 123, in terminate
    scale_request = self._initialize_scale_request(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py", line 253, in _initialize_scale_request
    assert num_workers_dict[to_delete_instance.node_type] >= 0
```

I believe this is a result of #48909, which changed how the value of `num_workers_dict` is calculated, from:

```python
num_workers_dict[cur_instance.node_type] += 1
```

to:

```python
num_workers_dict[node_type] = max(
    worker_group["replicas"], worker_group["minReplicas"]
)
```

There appears to be a race condition: if KubeRay has already decremented `replicas` but the instance is still included in the set of workers to delete, then from the Ray autoscaler's side it appears as though we're trying to delete more Pods of a node type than currently exist, and the assertion fails.
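The failure mode above can be sketched as follows. This is an illustrative reconstruction, not the actual autoscaler code: the function and field names (`initialize_scale_request`, `worker_groups`, `to_delete`) are hypothetical stand-ins for the logic in `_initialize_scale_request`.

```python
# Sketch of the race (illustrative, not the real KubeRay cloud provider code).
# Post-#48909, the count is seeded from the CR spec instead of counting
# live instances, so it can go negative when `replicas` was already shrunk.

def initialize_scale_request(worker_groups, to_delete):
    num_workers_dict = {}
    for group in worker_groups:
        # Seed from the worker group spec (the post-#48909 behavior).
        num_workers_dict[group["name"]] = max(
            group["replicas"], group["minReplicas"]
        )
    for instance in to_delete:
        num_workers_dict[instance["node_type"]] -= 1
        # Fails if KubeRay already decremented `replicas` while this
        # instance is still pending deletion.
        assert num_workers_dict[instance["node_type"]] >= 0
    return num_workers_dict

# KubeRay has already scaled replicas 3 -> 1, but two Pods are still in
# the autoscaler's delete set:
worker_groups = [{"name": "small-group", "replicas": 1, "minReplicas": 0}]
to_delete = [{"node_type": "small-group"}, {"node_type": "small-group"}]

try:
    initialize_scale_request(worker_groups, to_delete)
except AssertionError:
    print("AssertionError: num_workers_dict went negative")
```

With `replicas` still at its pre-scale-down value of 3, the same two deletions would leave the count at 1 and the assertion would pass; the window where KubeRay and the autoscaler disagree is what triggers the crash loop.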

Versions / Dependencies

Ray Version: 2.44.0
KubeRay version: 1.3.0

Reproduction script

N/A since this is a race condition. One way to reproduce is to create an autoscaling RayCluster and repeatedly let the idle timeout trigger scale-downs until the error appears.

Issue Severity

High: It blocks me from completing my task.

Labels

bug (Something that is supposed to be working; but isn't), community-backlog, core-autoscaler (autoscaler related issues)
