
[KubeRay] AssertionError when trying to scale down #56985

@atombender

What happened + What you expected to happen

Using KubeRay on GKE, I ran a Ray job that caused the autoscaler to scale up to 13 pods. The job finished hours ago, and the dashboard shows the worker nodes as dead, but the Kubernetes pods are still around.

The autoscaler log is full of failures like these, which repeat over and over:

```
2025-09-27 00:28:33,541	INFO instance_manager.py:263 -- Update instance TERMINATING->TERMINATION_FAILED (id=95047a8c-c9e2-48dc-bca3-dd42f6a3a868, type=worker-group, cloud_instance_id=raycluster-worker-group-worker-jthdk, ray_id=d9cd91bc4b3a7bef921961eb9e690c6e9d5118dd9cda640e72725163): timeout=300s at status TERMINATING
2025-09-27 00:28:33,541	INFO instance_manager.py:263 -- Update instance TERMINATING->TERMINATION_FAILED (id=ce8e1898-6feb-448d-bfbd-10d158e88e8f, type=worker-group, cloud_instance_id=raycluster-worker-group-worker-xj75v, ray_id=6ade8bba181789ac7d181eced69172e61db9dbb1f83645e15e2db7cb): timeout=300s at status TERMINATING
2025-09-27 00:28:33,544	INFO scheduler.py:1702 -- Node 470b981d2c73c53b0909e0e5f9882db70bafa4802bfa9bbd675a0e5f (idle for 26362.392 secs) belongs to node_type worker-group and is required by min_worker_nodes, skipping idle termination.
2025-09-27 00:28:33,547	INFO instance_manager.py:263 -- Update instance TERMINATION_FAILED->TERMINATING (id=95047a8c-c9e2-48dc-bca3-dd42f6a3a868, type=worker-group, cloud_instance_id=raycluster-worker-group-worker-jthdk, ray_id=d9cd91bc4b3a7bef921961eb9e690c6e9d5118dd9cda640e72725163): terminating instance from TERMINATION_FAILED
2025-09-27 00:28:33,547	INFO instance_manager.py:263 -- Update instance TERMINATION_FAILED->TERMINATING (id=ce8e1898-6feb-448d-bfbd-10d158e88e8f, type=worker-group, cloud_instance_id=raycluster-worker-group-worker-xj75v, ray_id=6ade8bba181789ac7d181eced69172e61db9dbb1f83645e15e2db7cb): terminating instance from TERMINATION_FAILED
2025-09-27 00:28:33,548	INFO cloud_provider.py:121 -- Terminating worker pods: ['raycluster-worker-group-worker-jthdk', 'raycluster-worker-group-worker-xj75v']
2025-09-27 00:28:33,607	INFO cloud_provider.py:466 -- Listing pods for RayCluster raycluster in namespace drift at pods resource version >= 1758926799580431009.
2025-09-27 00:28:33,644	INFO cloud_provider.py:484 -- Fetched pod data at resource version 1758958113463330000.
```
```
2025-09-27 00:28:33,646	ERROR autoscaler.py:215 --
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
    return Reconciler.reconcile(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 122, in reconcile
    Reconciler._step_next(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 289, in _step_next
    Reconciler._terminate_instances(instance_manager=instance_manager)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1228, in _terminate_instances
    Reconciler._update_instance_manager(instance_manager, version, updates)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 622, in _update_instance_manager
    reply = instance_manager.update_instance_manager_state(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 126, in update_instance_manager_state
    subscriber.notify(request.updates)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/subscribers/cloud_instance_updater.py", line 44, in notify
    self._terminate_instances(new_terminations)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/subscribers/cloud_instance_updater.py", line 61, in _terminate_instances
    self._cloud_provider.terminate(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py", line 123, in terminate
    scale_request = self._initialize_scale_request(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py", line 257, in _initialize_scale_request
    assert num_workers_dict[to_delete_instance.node_type] >= 0
AssertionError
2025-09-27 00:28:33,649	WARNING monitor.py:178 -- No autoscaling state to report.
```
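To make the failing invariant concrete, here is a minimal sketch of the kind of bookkeeping the assertion at `cloud_provider.py:257` guards. The function name, argument shapes, and counting logic below are illustrative guesses, not Ray's actual implementation: assume the scale request subtracts one from a per-group worker count for every instance marked for deletion. If the same group is charged for more deletions than it has live workers (for example, stale retries after the `TERMINATING -> TERMINATION_FAILED -> TERMINATING` loop seen in the log), the count goes negative and the assertion fires.

```python
def initialize_scale_request(num_workers_by_group, to_delete):
    """Hypothetical sketch: decrement per-group worker counts for each
    instance scheduled for deletion, mirroring the guarded invariant."""
    counts = dict(num_workers_by_group)
    for instance in to_delete:
        group = instance["node_type"]
        counts[group] = counts.get(group, 0) - 1
        # Mirrors `assert num_workers_dict[to_delete_instance.node_type] >= 0`
        # from the traceback above.
        assert counts[group] >= 0, f"negative replica count for {group}"
    return counts


# One live worker, but two termination entries for the same group -- e.g.
# the same pod re-queued for deletion after a failed terminate attempt:
try:
    initialize_scale_request(
        {"worker-group": 1},
        [{"node_type": "worker-group"}, {"node_type": "worker-group"}],
    )
except AssertionError as exc:
    print(f"AssertionError: {exc}")
```

Under this (assumed) model, the fix would be to skip or clamp deletions for instances that no longer count toward the group, rather than asserting.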

Versions / Dependencies

Ray 2.49.2 (`docker.io/rayproject/ray:2.49.2-py310`).

I'm using `RAY_ENABLE_AUTOSCALER_V2=1`.

Reproduction script

N/A

Issue Severity

High: It blocks me from completing my task.

Labels

- P0 (issues that should be fixed in short order)
- bug (something that is supposed to be working, but isn't)
- community-backlog
- core (issues that should be addressed in Ray Core)
- stability
