[KubeRay] AssertionError when trying to scale down #56985
Closed
Labels: P0 (issues that should be fixed in short order), bug (something that is supposed to be working, but isn't), community-backlog, core (issues that should be addressed in Ray Core), stability
Description
What happened + What you expected to happen
Using KubeRay on GKE, I just ran a Ray job that caused the autoscaler to scale up to 13 pods. The job finished hours ago, and the dashboard shows the worker nodes as dead, but the Kubernetes pods are still around.
The autoscaler log is full of failures like the following, repeating over and over:
```
2025-09-27 00:28:33,541 INFO instance_manager.py:263 -- Update instance TERMINATING->TERMINATION_FAILED (id=95047a8c-c9e2-48dc-bca3-dd42f6a3a868, type=worker-group, cloud_instance_id=raycluster-worker-group-worker-jthdk, ray_id=d9cd91bc4b3a7bef921961eb9e690c6e9d5118dd9cda640e72725163): timeout=300s at status TERMINATING
2025-09-27 00:28:33,541 INFO instance_manager.py:263 -- Update instance TERMINATING->TERMINATION_FAILED (id=ce8e1898-6feb-448d-bfbd-10d158e88e8f, type=worker-group, cloud_instance_id=raycluster-worker-group-worker-xj75v, ray_id=6ade8bba181789ac7d181eced69172e61db9dbb1f83645e15e2db7cb): timeout=300s at status TERMINATING
2025-09-27 00:28:33,544 INFO scheduler.py:1702 -- Node 470b981d2c73c53b0909e0e5f9882db70bafa4802bfa9bbd675a0e5f (idle for 26362.392 secs) belongs to node_type worker-group and is required by min_worker_nodes, skipping idle termination.
2025-09-27 00:28:33,547 INFO instance_manager.py:263 -- Update instance TERMINATION_FAILED->TERMINATING (id=95047a8c-c9e2-48dc-bca3-dd42f6a3a868, type=worker-group, cloud_instance_id=raycluster-worker-group-worker-jthdk, ray_id=d9cd91bc4b3a7bef921961eb9e690c6e9d5118dd9cda640e72725163): terminating instance from TERMINATION_FAILED
2025-09-27 00:28:33,547 INFO instance_manager.py:263 -- Update instance TERMINATION_FAILED->TERMINATING (id=ce8e1898-6feb-448d-bfbd-10d158e88e8f, type=worker-group, cloud_instance_id=raycluster-worker-group-worker-xj75v, ray_id=6ade8bba181789ac7d181eced69172e61db9dbb1f83645e15e2db7cb): terminating instance from TERMINATION_FAILED
2025-09-27 00:28:33,548 INFO cloud_provider.py:121 -- Terminating worker pods: ['raycluster-worker-group-worker-jthdk', 'raycluster-worker-group-worker-xj75v']
2025-09-27 00:28:33,607 INFO cloud_provider.py:466 -- Listing pods for RayCluster raycluster in namespace drift at pods resource version >= 1758926799580431009.
2025-09-27 00:28:33,644 INFO cloud_provider.py:484 -- Fetched pod data at resource version 1758958113463330000.
2025-09-27 00:28:33,646 ERROR autoscaler.py:215 --
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
    return Reconciler.reconcile(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 122, in reconcile
    Reconciler._step_next(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 289, in _step_next
    Reconciler._terminate_instances(instance_manager=instance_manager)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1228, in _terminate_instances
    Reconciler._update_instance_manager(instance_manager, version, updates)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 622, in _update_instance_manager
    reply = instance_manager.update_instance_manager_state(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 126, in update_instance_manager_state
    subscriber.notify(request.updates)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/subscribers/cloud_instance_updater.py", line 44, in notify
    self._terminate_instances(new_terminations)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/subscribers/cloud_instance_updater.py", line 61, in _terminate_instances
    self._cloud_provider.terminate(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py", line 123, in terminate
    scale_request = self._initialize_scale_request(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py", line 257, in _initialize_scale_request
    assert num_workers_dict[to_delete_instance.node_type] >= 0
AssertionError
2025-09-27 00:28:33,649 WARNING monitor.py:178 -- No autoscaling state to report.
```

(Each log line was emitted twice by two logging handlers, and the traceback was duplicated; the excerpt above keeps one copy of each.)
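The failing assertion guards the invariant that the per-node-type replica count never goes negative while building the scale-down request. The sketch below is a hypothetical minimal model (not the actual Ray code) of how that invariant could be violated: if instances stuck in TERMINATION_FAILED are retried while the group's worker count has already been decremented to zero, the next decrement drives the count below zero and the assertion fires.

```python
# Hypothetical model of _initialize_scale_request's bookkeeping.
# `num_workers` maps node type -> current replica count; `to_delete`
# lists the node type of each instance selected for termination.
def initialize_scale_request(num_workers, to_delete):
    counts = dict(num_workers)
    for node_type in to_delete:
        # Decrement the replica count for each instance to delete.
        counts[node_type] -= 1
        # Mirrors `assert num_workers_dict[to_delete_instance.node_type] >= 0`:
        # retrying a termination for a group already scaled to zero trips this.
        assert counts[node_type] >= 0
    return counts

# A normal scale-down succeeds:
initialize_scale_request({"worker-group": 2}, ["worker-group"])

# But re-terminating more instances than the count reflects raises
# AssertionError, matching the loop seen in the log above:
try:
    initialize_scale_request({"worker-group": 1},
                             ["worker-group", "worker-group"])
except AssertionError:
    print("AssertionError reproduced")
```

Under this model the TERMINATING -> TERMINATION_FAILED -> TERMINATING cycle in the log would keep re-submitting the same two pods, hitting the assertion every reconcile pass, so no scale-down request ever reaches KubeRay and the pods are never deleted.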
Versions / Dependencies
Ray 2.49.2 (docker.io/rayproject/ray:2.49.2-py310).
I’m using RAY_ENABLE_AUTOSCALER_V2=1.
Reproduction script
N/A
Issue Severity
High: It blocks me from completing my task.