[core] EC2 Autoscaler race condition #51861
Closed
Labels: P2 (Important issue, but not time-critical), bug (Something that is supposed to be working; but isn't), community-backlog, core (Issues that should be addressed in Ray Core), core-autoscaler (autoscaler related issues)
Description
What happened + What you expected to happen
The Ray EC2 autoscaler can get into a state where it 'loses' track of an instance even though that instance is still present in AWS. The resulting exception causes the autoscaler to get stuck.
Here's the exception:
2025-03-31 11:50:57,505 ERROR autoscaler.py:380 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/autoscaler.py", line 377, in update
    self._update()
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/autoscaler.py", line 439, in _update
    self.process_completed_updates()
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/autoscaler.py", line 775, in process_completed_updates
    self.load_metrics.mark_active(self.provider.internal_ip(node_id))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 195, in internal_ip
    node = self._get_cached_node(node_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 606, in _get_cached_node
    return self._get_node(node_id)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 598, in _get_node
    assert len(matches) == 1, "Invalid instance id {}".format(node_id)
           ^^^^^^^^^^^^^^^^^
AssertionError: Invalid instance id i-04074e6d2eba88b27
Versions / Dependencies
Ray version: 2.40.0
Reproduction script
Run a Ray job with hundreds/thousands of tasks and let the cluster scale up quickly to hundreds or thousands of nodes. We observe this happening with clusters as small as ~150 nodes.
Issue Severity
High: It blocks me from completing my task.