[core] EC2 Autoscaler race condition #51861

@akazakov

Description

What happened + What you expected to happen

The Ray EC2 autoscaler can get into a state where it loses track of an instance even though that instance still exists in AWS. The subsequent lookup raises an exception, which causes the autoscaler to get stuck.

Here's the exception:

2025-03-31 11:50:57,505 ERROR autoscaler.py:380 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/autoscaler.py", line 377, in update
    self._update()
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/autoscaler.py", line 439, in _update
    self.process_completed_updates()
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/autoscaler.py", line 775, in process_completed_updates
    self.load_metrics.mark_active(self.provider.internal_ip(node_id))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 195, in internal_ip
    node = self._get_cached_node(node_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 606, in _get_cached_node
    return self._get_node(node_id)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 598, in _get_node
    assert len(matches) == 1, "Invalid instance id {}".format(node_id)
           ^^^^^^^^^^^^^^^^^
AssertionError: Invalid instance id i-04074e6d2eba88b27
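To make the failure mode concrete, here is a minimal, self-contained sketch of the assertion path. The assert line is the one from node_provider.py in the traceback; the surrounding class and names are illustrative stand-ins, not Ray's actual code:

```python
class StubProvider:
    """Illustrative stand-in for the AWS node provider's cached lookup."""

    def __init__(self, live_instances):
        # In the real provider this cache is refreshed from EC2;
        # here it is just a dict of instance-id -> instance data.
        self.cached_nodes = dict(live_instances)

    def _get_node(self, node_id):
        # If the instance is missing from the provider's view,
        # `matches` is empty and the assert below fires.
        matches = [i for i in self.cached_nodes if i == node_id]
        assert len(matches) == 1, "Invalid instance id {}".format(node_id)
        return self.cached_nodes[node_id]


provider = StubProvider({"i-aaa": {}, "i-bbb": {}})

# Race: the autoscaler still holds an instance id from an earlier update
# loop, but the provider's cache no longer includes it.
try:
    provider._get_node("i-04074e6d2eba88b27")
except AssertionError as e:
    print(e)  # Invalid instance id i-04074e6d2eba88b27
```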

Versions / Dependencies

Ray version: 2.40.0

Reproduction script

Run a Ray job with hundreds or thousands of tasks and let the cluster scale up quickly to hundreds or thousands of nodes. We have observed this happening with clusters as small as ~150 nodes.
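One way to keep the update loop alive despite the race is to treat a missing cached node as already gone. This is a hypothetical mitigation sketch, not a shipped Ray fix: `mark_active` and `internal_ip` are the calls from the traceback above, while `mark_active_safely` is an invented wrapper name.

```python
def mark_active_safely(load_metrics, provider, node_id, log=print):
    """Wrap the call from process_completed_updates() so one stale
    instance id cannot wedge the whole autoscaler update loop."""
    try:
        load_metrics.mark_active(provider.internal_ip(node_id))
    except AssertionError:
        # The instance vanished between provider refreshes; skip it
        # and let the next provider refresh reconcile the state.
        log("Skipping stale node {}".format(node_id))
```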

Issue Severity

High: It blocks me from completing my task.

Metadata

Labels

- P2: Important issue, but not time-critical
- bug: Something that is supposed to be working; but isn't
- community-backlog
- core: Issues that should be addressed in Ray Core
- core-autoscaler: autoscaler related issues
