
[core][autoscaler] let AWS node provider retry instance retrieval to handle eventual consistency#52200

Merged
jjyao merged 6 commits into ray-project:master from rueian:retry-aws-node-provider
Apr 11, 2025

Conversation

@rueian
Contributor

@rueian rueian commented Apr 10, 2025

Why are these changes needed?

Fixes #51861. The reported failure comes from this assertion:

matches = list(self.ec2.instances.filter(InstanceIds=[node_id]))
assert len(matches) == 1, "Invalid instance id {}".format(node_id)

The assertion can only fail with len(matches) == 0, since the AWS API never returns duplicate instances.

As documented here, the AWS EC2 API follows an eventual consistency model: an instance may not yet be visible to ec2.instances.filter(InstanceIds=[node_id]) immediately after it is created, so we need to retry for a few seconds. This PR reuses the BOTO_MAX_RETRIES environment variable (12 by default) to bound the number of retries, which means waiting up to 12 seconds at one retry per second.
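The retry approach described above can be sketched roughly as follows (a minimal illustration; get_instance_with_retry is a hypothetical name, and the actual PR logs through Ray's cli_logger rather than staying silent between attempts):

```python
import os
import time


def get_instance_with_retry(ec2, node_id):
    """Look up an EC2 instance by ID, retrying to tolerate the eventual
    consistency of the EC2 API: a freshly created instance may not be
    visible to describe calls for a few seconds."""
    max_retries = int(os.environ.get("BOTO_MAX_RETRIES", "12"))
    for attempt in range(1, max_retries + 1):
        matches = list(ec2.instances.filter(InstanceIds=[node_id]))
        if len(matches) == 1:
            return matches[0]
        if attempt < max_retries:
            # Wait one second between attempts, up to max_retries seconds total.
            time.sleep(1)
    raise AssertionError("Invalid instance id {}".format(node_id))
```

With this shape, the original assertion only fires after the lookup has failed for the full retry window, instead of on the very first inconsistent read.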

Related issue number

Closes #51861

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…handle eventual consistency

Signed-off-by: Rueian <rueiancsie@gmail.com>
@rueian rueian force-pushed the retry-aws-node-provider branch from 7ccdd89 to 62c9a66 on April 10, 2025 05:02
@rueian rueian marked this pull request as ready for review April 10, 2025 16:26
@rueian rueian requested a review from a team as a code owner April 10, 2025 16:26
@rueian
Contributor Author

rueian commented Apr 10, 2025

Hi @kevin85421, would you mind reviewing this? It adds retries to mitigate the eventual consistency of the AWS EC2 API. cc @jjyao

Signed-off-by: Rueian <rueiancsie@gmail.com>
@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Apr 10, 2025
matches = list(self.ec2.instances.filter(InstanceIds=[node_id]))
if len(matches) == 1:
return matches[0]
cli_logger.warning(
Member


Update the comment to make it more actionable for users. For example,

"Attempt to fetch EC2 instance associated with instance ID xxxxx. Get yy matched EC2 instances. Will retry after 1 second ..."

Contributor Author

@rueian rueian Apr 11, 2025


How about Unable to find the id i-0000xxxooob from 0 EC2 instances. Will retry after 1 second (1/12).

Member


  1. What does the matches variable refer to? I would expect it to be the list of EC2 instances that fulfill the filter filter(InstanceIds=[node_id]). Is my understanding correct? If so, "from" reads a bit oddly to me.

  2. "(1/12)" is not easy to understand for users.

"Attempt to fetch EC2 instances that have instance ID xxxxx. Got xxx matching EC2 instances. Will retry after yyy second. This is retry number zzz, and the maximum number of retries is qqq."

Contributor Author


Updated. Thanks!
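Putting the reviewer's suggested wording together, the final warning text could be assembled along these lines (an illustrative sketch; format_retry_warning is a hypothetical helper, and the PR itself emits the message via Ray's cli_logger):

```python
def format_retry_warning(node_id, num_matches, delay_s, attempt, max_retries):
    """Build an actionable retry warning: state what was attempted,
    what came back, and where we are in the retry schedule."""
    return (
        "Attempted to fetch the EC2 instance with instance ID {}. "
        "Got {} matching EC2 instances. Will retry after {} second(s). "
        "This is retry number {}, and the maximum number of retries is {}."
    ).format(node_id, num_matches, delay_s, attempt, max_retries)
```

Spelling out the retry count and maximum avoids the terse "(1/12)" notation the reviewer found hard to read.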

@dentiny
Contributor

dentiny commented Apr 11, 2025

Just FYI (no change requested, not a merge blocker): I wrote a retry util that supports exponential backoff and jitter: https://github.com/ray-project/ray/blob/master/release/ray_release/retry.py
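For reference, a retry decorator with exponential backoff and jitter generally looks like the sketch below (an independent illustration, not the linked ray_release utility; names and defaults are made up):

```python
import random
import time
from functools import wraps


def retry_with_backoff(max_attempts=5, base_delay_s=1.0,
                       max_delay_s=30.0, jitter_s=0.5):
    """Retry a function on any exception, doubling the delay each
    attempt and adding random jitter so concurrent callers don't
    retry in lockstep."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # Out of attempts: surface the last error.
                    delay = min(base_delay_s * (2 ** attempt), max_delay_s)
                    time.sleep(delay + random.uniform(0, jitter_s))
        return wrapper
    return decorator
```

Compared with the fixed one-second sleep in this PR, backoff with jitter spreads retries out over time, which matters when many nodes hit the same API concurrently.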

rueian added 4 commits April 10, 2025 18:58
Signed-off-by: Rueian <rueiancsie@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
@jjyao jjyao merged commit 7e29ff8 into ray-project:master Apr 11, 2025
5 checks passed
han-steve pushed a commit to han-steve/ray that referenced this pull request Apr 11, 2025
…handle eventual consistency (ray-project#52200)

Signed-off-by: Rueian <rueiancsie@gmail.com>
Signed-off-by: Steve Han <stevehan2001@gmail.com>

Labels

community-backlog go add ONLY when ready to merge, run all tests


Development

Successfully merging this pull request may close these issues.

[core] EC2 Autoscaler race condition

5 participants