[core][autoscaler] let AWS node provider retry instance retrieval to handle eventual consistency#52200
…handle eventual consistency Signed-off-by: Rueian <rueiancsie@gmail.com>
7ccdd89 to 62c9a66
Hi @kevin85421, would you mind reviewing this? This adds retries to mitigate the eventual consistency behavior of the AWS EC2 API. cc @jjyao
Signed-off-by: Rueian <rueiancsie@gmail.com>
matches = list(self.ec2.instances.filter(InstanceIds=[node_id]))
if len(matches) == 1:
    return matches[0]
cli_logger.warning(
Update the comment to make it more actionable for users. For example,
"Attempt to fetch EC2 instance associated with instance ID xxxxx. Get yy matched EC2 instances. Will retry after 1 second ..."
How about "Unable to find the id i-0000xxxooob from 0 EC2 instances. Will retry after 1 second (1/12)."?
- What does the `matches` variable refer to? I would expect it to be a list of EC2 instances that satisfy the filter `filter(InstanceIds=[node_id])`. Is my understanding correct? If so, "from" reads a bit oddly to me.
- "(1/12)" is not easy for users to understand.
"Attempt to fetch EC2 instances that have instance ID xxxxx. Got xxx matching EC2 instances. Will retry after yyy second. This is retry number zzz, and the maximum number of retries is qqq."
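For illustration only, the retry loop and the message wording suggested above could be sketched as follows. This is not the code merged in this PR: `get_single_match`, `format_retry_message`, and the injected `fetch`, `sleep`, and `warn` callables are hypothetical names, with `fetch` standing in for `list(self.ec2.instances.filter(InstanceIds=[node_id]))` so the sketch runs without AWS access. The default of 12 retries mirrors the `BOTO_MAX_RETRIES` default mentioned in the PR description.

```python
import time
from typing import Any, Callable, List


def format_retry_message(node_id: str, num_matches: int, delay_s: float,
                         retry: int, max_retries: int) -> str:
    """Build a warning in the spirit of the reviewer's suggested wording."""
    return (
        f"Attempt to fetch EC2 instances that have instance ID {node_id}. "
        f"Got {num_matches} matching EC2 instances. "
        f"Will retry after {delay_s} second(s). "
        f"This is retry number {retry}, "
        f"and the maximum number of retries is {max_retries}."
    )


def get_single_match(
    node_id: str,
    fetch: Callable[[str], List[Any]],  # hypothetical stand-in for the EC2 query
    max_retries: int = 12,              # assumption: mirrors BOTO_MAX_RETRIES
    delay_s: float = 1,
    sleep: Callable[[float], None] = time.sleep,
    warn: Callable[[str], None] = print,
) -> Any:
    """Query until exactly one instance matches, or the retries run out."""
    for retry in range(max_retries + 1):
        matches = fetch(node_id)
        if len(matches) == 1:
            return matches[0]
        if retry < max_retries:
            # Only announce a retry when one will actually happen.
            warn(format_retry_message(
                node_id, len(matches), delay_s, retry + 1, max_retries))
            sleep(delay_s)
    raise RuntimeError(
        f"Could not find exactly one EC2 instance with ID {node_id} "
        f"after {max_retries} retries."
    )
```

Injecting `sleep` and `warn` keeps the loop testable without waiting in real time; in the node provider the warning would go through `cli_logger.warning` instead.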
Just FYI (no change requested, not a merge blocker), I wrote a retry util that supports exponential backoff and jitter: https://github.com/ray-project/ray/blob/master/release/ray_release/retry.py
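For readers unfamiliar with the technique, exponential backoff with jitter can be sketched generically as below. This does not reproduce the API of the linked `retry.py`; `backoff_delays` and its parameters (`base_s`, `cap_s`, `rng`) are hypothetical names chosen for the example, and the "full jitter" variant shown (a uniform draw between zero and the exponential ceiling) is only one of several common choices.

```python
import random
from typing import List, Optional


def backoff_delays(max_retries: int, base_s: float = 1.0, cap_s: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one sleep duration per retry using capped exponential backoff.

    The ceiling for retry i is min(cap_s, base_s * 2**i); the actual delay is
    drawn uniformly from [0, ceiling] so that concurrent clients spread out.
    """
    rng = rng or random.Random()
    delays = []
    for retry in range(max_retries):
        ceiling = min(cap_s, base_s * (2 ** retry))
        delays.append(rng.uniform(0, ceiling))
    return delays
```

Compared with the fixed one-second wait in this PR, a schedule like this trades a predictable worst-case duration for fewer synchronized retries when many callers hit the API at once.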
Signed-off-by: Rueian <rueiancsie@gmail.com>
…handle eventual consistency (ray-project#52200) Signed-off-by: Rueian <rueiancsie@gmail.com> Signed-off-by: Steve Han <stevehan2001@gmail.com>
Why are these changes needed?
Fixes #51861. The issue fails on the assertion:
ray/python/ray/autoscaler/_private/aws/node_provider.py
Lines 597 to 598 in 574cf65
The assertion can only fail on `len(matches) == 0` because AWS will never return duplicate instances in the API.
As documented here, the AWS EC2 API follows an eventual consistency model. It is possible that we can't query an instance by using `ec2.instances.filter(InstanceIds=[node_id])` immediately after the instance is created. We need to retry for a few seconds. This PR reuses the `BOTO_MAX_RETRIES` environment variable, which is 12 by default, to retry the operation. That's up to 12 seconds.
Related issue number
Closes #51861
Checks
- … `git commit -s`) in this PR.
- … `scripts/format.sh` to lint the changes in this PR.
- … method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.