Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Add handling for 429 Retry-After errors in httpcli#51743

Merged
eseliger merged 6 commits into
mainfrom
es/retry-after-handling-httpcli
May 10, 2023
Merged

Add handling for 429 Retry-After errors in httpcli#51743
eseliger merged 6 commits into
mainfrom
es/retry-after-handling-httpcli

Conversation

@eseliger

@eseliger eseliger commented May 10, 2023

Copy link
Copy Markdown
Member

This PR adds some basic handling of the Retry-After for 429 errors. Right now, we retry every 429 error, regardless of if the Retry-After header tells us to wait another hour, where 20 retry attempts would not help at all. We do that by parsing the header and checking if within the next 3s we would regain access. If so, we keep trying, otherwise, we stop retrying.

Test plan

Verified that with a valid retry-after header this correctly stops retrying. Also added a test suite for all the possible cases around 429 errors.

@cla-bot cla-bot Bot added the cla-signed label May 10, 2023
Comment thread internal/httpcli/client.go Outdated
Comment on lines 613 to 642

@bobheadxi bobheadxi May 10, 2023

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add some attributes to the corresponding trace span in request context if there is one for many of these error paths/outcomes (or create a new child span) - would be helpful for debugging without introducing some crazy verbose logging

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the header content to it.

Comment thread internal/httpcli/client.go Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a logical approach and a positive change that is worth doing.

My only concern is that if we were counting on those retires and it seems intentional based on the unit tests, previously there were 20 attempts with delays in between so it covered a much larger period of time to allow success where as now it will fail if it doesn't fit within 1 max delay window.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah.. I am scared of changing this core component 😀
On the bright side, we only change the behavior here if the retry-after header is set. Not sure if that helps anyhow but I can't really see how this would change a lot over the today's behavior except if in the next 20 retries it would be acceptable again but not within 3s.. 🤔

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I say go for it but it might be worth throwing in a flag to return true in the 429 path just in case so a revert wouldn't require a patch.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good call!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, we can set the configurable limit to a very large number, like 2 minutes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect we just need to remember that now if something strange happens.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a changelog entry

This PR adds some basic handling of the Retry-After for 429 errors. Right now, we retry every 429 error, regardless of if the Retry-After header tells us to wait another hour, where 20 retry attempts would not help at all.
We do that by parsing the header and checking if within the next 3s we would regain access. If so, we keep trying, otherwise, we stop retrying.
@eseliger eseliger force-pushed the es/retry-after-handling-httpcli branch from 3a16dfd to 6be05d6 Compare May 10, 2023 21:43
@eseliger eseliger marked this pull request as ready for review May 10, 2023 22:19
@eseliger eseliger requested review from a team, Strum355, bobheadxi, chwarwick and unknwon May 10, 2023 22:19
@sourcegraph-bot

sourcegraph-bot commented May 10, 2023

Copy link
Copy Markdown
Contributor

Codenotify: Notifying subscribers in CODENOTIFY files for diff 5278384...fdf9dc3.

Notify File(s)
@keegancsmith internal/httpcli/BUILD.bazel
internal/httpcli/client.go
internal/httpcli/client_test.go

@chwarwick chwarwick left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@eseliger eseliger merged commit a9917a7 into main May 10, 2023
@eseliger eseliger deleted the es/retry-after-handling-httpcli branch May 10, 2023 23:40
@eseliger

Copy link
Copy Markdown
Member Author

@camdencheek this fixes an issue where we would retry way too often before we accept that we are rate limited by upstream. It didn't blow up in S2, k8s and dotcom in the past two days but if you feel this one might be too risky I understand then feel free to omit it and ping me here please :)

camdencheek pushed a commit that referenced this pull request May 16, 2023
This PR adds some basic handling of the Retry-After for 429 errors.
Right now, we retry every 429 error, regardless of if the Retry-After
header tells us to wait another hour, where 20 retry attempts would not
help at all. We do that by parsing the header and checking if within the
next 3s we would regain access. If so, we keep trying, otherwise, we
stop retrying.

## Test plan

Verified that with a valid retry-after header this correctly stops
retrying. Also added a test suite for all the possible cases around 429
errors.

(cherry picked from commit a9917a7)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants