Add handling for 429 Retry-After errors in httpcli#51743
Conversation
There was a problem hiding this comment.
We should add some attributes to the corresponding trace span in request context if there is one for many of these error paths/outcomes (or create a new child span) - would be helpful for debugging without introducing some crazy verbose logging
There was a problem hiding this comment.
I added the header content to it.
There was a problem hiding this comment.
I think this is a logical approach and a positive change that is worth doing.
My only concern is that if we were counting on those retires and it seems intentional based on the unit tests, previously there were 20 attempts with delays in between so it covered a much larger period of time to allow success where as now it will fail if it doesn't fit within 1 max delay window.
There was a problem hiding this comment.
yeah.. I am scared of changing this core component 😀
On the bright side, we only change the behavior here if the retry-after header is set. Not sure if that helps anyhow but I can't really see how this would change a lot over the today's behavior except if in the next 20 retries it would be acceptable again but not within 3s.. 🤔
There was a problem hiding this comment.
I say go for it but it might be worth throwing in a flag to return true in the 429 path just in case so a revert wouldn't require a patch.
There was a problem hiding this comment.
actually, we can set the configurable limit to a very large number, like 2 minutes
There was a problem hiding this comment.
Perfect we just need to remember that now if something strange happens.
There was a problem hiding this comment.
I will add a changelog entry
This PR adds some basic handling of the Retry-After for 429 errors. Right now, we retry every 429 error, regardless of if the Retry-After header tells us to wait another hour, where 20 retry attempts would not help at all. We do that by parsing the header and checking if within the next 3s we would regain access. If so, we keep trying, otherwise, we stop retrying.
3a16dfd to
6be05d6
Compare
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff 5278384...fdf9dc3.
|
|
@camdencheek this fixes an issue where we would retry way too often before we accept that we are rate limited by upstream. It didn't blow up in S2, k8s and dotcom in the past two days but if you feel this one might be too risky I understand then feel free to omit it and ping me here please :) |
This PR adds some basic handling of the Retry-After for 429 errors. Right now, we retry every 429 error, regardless of if the Retry-After header tells us to wait another hour, where 20 retry attempts would not help at all. We do that by parsing the header and checking if within the next 3s we would regain access. If so, we keep trying, otherwise, we stop retrying. ## Test plan Verified that with a valid retry-after header this correctly stops retrying. Also added a test suite for all the possible cases around 429 errors. (cherry picked from commit a9917a7)
This PR adds some basic handling of the Retry-After for 429 errors. Right now, we retry every 429 error, regardless of if the Retry-After header tells us to wait another hour, where 20 retry attempts would not help at all. We do that by parsing the header and checking if within the next 3s we would regain access. If so, we keep trying, otherwise, we stop retrying.
Test plan
Verified that with a valid retry-after header this correctly stops retrying. Also added a test suite for all the possible cases around 429 errors.