feat: exponential backoff for DigitalOcean API calls #8205
dunglas wants to merge 5 commits into cert-manager:master
Conversation
[APPROVALNOTIFIER] This PR is **NOT APPROVED**. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment. The full list of commands accepted by this bot can be found here.
Hi @dunglas. Thanks for your PR. I'm waiting for a cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Pull Request Overview
This PR adds exponential backoff retry logic to DigitalOcean API calls to prevent infinite retry loops when rate limits are hit. The change configures the DigitalOcean client to automatically retry failed requests (429/500 errors) with exponential backoff, limiting retries to a maximum of 3 attempts.
Key changes:
- Enabled exponential backoff for DigitalOcean API client with a retry limit of 3 attempts
- Refactored client initialization to use inline option syntax for better readability
I haven't e2e-tested this yet, but I know how to write a test for it. I'll add it.
Force-pushed 98af1cb to 4855c23
E2E test added.
/retest
Pull Request Overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
```diff
-	client, err := godo.New(c, clientOpts...)
+	client, err := godo.New(c,
+		godo.SetUserAgent(userAgent),
+		godo.WithRetryAndBackoffs(godo.RetryConfig{RetryMax: math.MaxInt}),
```
Setting RetryMax to math.MaxInt creates an effectively infinite retry loop, which contradicts the PR's goal of preventing infinite retries. Consider using a reasonable finite value like 10-20 retries to avoid prolonged blocking operations.
```suggestion
		godo.WithRetryAndBackoffs(godo.RetryConfig{RetryMax: 10}),
```
I tend to agree with Copilot here. What is the benefit of having an infinite retry loop?
The infinite retry loop is already there because cert-manager automatically requeues failed requests.
The main benefit of looping here is to wait before retrying. The native requeue mechanism of cert-manager doesn't.
Without this, the backoff isn't exponential anymore.
Having an infinite loop here would block the reconcile, which in turn would prevent other resources from being reconciled. This could mean a network failure or upstream issue would cause cert-manager to hang for all ACME certs, even for other providers.
> The main benefit of looping here is to wait before retrying. The native requeue mechanism of cert-manager doesn't.
As I understand it cert-manager follows the standard pattern of having a backoff on reconcile attempts, which should be exponential?
IMO blocking calls in reconcile methods are bad practice; the code should be written so failures bubble up and retries are handled by requeuing with a backoff.
Indeed, you're right. I was double-checking the code and realized that the retry happens in the main loop. I've updated the code to retry only 10 times.
After further digging, the main issue seems to be that the current exponential backoff feature doesn't work for some reason.
It should be 5 seconds to 30 minutes (see `cert-manager/pkg/controller/util.go`, line 52 at commit 1db7f8d).
But this is definitely not the case. According to the issue we had, failing calls to Present() are re-queued and processed immediately, which makes the problem worse: the DigitalOcean API gets spammed with thousands of requests, triggering 429 errors for every service using it (such as external-dns).
I have not dug into this, but one reason the backoff may not be respected is that object updates always trigger a reconcile; so if we update the status with the error every time, that triggers a reconcile every time.
Signed-off-by: Kévin Dunglas <kevin@dunglas.fr>
Force-pushed 1bb47b8 to be073ab
Force-pushed 6bbd265 to 500fa6c
Should have been fixed by #8221.
Pull Request Motivation
Fixes #6230.
Currently, if cert-manager hits the rate limit of the DigitalOcean API, it falls into an infinite retry loop and sends a lot of requests to the API, making the problem worse.
For instance, this happens in the scenario described in #3848 (comment).
This patch enables the exponential backoff feature of the DigitalOcean client to mitigate the issue.
Kind
/kind feature
Release Note