Skip to content

feat: exponential backoff for DigitalOcean API calls#8205

Closed
dunglas wants to merge 5 commits intocert-manager:masterfrom
dunglas:feat/do-retry
Closed

feat: exponential backoff for DigitalOcean API calls#8205
dunglas wants to merge 5 commits intocert-manager:masterfrom
dunglas:feat/do-retry

Conversation

@dunglas
Copy link
Copy Markdown
Contributor

@dunglas dunglas commented Oct 24, 2025

Pull Request Motivation

Fixes #6230.

Currently, if cert-manager hits the rate limit of the DigitalOcean API, it falls into an infinite retry loop and sends a lot of requests to the API, increasing the problem.

For instance, this happens in the scenario described in #3848 (comment).

This patch enables the exponential backoff feature of the DigitalOcean client to mitigate the issue.

Kind

/kind feature

Release Note

Added exponential backoff when encountering 429 or 500 errors when requesting the DigitalOcean API.

@cert-manager-prow cert-manager-prow bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. kind/feature Categorizes issue or PR as related to a new feature. area/acme Indicates a PR directly modifies the ACME Issuer code area/acme/dns01 Indicates a PR modifies ACME DNS01 provider code labels Oct 24, 2025
@cert-manager-prow
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sgtcodfish for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cert-manager-prow cert-manager-prow bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 24, 2025
@cert-manager-prow
Copy link
Copy Markdown
Contributor

Hi @dunglas. Thanks for your PR.

I'm waiting for a cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@cert-manager-prow cert-manager-prow bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 24, 2025
@erikgb erikgb requested a review from Copilot October 24, 2025 15:34
@erikgb
Copy link
Copy Markdown
Member

erikgb commented Oct 24, 2025

/ok-to-test

@cert-manager-prow cert-manager-prow bot added ok-to-test and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 24, 2025
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds exponential backoff retry logic to DigitalOcean API calls to prevent infinite retry loops when rate limits are hit. The change configures the DigitalOcean client to automatically retry failed requests (429/500 errors) with exponential backoff, limiting retries to a maximum of 3 attempts.

Key changes:

  • Enabled exponential backoff for DigitalOcean API client with a retry limit of 3 attempts
  • Refactored client initialization to use inline option syntax for better readability

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@erikgb
Copy link
Copy Markdown
Member

erikgb commented Oct 24, 2025

Thanks for this, @dunglas! The change looks good to me, but I would love to get a second opinion from @inteon. Have you been able to test this end-to-end with DigitalOcean? If this fixes the issue, we should consider back-porting the fix. WDYT @inteon?

/cc @inteon

@cert-manager-prow cert-manager-prow bot requested a review from inteon October 24, 2025 15:46
@dunglas
Copy link
Copy Markdown
Contributor Author

dunglas commented Oct 24, 2025

I didn't e2e-tested this but I know how to write a test doing this. I'll add it.

@cert-manager-prow cert-manager-prow bot added dco-signoff: no Indicates that at least one commit in this pull request is missing the DCO sign-off message. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 25, 2025
@cert-manager-prow cert-manager-prow bot added dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. and removed dco-signoff: no Indicates that at least one commit in this pull request is missing the DCO sign-off message. labels Oct 27, 2025
@dunglas
Copy link
Copy Markdown
Contributor Author

dunglas commented Oct 27, 2025

E2E test added.

@dunglas
Copy link
Copy Markdown
Contributor Author

dunglas commented Oct 27, 2025

/retest

@erikgb erikgb requested a review from Copilot October 27, 2025 18:31
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

client, err := godo.New(c, clientOpts...)
client, err := godo.New(c,
godo.SetUserAgent(userAgent),
godo.WithRetryAndBackoffs(godo.RetryConfig{RetryMax: math.MaxInt}),
Copy link

Copilot AI Oct 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting RetryMax to math.MaxInt creates an effectively infinite retry loop, which contradicts the PR's goal of preventing infinite retries. Consider using a reasonable finite value like 10-20 retries to avoid prolonged blocking operations.

Suggested change
godo.WithRetryAndBackoffs(godo.RetryConfig{RetryMax: math.MaxInt}),
godo.WithRetryAndBackoffs(godo.RetryConfig{RetryMax: 10}),

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree with Copilot here. What is the benefit of having an infinite retry loop?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The infinite retry loop is already there because cert-manager automatically requeue failed requests.

The main benefit of looping here is to wait before retrying. The native requeue mechanism of cert-manager doesn't.

Without this, the backoff isn't exponential anymore.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having an infinite loop here would block the reconcile, which in turn would prevent other resources being reconciled. This could mean a network failure or upstream issue would cause cert-manager to hang for all ACME certs, even for other providers.

The main benefit of looping here is to wait before retrying. The native requeue mechanism of cert-manager doesn't.

As I understand it cert-manager follows the standard pattern of having a backoff on reconcile attempts, which should be exponential?

IMO blocking calls in reconcile methods is bad practice, it should be written so failures bubble up and retries are handled by requeuing with a backoff

Copy link
Copy Markdown
Contributor Author

@dunglas dunglas Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, you're right. I was double-checking the code and figured out that it was in the main loop. I updated the code to only try 10 times.

After further digging, the main issue seems to be that the current exponential backoff feature doesn't work for some reason.

It should be 5 seconds to 30 minutes:

return workqueue.NewTypedItemExponentialFailureRateLimiter[types.NamespacedName](time.Second*5, time.Minute*30)

But this is definitely not the case. According to the issue we had, failing calls to Present() are re-queued and processed immediately, which increases the issue because the DigitalOcean API is then spammed with thousands of requests, triggering 429 errors for all services (such as external-dns) using it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not dug into this - but one reason the backoff may not be respected is object updates would trigger a reconcile always, so if we update the status with the error every time, it would trigger a reconcile every time

@cert-manager-prow cert-manager-prow bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 28, 2025
Signed-off-by: Kévin Dunglas <kevin@dunglas.fr>
Signed-off-by: Kévin Dunglas <kevin@dunglas.fr>
Signed-off-by: Kévin Dunglas <kevin@dunglas.fr>

# Conflicts:
#	pkg/issuer/acme/dns/digitalocean/digitalocean_test.go
Signed-off-by: Kévin Dunglas <kevin@dunglas.fr>
@cert-manager-prow cert-manager-prow bot added dco-signoff: no Indicates that at least one commit in this pull request is missing the DCO sign-off message. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. labels Oct 28, 2025
Signed-off-by: Kévin Dunglas <kevin@dunglas.fr>
@cert-manager-prow cert-manager-prow bot added dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. and removed dco-signoff: no Indicates that at least one commit in this pull request is missing the DCO sign-off message. labels Oct 28, 2025
@dunglas
Copy link
Copy Markdown
Contributor Author

dunglas commented Nov 4, 2025

Should have been fixed by #8221.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/acme/dns01 Indicates a PR modifies ACME DNS01 provider code area/acme Indicates a PR directly modifies the ACME Issuer code dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DigitalOcean: cert-manager DDoSes DNS-01 solver - infinite rate limiting

4 participants