
AWS Route53 - certificate renewal stuck due to clean-up failure caused by a no-longer-existing TXT record #7547

@alex-berger

Description

@alex-berger

Describe the bug:

From time to time, we observe that renewal CertificateRequest / Order / Challenge resources are stuck, not making any progress. Whenever this happens, the cert-manager controller starts to log the below pair of error messages (every 30 minutes).

{"ts":1737979207959.8896,"caller":"acmechallenges/sync.go:132","msg":"error cleaning up challenge","logger":"cert-manager.controller","resource_name":"private-ingress-kubernetes-api-46-2551527191-1226756596","resource_namespace":"gloo-system","resource_kind":"Challenge","resource_version":"v1","dnsName":"<redacted>","type":"DNS-01","err":"failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, https response error StatusCode: 400, RequestID: <REDACTED>, InvalidChangeBatch: [Tried to delete resource record set [name='_acme-challenge.k8s.<redacted>.', type='TXT', set-identifier='\"1JUZfXQWHTJUgNA1mXiOTqyl5AredD3SOFdYFqpLI-Y\"'] but it was not found]"}
{"ts":1737979207960.0994,"caller":"controller/controller.go:157","msg":"re-queuing item due to error processing","logger":"cert-manager.controller","err":"failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, https response error StatusCode: 400, RequestID: <REDACTED>, InvalidChangeBatch: [Tried to delete resource record set [name='_acme-challenge.k8s.<redacted>.', type='TXT', set-identifier='\"1JUZfXQWHTJUgNA1mXiOTqyl5AredD3SOFdYFqpLI-Y\"'] but it was not found]"}
...

So, it looks like cert-manager is trying to delete a TXT record in AWS Route53 which no longer exists, and then gets stuck in a kind of infinite retry-loop. I don't know why the TXT record no longer exists. Maybe it was deleted by cert-manager before, without also updating or deleting the corresponding CertificateRequest / Order / Challenge objects (perhaps the Pod/Container was stopped/restarted between those two actions), or maybe it was deleted by something else (out-of-band) ...

Ultimately, it does not really matter why this happens, as it evidently can happen. What matters is that cert-manager can automatically handle and resolve this kind of problem without the need for human interaction.

The offending ClusterIssuer is using Let's Encrypt with ACME DNS Challenge via AWS Route53 and looks like this:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: public
spec:
  acme:
    email: "<redacted>"
    preferredChain: ''
    privateKeySecretRef:
      name: cert-manager-public-issuer
    server: https://acme-v02.api.letsencrypt.org/directory
    solvers:
      - dns01:
          route53: {}
        selector: {}

Looking at the code, I was able to trace this down to the following lines:

	resp, err := r.client.ChangeResourceRecordSets(ctx, reqParams)
	if err != nil {
		if errors.Is(err, &route53types.InvalidChangeBatch{}) && action == route53types.ChangeActionDelete {
			log.V(logf.DebugLevel).WithValues("error", err).Info("ignoring InvalidChangeBatch error")
			// If we try to delete something and get a 'InvalidChangeBatch' that
			// means it's already deleted, no need to consider it an error.
			return nil
		}
		return fmt.Errorf("failed to change Route 53 record set: %v", removeReqID(err))
	}

And it looks like the special case if errors.Is(err, &route53types.InvalidChangeBatch{}) && action == route53types.ChangeActionDelete is not working as expected. Instead of ignoring (swallowing) the error when the TXT record no longer exists, it falls through to return fmt.Errorf("failed to change Route 53 record set: %v", removeReqID(err)).

Maybe this was introduced by deab954 (maybe not, I cannot yet tell).

Expected behaviour:

Cert-manager can automatically handle and resolve this kind of problem without the need for human interaction. This is important to ensure all certificates are renewed in time, preventing any damage caused by expired certificates.

Naively, I would assume that https://github.com/cert-manager/cert-manager/blob/v1.16.2/pkg/issuer/acme/dns/route53/route53.go#L283 must be fixed to not return an error whenever the corresponding TXT record no longer exists.

Steps to reproduce the bug:

We cannot (yet) reproduce the problem, but we observed it several times over the last few months.

Anything else we need to know?:

Environment details:

  • Kubernetes version: v1.30 and v1.31
  • Cloud-provider/provisioner: AWS (using EKS)
  • cert-manager version: v1.16.2
  • Install method: helm

/kind bug
