
AWS Route53 - certificate renewal stuck due to clean-up failure caused by a no-longer-existing TXT record #7547

@alex-berger

Description

@alex-berger

Describe the bug:

From time to time, we observe that renewal CertificateRequest / Order / Challenge resources are stuck, not making any progress. Whenever this happens, the cert-manager controller starts to log the below pair of error messages (every 30 minutes).

{"ts":1737979207959.8896,"caller":"acmechallenges/sync.go:132","msg":"error cleaning up challenge","logger":"cert-manager.controller","resource_name":"private-ingress-kubernetes-api-46-2551527191-1226756596","resource_namespace":"gloo-system","resource_kind":"Challenge","resource_version":"v1","dnsName":"<redacted>","type":"DNS-01","err":"failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, https response error StatusCode: 400, RequestID: <REDACTED>, InvalidChangeBatch: [Tried to delete resource record set [name='_acme-challenge.k8s.<redacted>.', type='TXT', set-identifier='\"1JUZfXQWHTJUgNA1mXiOTqyl5AredD3SOFdYFqpLI-Y\"'] but it was not found]"}
{"ts":1737979207960.0994,"caller":"controller/controller.go:157","msg":"re-queuing item due to error processing","logger":"cert-manager.controller","err":"failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, https response error StatusCode: 400, RequestID: <REDACTED>, InvalidChangeBatch: [Tried to delete resource record set [name='_acme-challenge.k8s.<redacted>.', type='TXT', set-identifier='\"1JUZfXQWHTJUgNA1mXiOTqyl5AredD3SOFdYFqpLI-Y\"'] but it was not found]"}
...

So, it looks like cert-manager is trying to delete a TXT record in AWS Route53 which no longer exists, and then gets stuck in a kind of infinite retry-loop. I don't know why the TXT record no longer exists. Maybe it was deleted by cert-manager before, without also updating or deleting the corresponding CertificateRequest / Order / Challenge objects (perhaps the Pod/Container was stopped/restarted between those two actions), or maybe it was deleted by something else (out-of-band) ...

Ultimately, it does not really matter why this happens, as it evidently can happen. What matters is that cert-manager can automatically handle and resolve this kind of problem without the need for human interaction.

The offending ClusterIssuer is using Let's Encrypt with ACME DNS Challenge via AWS Route53 and looks like this:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: public
spec:
  acme:
    email: "<redacted>"
    preferredChain: ''
    privateKeySecretRef:
      name: cert-manager-public-issuer
    server: https://acme-v02.api.letsencrypt.org/directory
    solvers:
      - dns01:
          route53: {}
        selector: {}

Looking at the code, I was able to trace this down to the following lines:

	resp, err := r.client.ChangeResourceRecordSets(ctx, reqParams)
	if err != nil {
		if errors.Is(err, &route53types.InvalidChangeBatch{}) && action == route53types.ChangeActionDelete {
			log.V(logf.DebugLevel).WithValues("error", err).Info("ignoring InvalidChangeBatch error")
			// If we try to delete something and get a 'InvalidChangeBatch' that
			// means it's already deleted, no need to consider it an error.
			return nil
		}
		return fmt.Errorf("failed to change Route 53 record set: %v", removeReqID(err))
	}

And it looks like the special case if errors.Is(err, &route53types.InvalidChangeBatch{}) && action == route53types.ChangeActionDelete is not working as expected. Instead of ignoring (swallowing) the error when the TXT record no longer exists, it falls through to return fmt.Errorf("failed to change Route 53 record set: %v", removeReqID(err)).

Maybe this was introduced by deab954 (maybe not, I cannot yet tell).

Expected behaviour:

Cert-manager can automatically handle and resolve this kind of problem without the need for human interaction. This is important to ensure all certificates are renewed in time, preventing any damage caused by expired certificates.

Naively, I would assume that https://github.com/cert-manager/cert-manager/blob/v1.16.2/pkg/issuer/acme/dns/route53/route53.go#L283 must be fixed to not return an error whenever the corresponding TXT record no longer exists.

Steps to reproduce the bug:

We cannot (yet) reproduce the problem, but we observed it several times over the last few months.

Anything else we need to know?:

Environment details:

  • Kubernetes version: v1.30 and v1.31
  • Cloud-provider/provisioner: AWS (using EKS)
  • cert-manager version: v1.16.2
  • Install method: helm

/kind bug
