AWS Route53 - certificate renewal stuck due clean-up failure caused by no longer existing TXT record #7547
Description
Describe the bug:
From time to time, we observe that a renewal's CertificateRequest / Order / Challenge objects are stuck and not making any progress. Whenever this happens, the cert-manager controller logs the pair of error messages below (every 30 minutes):
```
{"ts":1737979207959.8896,"caller":"acmechallenges/sync.go:132","msg":"error cleaning up challenge","logger":"cert-manager.controller","resource_name":"private-ingress-kubernetes-api-46-2551527191-1226756596","resource_namespace":"gloo-system","resource_kind":"Challenge","resource_version":"v1","dnsName":"<redacted>","type":"DNS-01","err":"failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, https response error StatusCode: 400, RequestID: <REDACTED>, InvalidChangeBatch: [Tried to delete resource record set [name='_acme-challenge.k8s.<redacted>.', type='TXT', set-identifier='\"1JUZfXQWHTJUgNA1mXiOTqyl5AredD3SOFdYFqpLI-Y\"'] but it was not found]"}
{"ts":1737979207960.0994,"caller":"controller/controller.go:157","msg":"re-queuing item due to error processing","logger":"cert-manager.controller","err":"failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, https response error StatusCode: 400, RequestID: <REDACTED>, InvalidChangeBatch: [Tried to delete resource record set [name='_acme-challenge.k8s.<redacted>.', type='TXT', set-identifier='\"1JUZfXQWHTJUgNA1mXiOTqyl5AredD3SOFdYFqpLI-Y\"'] but it was not found]"}
...
```
So it looks like cert-manager is trying to delete a TXT record in AWS Route53 that no longer exists, and then gets stuck in an infinite retry loop. I don't know why the TXT record no longer exists. Maybe cert-manager deleted it earlier without also updating or deleting the corresponding CertificateRequest / Order / Challenge objects (perhaps the Pod/Container was stopped or restarted between the two actions), or maybe something else deleted it out-of-band.
Ultimately, the exact cause matters less than the fact that it can happen at all. What is important is that cert-manager can detect and recover from this kind of problem automatically, without the need for human interaction.
The offending ClusterIssuer is using Let's Encrypt with ACME DNS Challenge via AWS Route53 and looks like this:
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: public
spec:
  acme:
    email: "<redacted>"
    preferredChain: ''
    privateKeySecretRef:
      name: cert-manager-public-issuer
    server: https://acme-v02.api.letsencrypt.org/directory
    solvers:
      - dns01:
          route53: {}
        selector: {}
```
Looking at the code, following the trace below
- https://github.com/cert-manager/cert-manager/blob/v1.16.2/pkg/controller/acmechallenges/sync.go#L130
- https://github.com/cert-manager/cert-manager/blob/v1.16.2/pkg/issuer/acme/dns/route53/route53.go#L259
- https://github.com/cert-manager/cert-manager/blob/v1.16.2/pkg/issuer/acme/dns/route53/route53.go#L291
I was able to trace this down to the following lines of code:
```go
resp, err := r.client.ChangeResourceRecordSets(ctx, reqParams)
if err != nil {
	if errors.Is(err, &route53types.InvalidChangeBatch{}) && action == route53types.ChangeActionDelete {
		log.V(logf.DebugLevel).WithValues("error", err).Info("ignoring InvalidChangeBatch error")
		// If we try to delete something and get a 'InvalidChangeBatch' that
		// means it's already deleted, no need to consider it an error.
		return nil
	}
	return fmt.Errorf("failed to change Route 53 record set: %v", removeReqID(err))
}
```
It looks like the special case `if errors.Is(err, &route53types.InvalidChangeBatch{}) && action == route53types.ChangeActionDelete` is not working as expected: instead of ignoring (swallowing) the error when the TXT record no longer exists, the code falls through to `return fmt.Errorf("failed to change Route 53 record set: %v", removeReqID(err))`.
Maybe this was introduced by deab954 (maybe not; I cannot yet tell).
Expected behaviour:
Cert-manager should automatically handle and recover from this kind of problem without the need for human interaction. This is important to ensure all certificates are renewed in time, preventing any damage caused by expired certificates.
Naively, I would assume that https://github.com/cert-manager/cert-manager/blob/v1.16.2/pkg/issuer/acme/dns/route53/route53.go#L283 must be fixed so that it does not return an error whenever the corresponding TXT record no longer exists.
Steps to reproduce the bug:
We cannot reproduce the problem (yet), but we have observed it several times over the last few months.
Anything else we need to know?:
Environment details:
- Kubernetes version: v1.30 and v1.31
- Cloud-provider/provisioner: AWS (using EKS)
- cert-manager version: v1.16.2
- Install method: helm
/kind bug