More than one Certificate nominating same Secret induces runaway creation of many CertificateRequests and Orders #4846
Description
Describe the bug:
As first noted in the "cert-manager" channel in the "Kubernetes" Slack workspace, we suffered a cluster outage that turned out to be due to cert-manager.
Someone created two Certificate objects in the same namespace that designated the same Secret name in the “spec.secretName” field. These Certificates use a ClusterIssuer with an ACME solver (Let's Encrypt). We later witnessed that cert-manager was creating several thousand CertificateRequests and Orders per hour, overwhelming etcd's ability to compact its database.
Deleting one of the two Certificate objects stopped cert-manager from creating these CertificateRequests and Orders. By that point, over 25,000 of them had accumulated. We had to delete them all manually to get things back under control.
I understand that the configuration is malformed once more than one Certificate is taken into consideration, even though each of the two Certificates appears valid in isolation, but we’re now worried that cert-manager can run away on us like this again. We’re considering whether we can catch this problem at admission time with a webhook via Kyverno or Gatekeeper. Ideally, cert-manager would notice this conflicting configuration, log messages about it, post conditions to the Certificate (and maybe Secret) objects, and increment a Prometheus metric, rather than spin creating these objects so feverishly.
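As a rough illustration of the check that an admission policy or a periodic audit could make, the following Python sketch groups Certificates by namespace and `spec.secretName` and reports any Secret nominated by more than one Certificate. It assumes the input is the `items` list from `kubectl get certificates --all-namespaces -o json`. The function name `find_shared_secrets` is hypothetical and is not part of cert-manager, Kyverno, or Gatekeeper.

```python
from collections import defaultdict


def find_shared_secrets(certificates):
    """Return {(namespace, secretName): [certificate names]} for every
    Secret that more than one Certificate nominates."""
    owners = defaultdict(list)
    for cert in certificates:
        meta = cert.get("metadata", {})
        secret = cert.get("spec", {}).get("secretName")
        if not secret:
            continue  # nothing to compare without a secretName
        key = (meta.get("namespace", "default"), secret)
        owners[key].append(meta.get("name", ""))
    # Keep only the Secrets claimed by two or more Certificates.
    return {key: sorted(names) for key, names in owners.items() if len(names) > 1}


# Hypothetical sample data shaped like the kubectl JSON output.
certs = [
    {"metadata": {"namespace": "web", "name": "site-a"}, "spec": {"secretName": "site-tls"}},
    {"metadata": {"namespace": "web", "name": "site-b"}, "spec": {"secretName": "site-tls"}},
    {"metadata": {"namespace": "web", "name": "api"}, "spec": {"secretName": "api-tls"}},
]
print(find_shared_secrets(certs))
```

A real admission webhook would see only one object per request and would have to look up the existing Certificates itself, so this sketch fits an audit job more directly than a webhook.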
Expected behavior:
cert-manager would notice that two or more Certificates nominate the same Secret, and signal to the cluster users that it can't reconcile the configuration. The following signals would be helpful:
- Condition on the Certificate
- Condition on the Secret
- Log messages
- Prometheus metrics
  - Gauge counting invalid sharing arrangements
  - Counter of reconciliation attempts that failed due to this invalid sharing
Some of these may already exist. In our frenzy to restore our cluster to normal operation again, we couldn't pause long to look for these signals. In hindsight, though, these are what come to mind.
Steps to reproduce the bug:
- Create an Issuer or ClusterIssuer using an ACME solver.
- Create two Certificate objects in the same namespace nominating the same Secret in their "spec.secretName" field.
- Confirm that cert-manager issues a certificate and creates the Secret on behalf of one of the two Certificates.
- Using a client like kubectl, watch the count of the CertificateRequest and Order objects in the same namespace climb.
  Alternately, watch the `etcd_object_counts` Prometheus metric with a "resource" label constraint such as `resource=~"(certificaterequests|orders\\.acme)\\.cert-manager\\.io"`. In our case, we'd see cert-manager create around 3,000 of these objects over the course of an hour, then stop for a couple of hours, then start again.
- Delete one of the two Certificates.
- Observe that cert-manager stops creating CertificateRequests and Orders for this namespace.
- Delete all the CertificateRequests and Orders in this namespace.
- Observe that cert-manager does not create more CertificateRequests and Orders for this namespace.
That is, unless required to issue other certificates.
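The object counts in the steps above can be tallied with a short script instead of by eye. This is a minimal sketch, assuming the input is the `items` list from `kubectl get certificaterequests,orders --all-namespaces -o json`. The helper name `count_by_kind` and the sample data are hypothetical.

```python
from collections import Counter


def count_by_kind(items, namespace):
    """Count the objects of each kind that live in the given namespace."""
    return Counter(
        item.get("kind")
        for item in items
        if item.get("metadata", {}).get("namespace") == namespace
    )


# Hypothetical sample shaped like the kubectl JSON output.
sample = [
    {"kind": "CertificateRequest", "metadata": {"namespace": "web", "name": "site-a-1"}},
    {"kind": "CertificateRequest", "metadata": {"namespace": "web", "name": "site-b-1"}},
    {"kind": "Order", "metadata": {"namespace": "web", "name": "site-a-1-123"}},
    {"kind": "Order", "metadata": {"namespace": "other", "name": "x-1-456"}},
]
print(count_by_kind(sample, "web"))
```

Running it repeatedly against fresh output should show the counts climbing while both Certificates exist, and holding steady after one is deleted.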
Anything else we need to know?:
We have seen similar trouble with more than one Ingress in the same namespace using the same set of DNS names. cert-manager doesn't spin creating all these objects, but it does complain repeatedly in its logs about how the Certificate it creates is owned by the wrong Ingress, with log entries like the following:
sync.go:374] cert-manager/controller/ingress-shim "msg"="certificate resource is not owned by this object. refusing to update non-owned certificate resource for object" "related_resource_kind"="Certificate"
That's another example of a Kubernetes object's configuration being valid in isolation, but not when pitted against competing objects in the same namespace.
Environment details:
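The Ingress case could be audited in a similar way. The sketch below assumes the input is the `items` list from `kubectl get ingresses --all-namespaces -o json`, and that a conflict means two Ingresses in one namespace listing the same set of DNS names in a `spec.tls` entry, as described above. The function name `find_overlapping_ingresses` is hypothetical.

```python
from collections import defaultdict


def find_overlapping_ingresses(ingresses):
    """Return {(namespace, frozenset(hosts)): [ingress names]} for every
    set of TLS hosts that more than one Ingress in a namespace declares."""
    groups = defaultdict(set)
    for ing in ingresses:
        meta = ing.get("metadata", {})
        for tls in ing.get("spec", {}).get("tls") or []:
            hosts = frozenset(tls.get("hosts") or [])
            if hosts:  # ignore TLS entries that list no hosts
                key = (meta.get("namespace", "default"), hosts)
                groups[key].add(meta.get("name", ""))
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```

Comparing host sets rather than lists means that two Ingresses naming the same hosts in a different order are still reported together.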
- Kubernetes version: 1.19.7
- Cloud-provider/provisioner: AWS/kOps
- cert-manager version: 1.6.1
- Install method: static manifests tailored by kustomize, reconciled via Flux
/kind bug