Newly elected cert-manager leader replica fails to issue pending CertificateRequest for a CA Issuer when the Kubernetes cluster has a lot of Secret objects #5216
Description
Describe the bug:
Suppose a new cert-manager replica is elected leader for any reason, for example because the previous replica went down or the cluster's nodes were rotated. If a CertificateRequest is created while there is no leader, a window that can last from a few seconds up to several minutes, it stays in the Pending state, as expected. Once a new leader is elected, it should take action and issue the CertificateRequests that were created in the interim. Instead, the new leader fails to do so and reports the error Referenced secret $NAMESPACE/$CA_SECRET not found: secret "$CA_SECRET" not found.
Disclaimer: This bug only happens under certain conditions, mostly related to the total number of Secrets across all namespaces of the cluster and how large they are.
Expected behaviour:
I expected that, under any conditions, a pending CertificateRequest is issued and becomes Ready once a new cert-manager replica is elected leader.
Steps to reproduce the bug:
The bug is fairly easy to reproduce with Kind, using the manifests and bash script below. In short: create a lot of Secrets in the cluster, bring cert-manager down, create the CertificateRequest, and then bring cert-manager up again. The CertificateRequest is never issued and stays stuck with the error status mentioned above.
- Manifests that create the sandbox namespace and deploy a selfsigned-issuer, a CA Certificate, a second Issuer that uses the CA secret, and a CertificateRequest. The bash script below references this snippet as the cert-manager-issuer-case.yaml file:
apiVersion: v1
kind: Namespace
metadata:
  name: sandbox
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer
  namespace: sandbox
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: istio-ca
  namespace: sandbox
spec:
  isCA: true
  commonName: istio-ca
  secretName: istio-ca
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned-issuer
    kind: Issuer
    group: cert-manager.io
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: istio-ca
  namespace: sandbox
spec:
  ca:
    secretName: istio-ca
---
apiVersion: cert-manager.io/v1
kind: CertificateRequest
metadata:
  name: istio-csr-sq6jr
  namespace: sandbox
spec:
  duration: 24h0m0s
  extra:
    authentication.kubernetes.io/pod-name:
    - istio-csr-6b849f4d8f-slz6z
    authentication.kubernetes.io/pod-uid:
    - aa641ccc-23d5-41c9-be2c-3053f765849f
  groups:
  - system:serviceaccounts
  - system:serviceaccounts:istio-system
  - system:authenticated
  issuerRef:
    group: cert-manager.io
    kind: Issuer
    name: istio-ca
  request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQ2l6Q0NBWE1DQVFBd0N6RUpNQWNHQTFVRUNoTUFNSUlCSWpBTkJna3Foa2lHOXcwQkFRRUZBQU9DQVE4QQpNSUlCQ2dLQ0FRRUF3dm9WZlc4VkZLU2d1SGZiZHkwZS91aTRoZmg5a2dVNExZZ2MxanJ2a1ZhUzYrZTRqTG5SCk9BeG5NU0g3cmszRTN5bC9VcXQ1SmIyaGdMMlRwVjFEVUJyOEVGR3o5c1Bna0RDaENPVWZGN0JoRWJyQ3M3YncKSS83QVR1aFdvZ09Eb0ZLTzd5WHk3Q21VRWVFbi9hRHRGQ3ZYeXVrVmF0TEQzMm9jeFgwbTBjeTY0TEF0TVl2VgpSRFBjMnpwaGhLMEw1cCtLeXBlUXFwUnU0eFl0TjhiQVNpeVRpVEVJcVVNQlg5MVRwbTV3WUxXa1lmcSt3Ujg5CmFIdHFUay9OODZXdXRNL1BsanZORkpUZ3EzcFBpRE03SW1MeCtDdjdCNlg0YkZoMGo4OE9Db2Q2QUJGZEh5WloKdU5MU0NydFVpNzM4Ukp1T2VKK1VWeFJCM1NhOTQ3dzVvUUlEQVFBQm9Ec3dPUVlKS29aSWh2Y05BUWtPTVN3dwpLakFvQmdOVkhSRUJBZjhFSGpBY2docHBjM1JwYnkxamMzSXVhWE4wYVc4dGMzbHpkR1Z0TG5OMll6QU5CZ2txCmhraUc5dzBCQVFzRkFBT0NBUUVBYmZGTnlPQzgvZ25oUFNkRXdWQVc1MWpvaDZrTlhHbnpPWWpSMTRtSGEzS3QKR3E2aUpEQ0tENWNxTnRxdzdlTktvcHJsWUtxbFpNYzltY3dZSGFoVmcwRm1BTFhWQ3haSnhQZEphVjZwUkJTOApKREFJZGdneWNnVzVsckRKcWlRNFphSDhCZ2RtcS9WcDUzZ0Urb1MxWE5ML2VtaWZPVzQ4ZTR1dFM2U2ZDMGJ1CkdFTmhJL0h5RW52QlpGK3dkY3ZJaDlvaTlwTkZGUkxkZlhZQXpESEpiOTdhcnp5Q3hSVzc1akZqWUo4bC8zSzQKZDBFTjZ3c2ZSNDBMMmFQdURWazVjS2JZdGNkVmd6TGI4S25IQkxKNk1iNGl6b0FaS3NUNGNSbmw3Z3dXMW9oaQplQWVqeUs0bGFIOE8wK01JUWRjalg3QjVldkk3YS9WelFYVmh3Y1dIOFE9PQotLS0tLUVORCBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0K
  uid: 48e8b123-fdca-40d9-9c0a-ab8694c67320
  usages:
  - server auth
  username: system:serviceaccount:istio-system:istio-csr
- test case scenario:
#!/bin/bash
kind create cluster \
  --name playground --image "kindest/node:v1.21.1@sha256:69860bda5563ac81e3c0057d654b5253219618a22ec3a346306239bba8cfa1a6" \
  --wait 5m
helm install \
  cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --version v1.8.1 \
  --set installCRDs=true \
  --wait
# create a lot of secrets, each with fairly large content (skipped if they already exist)
if ! kubectl -n default get secret my-secret-100 > /dev/null 2>&1; then
  for i in {1..100}
  do
    cat /dev/urandom | env LC_ALL=C tr -dc 'a-zA-Z0-9' | fold -w 800000 | head -n 1 > file${i}.txt
    kubectl -n default create secret generic my-secret-${i} --from-file=key1=file${i}.txt
    rm file${i}.txt
  done
fi
# create the sandbox namespace and deploy a selfsigned-issuer, a CA Certificate, another Issuer using the CA secret,
# and finally a CertificateRequest
kubectl apply -f cert-manager-issuer-case.yaml
# while a cert-manager pod is running and a leader is elected, the CertificateRequest is issued without problems
kubectl -n sandbox wait CertificateRequest istio-csr-sq6jr --for condition=Ready
# however, delete the CertificateRequest and stop the cert-manager leader
kubectl -n sandbox delete CertificateRequest istio-csr-sq6jr
kubectl -n cert-manager scale deploy/cert-manager --replicas=0
sleep 15
# create the CertificateRequest again while no healthy cert-manager leader is elected
kubectl apply -f cert-manager-issuer-case.yaml
# once the new leader is elected, the CertificateRequest is not issued and never becomes Ready
kubectl -n cert-manager scale deploy/cert-manager --replicas=1
sleep 15
kubectl -n sandbox get CertificateRequest -o wide -w
# the output of the command above should show the following error in the status column:
# Referenced secret sandbox/istio-ca not found: secret "istio-ca" not found
Anything else we need to know?:
While analysing the cert-manager code, I found that the issue seems to be a race condition in how the secretsLister cache, which is used here to find the CA secret, is populated asynchronously. Under certain conditions, the CA CertificateRequests controller can try to sign a CertificateRequest before the secretsLister cache has been completely populated. That is what happens in my case, where the cluster has more than 400 Secrets, most of them large Secrets containing Helm release data.
Environment details:
- Kubernetes version: k8s 1.21.x, tested on 1.21.12 and 1.21.1
- Cloud-provider/provisioner: AWS EKS and Kubernetes Kind
- cert-manager version: 1.7.1 and 1.8.1
- Install method: helm
/kind bug

