Newly elected cert-manager leader replica fails to issue pending CertificateRequest for a CA Issuer when the Kubernetes cluster has a lot of Secret objects #5216

@rodrigorfk

Describe the bug:

Suppose a new cert-manager replica is elected leader for any reason, for example because the previous replica went down or the cluster's nodes were rotated. If a CertificateRequest is created while there is no leader (a window that can last from a few seconds up to several minutes), it stays in the Pending state, as expected. Once a new leader is elected, it should pick up and issue any CertificateRequests that were created in the interim. Instead, the new leader fails to do so and reports the following error: Referenced secret $NAMESPACE/$CA_SECRET not found: secret "$CA_SECRET" not found
Disclaimer: This bug only happens under certain conditions, mostly related to the total number of Secrets across all namespaces of the cluster and how large they are.

Expected behaviour:
I expected that, under any conditions, a pending CertificateRequest would always be issued and become Ready once a new cert-manager replica is elected leader.

Steps to reproduce the bug:
The bug is fairly easy to reproduce using Kind with the manifests and bash script below. In short: create a lot of Secrets in the cluster, bring cert-manager down, create the CertificateRequest, and then bring cert-manager up again. The CertificateRequest will never be issued and will remain stuck with the error status mentioned above.

  • Manifests that create the sandbox namespace and deploy a self-signed Issuer, a CA Certificate, another Issuer that uses the CA secret, and a CertificateRequest. The bash script below references this snippet as the file cert-manager-issuer-case.yaml:
apiVersion: v1
kind: Namespace
metadata:
  name: sandbox
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer
  namespace: sandbox
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: istio-ca
  namespace: sandbox
spec:
  isCA: true
  commonName: istio-ca
  secretName: istio-ca
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned-issuer
    kind: Issuer
    group: cert-manager.io
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: istio-ca
  namespace: sandbox
spec:
  ca:
    secretName: istio-ca
---
apiVersion: cert-manager.io/v1
kind: CertificateRequest
metadata:
  name: istio-csr-sq6jr
  namespace: sandbox
spec:
  duration: 24h0m0s
  extra:
    authentication.kubernetes.io/pod-name:
    - istio-csr-6b849f4d8f-slz6z
    authentication.kubernetes.io/pod-uid:
    - aa641ccc-23d5-41c9-be2c-3053f765849f
  groups:
  - system:serviceaccounts
  - system:serviceaccounts:istio-system
  - system:authenticated
  issuerRef:
    group: cert-manager.io
    kind: Issuer
    name: istio-ca
  request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQ2l6Q0NBWE1DQVFBd0N6RUpNQWNHQTFVRUNoTUFNSUlCSWpBTkJna3Foa2lHOXcwQkFRRUZBQU9DQVE4QQpNSUlCQ2dLQ0FRRUF3dm9WZlc4VkZLU2d1SGZiZHkwZS91aTRoZmg5a2dVNExZZ2MxanJ2a1ZhUzYrZTRqTG5SCk9BeG5NU0g3cmszRTN5bC9VcXQ1SmIyaGdMMlRwVjFEVUJyOEVGR3o5c1Bna0RDaENPVWZGN0JoRWJyQ3M3YncKSS83QVR1aFdvZ09Eb0ZLTzd5WHk3Q21VRWVFbi9hRHRGQ3ZYeXVrVmF0TEQzMm9jeFgwbTBjeTY0TEF0TVl2VgpSRFBjMnpwaGhLMEw1cCtLeXBlUXFwUnU0eFl0TjhiQVNpeVRpVEVJcVVNQlg5MVRwbTV3WUxXa1lmcSt3Ujg5CmFIdHFUay9OODZXdXRNL1BsanZORkpUZ3EzcFBpRE03SW1MeCtDdjdCNlg0YkZoMGo4OE9Db2Q2QUJGZEh5WloKdU5MU0NydFVpNzM4Ukp1T2VKK1VWeFJCM1NhOTQ3dzVvUUlEQVFBQm9Ec3dPUVlKS29aSWh2Y05BUWtPTVN3dwpLakFvQmdOVkhSRUJBZjhFSGpBY2docHBjM1JwYnkxamMzSXVhWE4wYVc4dGMzbHpkR1Z0TG5OMll6QU5CZ2txCmhraUc5dzBCQVFzRkFBT0NBUUVBYmZGTnlPQzgvZ25oUFNkRXdWQVc1MWpvaDZrTlhHbnpPWWpSMTRtSGEzS3QKR3E2aUpEQ0tENWNxTnRxdzdlTktvcHJsWUtxbFpNYzltY3dZSGFoVmcwRm1BTFhWQ3haSnhQZEphVjZwUkJTOApKREFJZGdneWNnVzVsckRKcWlRNFphSDhCZ2RtcS9WcDUzZ0Urb1MxWE5ML2VtaWZPVzQ4ZTR1dFM2U2ZDMGJ1CkdFTmhJL0h5RW52QlpGK3dkY3ZJaDlvaTlwTkZGUkxkZlhZQXpESEpiOTdhcnp5Q3hSVzc1akZqWUo4bC8zSzQKZDBFTjZ3c2ZSNDBMMmFQdURWazVjS2JZdGNkVmd6TGI4S25IQkxKNk1iNGl6b0FaS3NUNGNSbmw3Z3dXMW9oaQplQWVqeUs0bGFIOE8wK01JUWRjalg3QjVldkk3YS9WelFYVmh3Y1dIOFE9PQotLS0tLUVORCBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0K
  uid: 48e8b123-fdca-40d9-9c0a-ab8694c67320
  usages:
  - server auth
  username: system:serviceaccount:istio-system:istio-csr
  • Test case scenario:
#!/bin/bash

kind create cluster \
  --name playground --image "kindest/node:v1.21.1@sha256:69860bda5563ac81e3c0057d654b5253219618a22ec3a346306239bba8cfa1a6" \
  --wait 5m

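# assumption: the jetstack chart repository may not be configured yet on this machine
helm repo add jetstack https://charts.jetstack.io
helm repo update

helm install \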
  cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --version v1.8.1 \
  --set installCRDs=true \
  --wait

# create a lot of secrets, each with fairly large content
if ! kubectl -n default get secret my-secret-100 > /dev/null 2>&1; then
  for i in {1..100}
  do
    cat /dev/urandom | env LC_ALL=C tr -dc 'a-zA-Z0-9' | fold -w 800000 | head -n 1 > file${i}.txt
    kubectl -n default create secret generic my-secret-${i} --from-file=key1=file${i}.txt
    rm file${i}.txt
  done
fi

# create the sandbox namespace and deploy a self-signed Issuer, a CA Certificate, another Issuer using the CA secret,
# and finally a CertificateRequest
kubectl apply -f cert-manager-issuer-case.yaml
# while a cert-manager pod is running and a leader is elected, the CertificateRequest is issued without problems
kubectl -n sandbox wait CertificateRequest istio-csr-sq6jr --for condition=Ready

# however, delete the CertificateRequest and stop the cert-manager leader
kubectl -n sandbox delete CertificateRequest istio-csr-sq6jr
kubectl -n cert-manager scale deploy/cert-manager --replicas=0
sleep 15

# create the CertificateRequest again without a healthy cert-manager leader elected
kubectl apply -f cert-manager-issuer-case.yaml

# once the new leader is elected, the CertificateRequest won't be issued and never becomes Ready
kubectl -n cert-manager scale deploy/cert-manager --replicas=1

sleep 15
kubectl -n sandbox get CertificateRequest -o wide -w
# the output of the command above should show the following error in the status column:
# Referenced secret sandbox/istio-ca not found: secret "istio-ca" not found

Anything else we need to know?:
While analysing the cert-manager code, I found that the issue appears to be a race condition in how the secretsLister cache, which is used to find the CA secret, is populated asynchronously. Under certain conditions, the CA CertificateRequests controller can try to sign a CertificateRequest before the secretsLister cache has been fully populated. That is what happens in my case, where the cluster has more than 400 Secrets, most of them large Secrets containing Helm release data.

Environment details:

  • Kubernetes version: k8s 1.21.x, tested on 1.21.12 and 1.21.1
  • Cloud-provider/provisioner: AWS EKS and Kubernetes Kind
  • cert-manager version: 1.7.1 and 1.8.1
  • Install method: helm

/kind bug
