The Venafi issuer doesn't reset the certificate and gets stuck with "This certificate has encountered errors. Fix any errors, and then click Retry" #6397
Description
With cert-manager 1.11.5, 1.12.3, 1.13.0 and 1.13.1, the /reset call that we added is never made, which means cert-manager never recovers from failures such as a CA outage while renewing a certificate.
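For context, here is a rough sketch of what a manual reset request against TPP might look like. The endpoint path (`vedsdk/certificates/reset`) and the `CertificateDN` and `Restart` fields are my understanding of the TPP Web SDK and should be treated as assumptions, not something confirmed in this issue. The helper names `reset_payload` and `reset_certificate` are hypothetical.

```shell
# Sketch only. Assumptions: the endpoint path and JSON field names below,
# and the helper names reset_payload / reset_certificate (hypothetical).

# Build the JSON body for a reset request.
# $1: certificate DN, e.g. '\VED\Policy\cert-manager\venafidemo.com'
# $2: optional restart flag, "true" or "false" (defaults to "false")
reset_payload() {
  local dn_escaped restart
  # Escape backslashes and double quotes so the DN is a valid JSON string.
  dn_escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  restart=${2:-false}
  printf '{"CertificateDN":"%s","Restart":%s}' "$dn_escaped" "$restart"
}

# Send the reset request. Expects TPP_URL and TOKEN to be set, as in the
# reproduction steps of this issue.
reset_certificate() {
  curl -sS -X POST "$TPP_URL/vedsdk/certificates/reset" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(reset_payload "$1" "$2")"
}
```

Calling `reset_payload '\VED\Policy\cert-manager\venafidemo.com'` prints the body without sending anything, which is handy for comparing against what mitmproxy captures.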
I ran cert-manager in debug mode and used mitmproxy to confirm that the /reset endpoint is never called.
Here is a summary of which versions properly reset, and which ones don't:
| cert-manager | Fixed? | How was it solved? |
|---|---|---|
| 1.9.* | ❌ | |
| 1.10.* | ❌ | |
| 1.11.0 | ❌ | Solution 3 built into VCert 4.23.0 but fails 50% of the time (Venafi/vcert#273) |
| 1.11.1 — 1.11.4 | ✅ | Solution 2 using VCert fork |
| 1.11.5 | ❌ | ResetCertificate in VCert 5.0.0 |
| 1.12.0 — 1.12.2 | ✅ | Solution 2 using VCert fork |
| 1.12.3 — 1.12.5 | ❌ | ResetCertificate in VCert 5.0.0 |
| 1.12.6 and up | ✅ | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0 |
| 1.13.0 — 1.13.1 | ❌ | ResetCertificate in VCert 5.0.0 |
| 1.13.2 and up | ✅ | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0 |
| 1.14.0 and up | ✅ | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0 |
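The table above can also be read as a small lookup. The following is a minimal sketch that encodes exactly the rows of the table, nothing more. The function name `reset_works` is hypothetical, and versions the table does not cover return status 2.

```shell
# Sketch only: encodes the table above. reset_works is a hypothetical name.
# Returns 0 if the table marks the version as resetting properly,
# 1 if it marks it as broken, 2 if the table does not cover it.
reset_works() {
  local v major rest minor patch
  v=${1#v}                    # accept both "1.12.3" and "v1.12.3"
  major=${v%%.*}
  rest=${v#*.}
  minor=${rest%%.*}
  case "$rest" in
    *.*) patch=${rest#*.} ;;
    *)   patch=0 ;;           # "1.12" is treated as "1.12.0"
  esac
  patch=${patch%%[!0-9]*}     # drop suffixes such as "-beta.0"
  patch=${patch:-0}
  [ "$major" = 1 ] || return 2
  case "$minor" in
    9|10) return 1 ;;
    11) [ "$patch" -ge 1 ] && [ "$patch" -le 4 ] ;;
    12) [ "$patch" -le 2 ] || [ "$patch" -ge 6 ] ;;
    13) [ "$patch" -ge 2 ] ;;
    *)  if [ "$minor" -ge 14 ]; then return 0; else return 2; fi ;;
  esac
}
```

For example, `reset_works v1.12.3 || echo "affected or unknown"` prints the message, since the table lists 1.12.3 as broken.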
"Solution 2" and "Solution 3" refer to the solutions in https://hackmd.io/@maelvls/design-reset-cert-when-requesting.
Steps to reproduce:
- First, I ran `sshuttle` to access Venafi's https://tpp.tpp-tests.jetstack.net/:

      sshuttle --ssh-cmd="gcloud compute ssh --project=tpp-tests --zone=europe-west2-c" \
        -l 0.0.0.0 -r bastion 10.132.0.2/32 10.4.0.2/32

- Then, I accessed https://tpp.tpp-tests.jetstack.net/aperture/application-integrations/ using the user `jetstack-platform` (credentials in Venafi's 1Password) and I checked that the "VCert SDK" integration works with the `jetstack-platform` user.

- I created a token and stored it in the env var `TOKEN` with the following:

      TPP_URL=https://tpp.tpp-tests.jetstack.net
      TPP_USER=jetstack-platform
      TPP_PWD=
      TOKEN=$(vcert getcred -u "$TPP_URL" --username "$TPP_USER" --password "$TPP_PWD" --client-id=vcert-sdk --scope='certificate:manage;configuration:manage' --format json | jq -r .access_token)
      TPP_CLIENT_ID=vcert-sdk

- Then I checked out the tag v1.12.3:

      # In the cert-manager repo.
      git checkout v1.12.3

- Then, I installed cert-manager:

      make -j e2e-setup-certmanager
- Then, I scaled down the cert-manager controller so that I can run it locally:

      kubectl scale deployment cert-manager -n cert-manager --replicas=0
- Then, I started mitmproxy:

      mitmproxy -p 9090 --ssl-insecure -s ~/code/kubectl-incluster/watch-stream.py

- Then, I installed `kubectl-incluster`:

      go install github.com/maelvls/kubectl-incluster@latest
      curl -L https://raw.githubusercontent.com/maelvls/kubectl-incluster/main/watch-stream.py >/tmp/watch-stream.py

- Then, I ran Telepresence:

      # Needed, otherwise Telepresence hangs.
      kubectl patch deploy -n cert-manager cert-manager --patch 'spec: {template: {spec: {securityContext: {runAsNonRoot: false}}}}'
      telepresence intercept -n cert-manager cert-manager --mount=false -- bash
- From the Telepresence shell (it doesn't actually matter whether it is inside that shell or not), let's run the cert-manager controller using Delve:

      export HTTPS_PROXY=http://localhost:9090
      dlv debug --api-version=2 --headless -l :2345 ./cmd/controller -- --leader-elect=false -v=4 \
        --kubeconfig=<(kubectl incluster --sa cert-manager/cert-manager --replace-ca-cert ~/.mitmproxy/mitmproxy-ca-cert.pem)
- Then, I created the issuer and certificate:

      kubectl apply -f- <<EOF
      apiVersion: cert-manager.io/v1
      kind: Issuer
      metadata:
        name: issuer-1
      spec:
        venafi:
          zone: '\\VED\\Policy\\cert-manager'
          tpp:
            url: "$TPP_URL"
            credentialsRef:
              name: issuer-1-credentials
      ---
      apiVersion: v1
      kind: Secret
      metadata:
        name: issuer-1-credentials
      stringData:
        access-token: $(vcert getcred -u $TPP_URL \
            --username $TPP_USER \
            --password "$TPP_PWD" \
            --client-id=vcert-sdk \
            --format=json | jq -r .access_token)
      EOF

      kubectl apply -f- <<EOF
      apiVersion: cert-manager.io/v1
      kind: Certificate
      metadata:
        name: cert-1
      spec:
        commonName: venafidemo.com
        dnsNames:
          - venafidemo.com
        issuerRef:
          kind: Issuer
          name: issuer-1
        secretName: cert-1
      EOF
- The certificate should be issued successfully.

- Let's now trigger an enrollment failure at a stage higher than 0 by purposefully breaking the CA. To do that, I RDP'ed into the TPP VM, opened the program `certsrv`, right-clicked on the only entry, chose "Manage CA...", and then clicked the "Stop" button.

- Then, force the renewal of the certificate:

      cmctl renew cert-1
- It should show the error for 1 or 2 seconds:

      500 Certificate \VED\Policy\TLS/SSL\example.com has encountered an error while processing, Status: Post CSR failed with error: Cannot connect to the certificate authority (CA). Verify that your CA template settings are correct and that the remote server is available. For more information, search the Help system for Configuring the Microsoft Certificate Services Template Object., Stage: 500.

- After 3 seconds, cert-manager has already retried and thus has "hidden" the original error (TPP only shows the original error on the first retrieval). The error shown is now:

      500 Certificate \VED\Policy\TLS/SSL\example.com has encountered an error while processing, Status: This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry., Stage: 500.
- I then fixed the CA.

- I then force-renewed:

      cmctl renew cert-1

- The error is still there:

      500 Certificate \VED\Policy\TLS/SSL\example.com has encountered an error while processing, Status: This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry., Stage: 500.
Looking at mitmproxy's UI, you can also see that no /reset call was made.
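If you would rather not eyeball the mitmproxy UI, the same check can be scripted. This is a minimal sketch that assumes you have exported the captured requests to plain text with one `METHOD URL` pair per line (how you export them is up to you, and that format is my assumption). The function name `count_reset_calls` is hypothetical, and the `/certificates/reset` path is my understanding of the TPP endpoint rather than something shown in this issue.

```shell
# Sketch only. Reads "METHOD URL" lines on stdin and prints how many of them
# are POST requests to a path ending in /certificates/reset.
# Assumptions: the input format and the endpoint path; count_reset_calls is
# a hypothetical helper name.
count_reset_calls() {
  # -c counts matching lines; -i tolerates "Certificates/Reset" casing.
  grep -c -i -E '^POST[[:space:]]+[^[:space:]]*/certificates/reset([?[:space:]]|$)'
}
```

With the behaviour described in this issue, piping the captured requests into `count_reset_calls` would be expected to print `0`. Note that `grep -c` exits with status 1 when the count is zero, so check the printed number, not the exit status.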
/kind bug
