
The Venafi issuer doesn't reset the certificate and gets stuck with "This certificate has encountered errors. Fix any errors, and then click Retry" #6397

@maelvls

With cert-manager 1.11.5, 1.12.3, 1.13.0 and 1.13.1, the /reset call that we added is never made, which means cert-manager never recovers from failures such as a CA outage while renewing a certificate.

I ran cert-manager in debug mode and used mitmproxy to confirm that the /reset endpoint is never called:

Screenshot 2023-10-06 at 15 30 14

Here is a summary of which versions properly reset, and which ones don't:

| cert-manager | Fixed? | How was it solved? |
|---|---|---|
| 1.9.* | | |
| 1.10.* | | |
| 1.11.0 | | Solution 3 built into VCert 4.23.0 but fails 50% of the time (Venafi/vcert#273) |
| 1.11.1 to 1.11.4 | | Solution 2 using VCert fork |
| 1.11.5 | | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0. Doesn't work, see #6397 |
| 1.12.0 to 1.12.2 | | Solution 2 using VCert fork |
| 1.12.3 to 1.12.5 | | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0. Doesn't work, see #6397 |
| 1.12.6 and up | | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0 |
| 1.13.0 to 1.13.1 | | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0. Doesn't work, see #6397 |
| 1.13.2 and up | | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0 |
| 1.14.0 and up | | Solution 2 ad-hoc using ResetCertificate in VCert 5.0.0 |

"Solution 2" and "Solution 3" refer to the solutions in https://hackmd.io/@maelvls/design-reset-cert-when-requesting.
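
To make the table easier to use from a script, here is a minimal sketch of a version check. The function name `is_affected` is hypothetical, and the version list is simply transcribed from the rows marked "Doesn't work, see #6397"; it does not query cert-manager or TPP in any way.

```shell
#!/usr/bin/env bash
# is_affected VERSION
# Returns 0 if VERSION (for example "1.12.3" or "v1.12.3") is one of the
# releases that the table lists as "Doesn't work, see #6397", 1 otherwise.
# Hypothetical helper: the list below is copied by hand from the table.
is_affected() {
  local version="${1#v}"   # tolerate a leading "v", as in git tags
  case "$version" in
    1.11.5)               return 0 ;;
    1.12.3|1.12.4|1.12.5) return 0 ;;
    1.13.0|1.13.1)        return 0 ;;
    *)                    return 1 ;;
  esac
}
```

For example, `is_affected v1.12.3 && echo "affected by #6397"` prints the message, while `is_affected v1.12.6` returns a non-zero status.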

Steps to reproduce:

  1. First, I ran sshuttle to access Venafi's
    https://tpp.tpp-tests.jetstack.net/:

    sshuttle --ssh-cmd="gcloud compute ssh --project=tpp-tests --zone=europe-west2-c" \
       -l 0.0.0.0 -r bastion 10.132.0.2/32 10.4.0.2/32
  2. Then, I accessed
    https://tpp.tpp-tests.jetstack.net/aperture/application-integrations/ using
    the user jetstack-platform (credentials in Venafi's 1Password), and I checked that the "VCert SDK" integration works for that user.

  3. I created a token and stored it in the env var TOKEN with the following:

    TPP_URL=https://tpp.tpp-tests.jetstack.net
    TPP_USER=jetstack-platform
    # The password is intentionally left blank here.
    TPP_PWD=
    TPP_CLIENT_ID=vcert-sdk
    # Quote the variables so that special characters in the password survive.
    TOKEN=$(vcert getcred -u "$TPP_URL" --username "$TPP_USER" --password "$TPP_PWD" --client-id="$TPP_CLIENT_ID" --scope='certificate:manage;configuration:manage' --format json | jq -r .access_token)
  4. Then I checked out the tag v1.12.3:

    # In the cert-manager repo.
    git checkout v1.12.3
  5. Then, I installed cert-manager:

    make -j e2e-setup-certmanager
  6. Then, I scaled down the cert-manager controller so that I can run it locally:

    kubectl scale deployment cert-manager -n cert-manager --replicas=0
  7. Then, I started mitmproxy:

    mitmproxy -p 9090 --ssl-insecure -s ~/code/kubectl-incluster/watch-stream.py
  8. Then, I installed kubectl-incluster:

    go install github.com/maelvls/kubectl-incluster@latest
    curl -L https://raw.githubusercontent.com/maelvls/kubectl-incluster/main/watch-stream.py >/tmp/watch-stream.py
  9. Then, I ran Telepresence:

    # Needed, otherwise Telepresence hangs.
    kubectl patch deploy -n cert-manager cert-manager --patch 'spec: {template: {spec: {securityContext: {runAsNonRoot: false}}}}'
    telepresence intercept -n cert-manager cert-manager --mount=false -- bash
  10. From the Telepresence shell (any other shell works too), I ran the
    cert-manager controller using Delve, with HTTPS traffic going through
    mitmproxy:

    export HTTPS_PROXY=http://localhost:9090
    dlv debug --api-version=2 --headless -l :2345 ./cmd/controller -- --leader-elect=false -v=4 \
      --kubeconfig=<(kubectl incluster --sa cert-manager/cert-manager --replace-ca-cert ~/.mitmproxy/mitmproxy-ca-cert.pem)
  11. Then, I created the issuer and certificate:

    kubectl apply -f- <<EOF
    apiVersion: cert-manager.io/v1
    kind: Issuer
    metadata:
      name: issuer-1
    spec:
      venafi:
        zone: '\\VED\\Policy\\cert-manager'
        tpp:
          url: "$TPP_URL"
          credentialsRef:
            name: issuer-1-credentials
    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: issuer-1-credentials
    stringData:
      access-token: $(vcert getcred -u $TPP_URL \
        --username $TPP_USER \
        --password "$TPP_PWD" \
        --client-id=vcert-sdk \
        --format=json | jq -r .access_token)
    EOF

    kubectl apply -f- <<EOF
    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: cert-1
    spec:
      commonName: venafidemo.com
      dnsNames:
        - venafidemo.com
      issuerRef:
        kind: Issuer
        name: issuer-1
      secretName: cert-1
    EOF
  12. The certificate should be issued successfully.

  13. Next, I triggered an enrollment failure at a stage higher than 0 by
    purposefully breaking the CA. To do that, I RDP'ed into the TPP VM, opened
    the program certsrv, right-clicked on the only entry, chose "Manage CA...",
    and then clicked the "Stop" button.

  14. Then, I forced the renewal of the certificate:

    cmctl renew cert-1
  15. It should show the error for 1 or 2 seconds:

    500 Certificate \VED\Policy\TLS/SSL\example.com has encountered an error while processing, Status: Post CSR failed with error: Cannot connect to the certificate authority (CA). Verify that your CA template settings are correct and that the remote server is available. For more information, search the Help system for Configuring the Microsoft Certificate Services Template Object., Stage: 500.

  16. After 3 seconds, cert-manager has already retried and thus has "hidden" the
    original error (TPP only shows the original error on the first retrieval).
    The error shown is now:

    500 Certificate \VED\Policy\TLS/SSL\example.com has encountered an error while processing, Status: This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry., Stage: 500.

  17. I then fixed the CA.

  18. I then force-renewed:

    cmctl renew cert-1
  19. The error is still there:

    500 Certificate \VED\Policy\TLS/SSL\example.com has encountered an error while processing, Status: This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry., Stage: 500.
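
The difference between the original error (step 15) and the "stuck" error (steps 16 and 19) can be checked mechanically. The sketch below is only an illustration: the function name `is_stuck_in_error_state` is hypothetical, and it matches on the message text quoted in the steps above rather than on any documented TPP error code.

```shell
#!/usr/bin/env bash
# is_stuck_in_error_state MESSAGE
# Returns 0 when MESSAGE contains the TPP text that keeps being returned
# after the CA has been fixed (steps 16 and 19), 1 otherwise.
# Hypothetical helper: the matched substring is copied from the error above.
is_stuck_in_error_state() {
  case "$1" in
    *"cannot be processed while it is in an error state"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Passing the message from step 19 returns 0, whereas the "Post CSR failed" message from step 15 returns 1, which makes it possible to tell whether the original error has already been masked by a retry.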

Looking at mitmproxy's UI, you can also see that no /reset call was made.
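
If the captured requests are available as text, the same check can be scripted instead of read off the UI. This is a sketch under a strong assumption: the input file is a plain-text list with one "METHOD URL" entry per line, which is a hypothetical export format and not something mitmproxy writes by itself. The function name `count_reset_calls` is also hypothetical.

```shell
#!/usr/bin/env bash
# count_reset_calls FILE
# Prints how many lines of FILE mention a URL whose path ends in "/reset".
# Assumption: FILE holds one "METHOD URL" entry per line (hypothetical format).
count_reset_calls() {
  # grep -c still prints 0 when nothing matches but exits with status 1,
  # so the status is masked to keep the function usable under "set -e".
  grep -cE '/reset([?[:space:]]|$)' "$1" || true
}
```

With the behaviour described in this issue, the function would be expected to print 0 for the requests captured during steps 14 to 19.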

/kind bug
