Page MenuHomePhabricator

cert-manager created multiple CertificateRequest objects with the same certificate-revision
Closed, ResolvedPublic

Description

While looking at T303932 I noticed that the toolhub certificate in staging-codfw is expired since >30 days.

From the data in API objects is seems as if the certificate was triggered for renewal 2022-02-06T17:00:03Z which led to the creation of CertificateRequest/toolhub-l8xjm first (2022-02-06T17:01:59Z) and CertificateRequest/toolhub-wvz2q second (2022-02-06T17:04:19Z), both sharing the same cert-manager.io/certificate-revision: 49 and cert-manager.io/private-key-secret-name: toolhub-rrvtj . See attached YAML files (which have certificate data redacted).

During that time we had some pretty elevated latency on the kubernetes apiserver (mainly CREATE and UPDATE on cert-manager.io/certificaterequest resources and apparently some connectivity issues with the kubernetes apiserver, as seen in cert-manager.log below.

Deleting both certificaterequest objects fixed the situation. A new one was created (again cert-manager.io/certificate-revision: 49 and cert-manager.io/private-key-secret-name: toolhub-rrvtj), so the certificate got refreshed.

cert-manager.log
I0206 17:00:03.399283       1 conditions.go:201] Setting lastTransitionTime for Certificate "toolhub" condition "Issuing" to 2022-02-06 17:00:03.399274083 +0000 UTC m=+10735.693974692
I0206 17:00:03.399216       1 trigger_controller.go:181] cert-manager/controller/certificates-trigger "msg"="Certificate must be re-issued" "key"="istio-system/toolhub" "message"="Renewing certificate as renewal was scheduled at 2022-02-06 17:00:00 +0000 UTC" "reason"="Renewing"
E0206 17:00:20.630289       1 controller.go:163] cert-manager/controller/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": Post https://cert-manager-webhook.cert-manager.svc:443/validate?timeout=10s: context deadline exceeded" "key"="istio-system/toolhub" 
E0206 17:00:31.650166       1 controller.go:163] cert-manager/controller/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s: context deadline exceeded" "key"="istio-system/toolhub" 
E0206 17:00:43.668839       1 controller.go:163] cert-manager/controller/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s: context deadline exceeded" "key"="istio-system/toolhub" 
E0206 17:00:48.729312       1 controller.go:163] cert-manager/controller/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s: dial tcp 10.192.76.185:443: connect: connection refused" "key"="istio-system/toolhub" 
E0206 17:00:57.816629       1 controller.go:163] cert-manager/controller/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s: dial tcp 10.192.76.185:443: connect: connection refused" "key"="istio-system/toolhub" 
E0206 17:02:20.402566       1 controller.go:163] cert-manager/controller/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="failed whilst waiting for CertificateRequest to exist - this may indicate an apiserver running slowly. Request will be retried" "key"="istio-system/toolhub" 
E0206 17:06:26.899074       1 controller.go:163] cert-manager/controller/certificates-request-manager "msg"="re-queuing item due to error processing" "error"="failed whilst waiting for CertificateRequest to exist - this may indicate an apiserver running slowly. Request will be retried" "key"="istio-system/toolhub" 
I0206 17:07:34.907870       1 trace.go:205] Trace[861803746]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167 (06-Feb-2022 17:07:21.603) (total time: 13304ms):
Trace[861803746]: [13.30407288s] [13.30407288s] END
Trace[861803746]: ---"Objects listed" 13303ms (17:07:34.906)
I0206 17:07:57.199752       1 trace.go:205] Trace[782326899]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167 (06-Feb-2022 17:07:35.007) (total time: 22100ms):
Trace[782326899]: [22.100469212s] [22.100469212s] END
Trace[782326899]: ---"Objects listed" 22095ms (17:07:57.102)
I0206 17:08:05.299156       1 conditions.go:261] Setting lastTransitionTime for CertificateRequest "toolhub-l8xjm" condition "Approved" to 2022-02-06 17:08:04.500992384 +0000 UTC m=+44.594675153
I0206 17:08:05.199563       1 conditions.go:261] Setting lastTransitionTime for CertificateRequest "toolhub-wvz2q" condition "Approved" to 2022-02-06 17:08:03.500823508 +0000 UTC m=+43.594506243
I0206 17:08:05.904575       1 trace.go:205] Trace[850423915]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167 (06-Feb-2022 17:07:35.006) (total time: 30897ms):
Trace[850423915]: ---"Objects listed" 30895ms (17:08:05.902)
Trace[850423915]: [30.897705171s] [30.897705171s] END
I0206 17:08:08.699125       1 requestmanager_controller.go:210] cert-manager/controller/certificates-request-manager "msg"="Multiple matching CertificateRequest resources exist, delete one of them. This is likely an error and should be reported on the issue tracker!" "key"="istio-system/toolhub" 
I0206 17:08:09.201178       1 requestmanager_controller.go:210] cert-manager/controller/certificates-request-manager "msg"="Multiple matching CertificateRequest resources exist, delete one of them. This is likely an error and should be reported on the issue tracker!" "key"="istio-system/toolhub" 
I0206 17:08:09.300472       1 requestmanager_controller.go:210] cert-manager/controller/certificates-request-manager "msg"="Multiple matching CertificateRequest resources exist, delete one of them. This is likely an error and should be reported on the issue tracker!" "key"="istio-system/toolhub" 
E0206 17:08:10.599180       1 controller.go:163] cert-manager/controller/certificates-readiness "msg"="re-queuing item due to error processing" "error"="multiple CertificateRequests were found for the 'next' revision 49, issuance is skipped until there are no more duplicates" "key"="istio-system/toolhub" 
E0206 17:08:13.400912       1 controller.go:163] cert-manager/controller/certificates-readiness "msg"="re-queuing item due to error processing" "error"="multiple CertificateRequests were found for the 'next' revision 49, issuance is skipped until there are no more duplicates" "key"="istio-system/toolhub"
cfss-issuer.log
2022-02-06T17:02:00.026484+00:00 controller.certificaterequest istio-system toolhub-l8xjm CertificateRequest has not been approved yet. Ignoring.
2022-02-06T17:02:00.423743+00:00 controller.certificaterequest istio-system toolhub-l8xjm CertificateRequest has not been approved yet. Ignoring.
2022-02-06T17:04:19.921125+00:00 controller.certificaterequest istio-system toolhub-wvz2q CertificateRequest has not been approved yet. Ignoring.
2022-02-06T17:04:20.280381+00:00 controller.certificaterequest istio-system toolhub-wvz2q CertificateRequest has not been approved yet. Ignoring.
2022-02-06T17:08:06.328312+00:00 controller.certificaterequest istio-system toolhub-wvz2q Initialising Ready condition
2022-02-06T17:08:07.376063+00:00 controller.certificaterequest istio-system toolhub-l8xjm Initialising Ready condition
2022-02-06T17:08:07.441243+00:00 controller.certificaterequest istio-system toolhub-wvz2q Signing cert with k8s_staging discovery true
2022-02-06T17:08:08.047227+00:00 controller.certificaterequest istio-system toolhub-l8xjm Signing cert with k8s_staging discovery true
2022-02-06T17:08:08.536874+00:00 controller.certificaterequest istio-system toolhub-l8xjm CertificateRequest is Ready. Ignoring.
2022-02-06T17:08:08.536832+00:00 controller.certificaterequest istio-system toolhub-wvz2q CertificateRequest is Ready. Ignoring.

Action items:

  • Add an alert for expiring certificates

Upstream issue:

Event Timeline

JMeybohm triaged this task as Medium priority.
JMeybohm renamed this task from cert-manager created multiple CertificateRequest objects for the same next revision to cert-manager created multiple CertificateRequest objects with the same next revision.Mar 17 2022, 4:15 PM
JMeybohm renamed this task from cert-manager created multiple CertificateRequest objects with the same next revision to cert-manager created multiple CertificateRequest objects with the same certificate-revision.Mar 17 2022, 4:44 PM
JMeybohm updated the task description. (Show Details)

Change 771687 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Introduce cert-manager alerts

https://gerrit.wikimedia.org/r/771687

Change 771687 merged by JMeybohm:

[operations/alerts@master] Introduce cert-manager alerts

https://gerrit.wikimedia.org/r/771687

JMeybohm lowered the priority of this task from Medium to Low.Mar 21 2022, 1:56 PM
JMeybohm updated the task description. (Show Details)

Lowering the priority as alerts are in place now and I don't have high hopes regarding the upstream issue as our cert-manager version (the last one compatible with k8s 1.16) is EOL since January.

Change 775277 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/docker-images/production-images@master] Add controller_sync_error_count metric

https://gerrit.wikimedia.org/r/775277

Change 775277 merged by JMeybohm:

[operations/docker-images/production-images@master] Add controller_sync_error_count metric

https://gerrit.wikimedia.org/r/775277

Change 776953 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Update cert-manager in staging to 1.5.4-3

https://gerrit.wikimedia.org/r/776953

Change 776971 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Update cert-manager to 1.5.4-3

https://gerrit.wikimedia.org/r/776971

Change 776953 merged by jenkins-bot:

[operations/deployment-charts@master] Update cert-manager in staging to 1.5.4-3

https://gerrit.wikimedia.org/r/776953

Change 776971 merged by jenkins-bot:

[operations/deployment-charts@master] Update cert-manager to 1.5.4-3

https://gerrit.wikimedia.org/r/776971

Change 896111 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] cert-manager: Enable stable certificate request names in staging

https://gerrit.wikimedia.org/r/896111

Change 896111 merged by jenkins-bot:

[operations/deployment-charts@master] cert-manager: Enable stable certificate request names in staging

https://gerrit.wikimedia.org/r/896111

Change 897950 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] cert-manager: Actually run 1.10.1 with chart version 1.10.x

https://gerrit.wikimedia.org/r/897950

Change 897950 merged by jenkins-bot:

[operations/deployment-charts@master] cert-manager: Actually run 1.10.1 with chart version 1.10.x

https://gerrit.wikimedia.org/r/897950

Change 898686 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] cert-manager: Actually run 1.10.1 with chart version 1.10.x

https://gerrit.wikimedia.org/r/898686

Change 898686 merged by jenkins-bot:

[operations/deployment-charts@master] cert-manager: Actually run 1.10.1 with chart version 1.10.x

https://gerrit.wikimedia.org/r/898686

Change 898770 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] cert-manager: Run cert-manager 1.10.1 in ml-staging

https://gerrit.wikimedia.org/r/898770

Change 898770 merged by jenkins-bot:

[operations/deployment-charts@master] cert-manager: Run cert-manager 1.10.1 in ml-staging

https://gerrit.wikimedia.org/r/898770

This is running in wikikube and ml-staging now. Will give it a couple of days (e.g. cert refreshes) and deploy to prod then.

Change 899515 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] cert-manager: Update cert-manager to 1.10.1

https://gerrit.wikimedia.org/r/899515

Change 899515 merged by jenkins-bot:

[operations/deployment-charts@master] cert-manager: Update cert-manager to 1.10.1

https://gerrit.wikimedia.org/r/899515

This is running in wikikube and ml-staging now. Will give it a couple of days (e.g. cert refreshes) and deploy to prod then.

A couple of cert refreshes (and additions) passed already so I rolled this out to all clusters.