Uncontrolled cardinality for Prometheus metrics certmanager_http_acme_client_request_* #7883
Description
Describe the bug:
The cardinality of the ACME client request Prometheus metrics can grow without bound.
This happens because the `path` label contains IDs for some paths, so every issued certificate adds 2 new `path` values.
This problem showed up in our cluster, where we manage a lot of ACME certificates using Google PKI.
The order and authorization paths differ slightly between Let's Encrypt and Google PKI.
Let's Encrypt order URIs look like /acme/orders/<id>, but Google PKI order URIs look like /orders/<id>.
The logic that transforms the path simply keeps the first 2 segments, which does not work for every ACME implementation, since servers can freely choose the URLs for everything except the discovery URL.
Because of how Prometheus metrics work, these single-use series stay around until the cert-manager-controller is restarted.
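To illustrate the truncation described above, here is a minimal sketch. It is not cert-manager's actual code: `firstTwoSegments` is a hypothetical stand-in that assumes the path is reduced to its first two segments, and the order ID is made up.

```go
package main

import (
	"fmt"
	"strings"
)

// firstTwoSegments is a hypothetical stand-in for the path transformation
// described in this issue: it keeps at most the first two path segments.
func firstTwoSegments(path string) string {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) > 2 {
		parts = parts[:2]
	}
	return "/" + strings.Join(parts, "/")
}

func main() {
	// Three-segment layout: the ID is the third segment and is dropped.
	fmt.Println(firstTwoSegments("/acme/orders/12345"))
	// Two-segment layout: the ID is the second segment and stays in the label.
	fmt.Println(firstTwoSegments("/orders/12345"))
}
```

With the two-segment layout, every order produces a distinct label value, which is where the unbounded growth comes from.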
The following metrics are affected:
- certmanager_http_acme_client_request_duration_seconds
- certmanager_http_acme_client_request_count
- certmanager_http_acme_client_request_duration_seconds_count
- certmanager_http_acme_client_request_duration_seconds_sum
Expected behaviour:
I would expect only a single `path` label value to be exposed for each ACME operation.
Steps to reproduce the bug:
- Set up cert-manager
- Configure an ACME issuer using Google PKI EAB credentials (See https://cloud.google.com/certificate-manager/docs/public-ca-tutorial)
- Issue a certificate for a domain
- Check the exported Prometheus metrics; notice that there is a `certmanager_http_acme_client_request_duration_seconds` metric with an ID in the `path` label.
Anything else we need to know?:
I am afraid that this is not straightforward to fix at the HTTP client level, because the HTTP client has no visibility into what each path means. It would also be quite difficult to detect which path components are dynamic, because IDs don't have to follow a fixed structure.
I think it might be more interesting to instead instrument on the ACME client interface level, and use the logical operation as a label instead of the specific path.
Environment details:
- Kubernetes version: 1.31.7
- Cloud-provider/provisioner: Scaleway
- cert-manager version: 1.17.2
- Install method: helm
/kind bug
CyberArk tracker: VC-45083