Uncontrolled cardinality for Prometheus metrics certmanager_http_acme_client_request_* #7883

@vierbergenlars

Describe the bug:

The cardinality of the ACME client request Prometheus metrics can grow without bound.
This happens because the path label contains IDs for some paths, so every issued certificate adds 2 new path values.

This problem showed up in our cluster, where we manage a lot of ACME certificates using Google PKI.

The order and authorization paths differ slightly between Let's Encrypt and Google PKI.

Let's Encrypt order URIs look like /acme/orders/<id>, but Google PKI order URIs look like /orders/<id>.

The logic that transforms the path simply keeps the first 2 segments. That does not strip the ID for every ACME implementation, since servers can freely choose the exact URLs for everything but the discovery URL.
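To illustrate the truncation described above, here is a minimal sketch. The function firstTwoSegments is a hypothetical stand-in written for this issue, not cert-manager's actual code, and the IDs are made up; only the two URI shapes come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// firstTwoSegments is a hypothetical stand-in for the path-shortening
// behaviour described in this issue (keep only the first 2 path segments).
// It is NOT copied from cert-manager.
func firstTwoSegments(path string) string {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) > 2 {
		parts = parts[:2]
	}
	return "/" + strings.Join(parts, "/")
}

func main() {
	// Made-up IDs; the URI shapes follow the examples in the issue text.
	for _, p := range []string{
		"/acme/orders/12345", // three segments: the ID is dropped
		"/orders/12345",      // two segments: the ID survives
	} {
		fmt.Printf("%-20s -> %s\n", p, firstTwoSegments(p))
	}
}
```

With the three-segment shape the label collapses to /acme/orders, but with the two-segment shape the full /orders/12345 is kept, so each new order produces a new label value.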

Because of how Prometheus metrics work, these single-use series stay around until the cert-manager-controller is restarted.

The following metrics are affected:

  • certmanager_http_acme_client_request_duration_seconds
  • certmanager_http_acme_client_request_count
  • certmanager_http_acme_client_request_duration_seconds_count
  • certmanager_http_acme_client_request_duration_seconds_sum

Expected behaviour:

I would expect only a single path label value to be exposed for every ACME operation.

Steps to reproduce the bug:

  1. Set up cert-manager
  2. Configure an ACME issuer using Google PKI EAB credentials (See https://cloud.google.com/certificate-manager/docs/public-ca-tutorial)
  3. Issue a certificate for a domain
  4. Check the exported Prometheus metrics; notice that there is a certmanager_http_acme_client_request_duration_seconds metric with an ID in the path label.

Anything else we need to know?:

I am afraid that this is not really straightforward to fix at the HTTP client level, because the HTTP client has no visibility into the meaning of each path. It would also be quite difficult to detect which path components are actually dynamic, because IDs don't have to follow a fixed structure.

I think it might be more interesting to instead instrument on the ACME client interface level, and use the logical operation as a label instead of the specific path.

Environment details:

  • Kubernetes version: 1.31.7
  • Cloud-provider/provisioner: Scaleway
  • cert-manager version: 1.17.2
  • Install method: helm

/kind bug

CyberArk tracker: VC-45083

Metadata

Labels

  • area/acme: Indicates a PR directly modifies the ACME Issuer code.
  • cybr: Used by CyberArk-employed maintainers to report to line management what's being worked on.
  • good first issue: Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/backlog: Higher priority than priority/awaiting-more-evidence.
