
[7883] - Fix uncontrolled cardinality for Prometheus metrics#8109

Merged
cert-manager-prow[bot] merged 11 commits into cert-manager:master from
mladen-rusev-cyberark:7883-uncontrolled-cardinality-prometheus-metrics
Oct 2, 2025

Conversation

@mladen-rusev-cyberark

@mladen-rusev-cyberark mladen-rusev-cyberark commented Sep 24, 2025

This PR addresses issue #7883 where the ACME client's Prometheus metrics could grow with unbounded cardinality, leading to performance degradation and high memory usage in Prometheus.

Rationale

The existing instrumentation for ACME client metrics used the request's URL path as a label. A pathProcessor function attempted to normalize this by taking the first two URL segments. While this heuristic worked for ACME CAs like Let's Encrypt (e.g., /acme/orders/<id>), it failed for others like Google Public CA, whose URL structures are different (e.g., /orders/<id>).

This divergence meant that for some CAs, unique order/challenge identifiers were still included in the path label, creating a new time series for nearly every step of the certificate issuance process.

This change fixes the issue by replacing the brittle, high-cardinality path label with a new, low-cardinality action label. The action is a string representing the logical ACME operation being performed (e.g., get_challenge, authorize_order), ensuring a stable and predictable number of metric series regardless of the CA's URL structure.
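The failure mode described above can be reproduced with a small sketch. The function below is a hypothetical stand-in for the removed pathProcessor, not the original code; it only assumes the "keep the first two URL segments" behaviour described in the rationale, and the paths are the two examples given there.

```go
package main

import (
	"fmt"
	"strings"
)

// firstTwoSegments is a hypothetical stand-in for the removed pathProcessor:
// it keeps at most the first two segments of a URL path.
func firstTwoSegments(path string) string {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) > 2 {
		parts = parts[:2]
	}
	return "/" + strings.Join(parts, "/")
}

func main() {
	// Let's Encrypt-style path: the ID is the third segment and is dropped.
	fmt.Println(firstTwoSegments("/acme/orders/12345")) // /acme/orders

	// Google Public CA-style path: the ID is the second segment and
	// survives, so every order yields a distinct label value.
	fmt.Println(firstTwoSegments("/orders/12345")) // /orders/12345
}
```

With the second URL shape, the number of label values grows with the number of orders, which is the unbounded cardinality this PR removes.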

Implementation Details

  1. Context Propagation via Middleware: The logical ACME operation name is now added to the context.Context within the Logger middleware (pkg/acme/client/middleware/logger.go). This cleanly separates the instrumentation concern from the underlying ACME client logic.
  2. Metric Instrumentation: The instrumented HTTP RoundTripper in pkg/acme/client/http.go now reads the action value from the context using the AcmeActionLabel key.
  3. Label Update: If the action is present, it's used as a label. A fallback of unnamed_action is used for any requests where the context value is not set, ensuring robustness. The brittle pathProcessor function has been removed.
  4. Testing: New unit tests have been added in pkg/acme/client/http_test.go to validate the new labeling scheme, including context handling, the fallback mechanism, and counter accumulation, using the prometheus/testutil package for precise validation.
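The flow in the steps above can be sketched roughly as follows, using only the standard library. This is an illustration under assumptions, not the code in this PR: actionKey, withAction, actionFrom and countingRT are hypothetical names, a plain map stands in for the Prometheus counter, and a fake transport replaces the real ACME server. Only the action label idea and the unnamed_action fallback come from the description.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

// actionKey is a hypothetical context key (the PR names its key AcmeActionLabel).
type actionKey struct{}

// withAction returns a context carrying the logical ACME operation name.
func withAction(ctx context.Context, action string) context.Context {
	return context.WithValue(ctx, actionKey{}, action)
}

// actionFrom reads the action and falls back to "unnamed_action" when unset.
func actionFrom(ctx context.Context) string {
	if a, ok := ctx.Value(actionKey{}).(string); ok && a != "" {
		return a
	}
	return "unnamed_action"
}

// roundTripperFunc adapts a function to http.RoundTripper.
type roundTripperFunc func(*http.Request) (*http.Response, error)

func (f roundTripperFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// countingRT counts requests per label set; the map stands in for a CounterVec.
// It is not safe for concurrent use, which is enough for this sketch.
type countingRT struct {
	next   http.RoundTripper
	counts map[string]int
}

func (c *countingRT) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := c.next.RoundTrip(req)
	if resp != nil {
		key := fmt.Sprintf("action=%s,method=%s,status=%d",
			actionFrom(req.Context()), req.Method, resp.StatusCode)
		c.counts[key]++
	}
	return resp, err
}

// run sends three requests through the instrumented transport and returns the counts.
func run() map[string]int {
	// Fake transport: always answers 200 without touching the network.
	fake := roundTripperFunc(func(r *http.Request) (*http.Response, error) {
		return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Request: r}, nil
	})
	rt := &countingRT{next: fake, counts: map[string]int{}}
	client := &http.Client{Transport: rt}

	send := func(ctx context.Context, url string) {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			panic(err)
		}
		resp, err := client.Do(req)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
	}

	// Different order IDs share one series because the action is the same.
	send(withAction(context.Background(), "get_order"), "http://acme.example/orders/1")
	send(withAction(context.Background(), "get_order"), "http://acme.example/orders/2")
	// No action in the context: the fallback label value is used.
	send(context.Background(), "http://acme.example/directory")
	return rt.counts
}

func main() {
	for labels, n := range run() {
		fmt.Println(labels, n)
	}
}
```

In this sketch the two order requests collapse into a single series and the unlabelled request lands under unnamed_action, so the series count depends on the set of action names rather than on the CA's URL layout.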

BREAKING CHANGE

This change modifies the labels of core ACME client metrics and will require users to update their monitoring dashboards and alerting rules if using those metrics.

Output from changed metrics test pkg/acme/client/http_test.go

from github.com/prometheus/client_golang@v1.23.2/prometheus/testutil/testutil.go:303

# HELP certmanager_http_acme_client_request_count The number of requests made by the ACME client.
# TYPE certmanager_http_acme_client_request_count counter
certmanager_http_acme_client_request_count{action="get_directory",host="127.0.0.1:34759",method="GET",scheme="http",status="200"} 1
# HELP certmanager_http_acme_client_request_count The number of requests made by the ACME client.
# TYPE certmanager_http_acme_client_request_count counter
certmanager_http_acme_client_request_count{action="finalize_order",host="127.0.0.1:34091",method="POST",scheme="http",status="200"} 3

Pull Request Motivation

Kind

/kind bug

Release Note

This change removes the `path` label of core ACME client metrics and will require users to update their monitoring dashboards and alerting rules if using those metrics.

CyberArk tracker: VC-45587

…add tests

Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
@cert-manager-prow cert-manager-prow bot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. area/acme Indicates a PR directly modifies the ACME Issuer code area/monitoring Indicates a PR or issue relates to monitoring needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 24, 2025
@cert-manager-prow
Contributor

Hi @mladen-rusev-cyberark. Thanks for your PR.

I'm waiting for a cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@cert-manager-prow cert-manager-prow bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 24, 2025
@mladen-rusev-cyberark mladen-rusev-cyberark marked this pull request as ready for review September 24, 2025 10:08
@cert-manager-prow cert-manager-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 24, 2025
@erikgb
Member

erikgb commented Sep 24, 2025

/ok-to-test

@cert-manager-prow cert-manager-prow bot added ok-to-test and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 24, 2025
Mladen Rusev added 3 commits September 25, 2025 10:37
Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
Mladen Rusev added 2 commits September 25, 2025 17:22
Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
@wallrj wallrj requested a review from Copilot September 26, 2025 12:33
Contributor

Copilot AI left a comment


Pull Request Overview

This PR addresses a critical issue where ACME HTTP metrics had unbounded cardinality due to using dynamic URL paths as Prometheus labels. The fix replaces the high-cardinality path label with a bounded action label derived from logical ACME operations, significantly improving Prometheus performance and memory efficiency.

Key changes include:

  • Replace path label with action label in ACME metrics to prevent cardinality explosion
  • Implement context-based action propagation from ACME client methods to HTTP instrumentation
  • Add comprehensive test coverage for the new metric labeling behavior

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| pkg/metrics/metrics.go | Updates ACME metric definitions to use action instead of path labels and adds public accessors for testing |
| pkg/acme/client/middleware/logger.go | Injects bounded action names into request contexts for all ACME operations |
| pkg/acme/client/http_test.go | Adds comprehensive tests validating metric labeling and accumulation across different scenarios |
| pkg/acme/client/http.go | Modifies HTTP transport to extract action from context instead of processing URL paths |


Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
@wallrj-cyberark wallrj-cyberark added the cybr Used by CyberArk-employed maintainers to report to line management what's being worked on. label Sep 26, 2025
@cert-manager-prow cert-manager-prow bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Sep 26, 2025
Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
@mladen-rusev-cyberark
Author

@erikgb Hi Erik, we were discussing today how we would like this to be tested. Do you think the added unit tests are exhaustive enough?

Member

@wallrj-cyberark wallrj-cyberark left a comment


After digging into this, I wonder if we should just drop the path label or set its value to "deprecated". What use is it? And what use will the action label be?

Suggest writing a paragraph explaining how you expect users will make use of the new action label. Will it actually be useful to know the rate and average request duration of get_order requests?

Here's the original motivation for adding the metrics:

And I've linked elsewhere to the original discussion about the path label when these metrics were first introduced.

Member

@wallrj-cyberark wallrj-cyberark left a comment


I forgot to link to this which I found while reading the background to this PR.
The prometheus client_golang module has various helpers for instrumenting the http client,
including a mechanism for capturing labels from the context of the request:

I'm not necessarily suggesting using that here in this PR, but perhaps in future we could adopt the promhttp helper functions in place of our own metrics round tripper.

Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
@mladen-rusev-cyberark
Author

How to proceed with this PR was discussed during the open source standup today. I will try to summarize the discussion and the PR comments so as not to lose the added context and resources which @wallrj-cyberark provided.

  1. The path label was introduced in the first place to help solve an issue where cert-manager was making a massive number of requests to services like Let's Encrypt - link. The author of that PR pointed out the exact problem which this PR aims to fix: the path label may capture the ID in the URL. We questioned the overall usefulness of these metrics and whether they are of interest to cert-manager users.
  2. We discussed options to avoid a breaking change by deprecating the path label first. In the end we decided to remove the path label outright, document that in the release notes as a breaking change, and update the help text of the metrics.
  3. Richard pointed out ways users can use relabeling to solve issues with high-cardinality labels here.

Some things found which are out of scope:

  1. We ignore the error when writing the metrics. The status code will be 999 in such cases, but it is unclear what the significance of 999 is. - link
	// Make the request using the wrapped RoundTripper.
	resp, err := it.wrappedRT.RoundTrip(req)
	if resp != nil {
		statusCode = resp.StatusCode
	}
  2. A future consideration

The prometheus client_golang module has various helpers for instrumenting the http client,
including a mechanism for capturing labels from the context of the request:
https://github.com/prometheus/client_golang/blob/main/prometheus/promhttp/option_test.go#L73-L128
I'm not necessarily suggesting using that here in this PR, but perhaps in future we could adopt the promhttp helper functions in place of our own metrics round tripper.

Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
@wallrj-cyberark
Member

wallrj-cyberark commented Oct 2, 2025

Was that just a test flake? It seems suspicious that the metrics-related tests should fail.

{Failed  panic: test timed out after 10m0s
	running tests:
		TestMetricsController (8m40s)

goroutine 27935 [running]:

Triggering a retest anyway to see if it was a flake.

/retest

…nitely otherwise)

Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
@cert-manager-prow cert-manager-prow bot added the area/testing Issues relating to testing label Oct 2, 2025
@wallrj
Member

wallrj commented Oct 2, 2025

/approve
/lgtm

@cert-manager-prow cert-manager-prow bot added the lgtm Indicates that a PR is ready to be merged. label Oct 2, 2025
@cert-manager-prow
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wallrj, wallrj-cyberark

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cert-manager-prow cert-manager-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 2, 2025
@cert-manager-prow cert-manager-prow bot merged commit a14a101 into cert-manager:master Oct 2, 2025
6 checks passed
alexlebens pushed a commit to alexlebens/infrastructure that referenced this pull request Oct 8, 2025
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [cert-manager](https://cert-manager.io) ([source](https://github.com/cert-manager/cert-manager)) | minor | `v1.18.2` -> `v1.19.0` |

---

### Release Notes

<details>
<summary>cert-manager/cert-manager (cert-manager)</summary>

### [`v1.19.0`](https://github.com/cert-manager/cert-manager/releases/tag/v1.19.0)

[Compare Source](cert-manager/cert-manager@v1.18.2...v1.19.0)

cert-manager is the easiest way to automatically manage certificates in Kubernetes and OpenShift clusters.

This release focuses on expanding platform compatibility, improving deployment flexibility, enhancing observability, and addressing key reliability issues.

> 📖  Read the full release notes at cert-manager.io: <https://cert-manager.io/docs/releases/release-notes/release-notes-1.19>

Changes since `v1.18.0`:

#### Feature

- Add IPv6 rules to the default network policy ([#7726](cert-manager/cert-manager#7726), [@jcpunk](https://github.com/jcpunk))
- Add `global.nodeSelector` to helm chart to allow for a single `nodeSelector` to be set across all services. ([#7818](cert-manager/cert-manager#7818), [@StingRayZA](https://github.com/StingRayZA))
- Add a feature gate to default to Ingress `pathType` `Exact` in ACME HTTP01 Ingress challenge solvers. ([#7795](cert-manager/cert-manager#7795), [@sspreitzer](https://github.com/sspreitzer))
- Add generated `applyconfigurations` allowing clients to make type-safe server-side apply requests for cert-manager resources. ([#7866](cert-manager/cert-manager#7866), [@erikgb](https://github.com/erikgb))
- Added API defaults to issuer references group (cert-manager.io) and kind (Issuer). ([#7414](cert-manager/cert-manager#7414), [@erikgb](https://github.com/erikgb))
- Added `certmanager_certificate_challenge_status` Prometheus metric. ([#7736](cert-manager/cert-manager#7736), [@hjoshi123](https://github.com/hjoshi123))
- Added `protocol` field for `rfc2136` DNS01 provider ([#7881](cert-manager/cert-manager#7881), [@hjoshi123](https://github.com/hjoshi123))
- Added experimental field `hostUsers` flag to all pods. Not set by default. ([#7973](cert-manager/cert-manager#7973), [@hjoshi123](https://github.com/hjoshi123))
- Support configurable resource requests and limits for ACME HTTP01 solver pods through ClusterIssuer and Issuer specifications, allowing granular resource management that overrides global `--acme-http01-solver-resource-*` settings. ([#7972](cert-manager/cert-manager#7972), [@lunarwhite](https://github.com/lunarwhite))
- The `CAInjectorMerging` feature has been promoted to BETA and is now enabled by default ([#8017](cert-manager/cert-manager#8017), [@ThatsMrTalbot](https://github.com/ThatsMrTalbot))
- The controller, webhook and ca-injector now log their version and git commit on startup for easier debugging and support. ([#8072](cert-manager/cert-manager#8072), [@prasad89](https://github.com/prasad89))
- Updated `certificate` metrics to the collector approach. ([#7856](cert-manager/cert-manager#7856), [@hjoshi123](https://github.com/hjoshi123))

#### Bug or Regression

- ACME: Increased challenge authorization timeout to 2 minutes to fix `error waiting for authorization` ([#7796](cert-manager/cert-manager#7796), [@hjoshi123](https://github.com/hjoshi123))
- BUGFIX: permitted URI domains were incorrectly used to set the excluded URI domains in the CSR's name constraints ([#7816](cert-manager/cert-manager#7816), [@kinolaev](https://github.com/kinolaev))
- Enforced ACME HTTP-01 solver validation to properly reject configurations when multiple ingress options (`class`, `ingressClassName`, `name`) are specified simultaneously ([#8021](cert-manager/cert-manager#8021), [@lunarwhite](https://github.com/lunarwhite))
- Increase maximum sizes of PEM certificates and chains which can be parsed in cert-manager, to handle leaf certificates with large numbers of DNS names or other identities ([#7961](cert-manager/cert-manager#7961), [@SgtCoDFish](https://github.com/SgtCoDFish))
- Reverted adding the `global.rbac.disableHTTPChallengesRole` Helm option. ([#7836](cert-manager/cert-manager#7836), [@inteon](https://github.com/inteon))
- This change removes the `path` label of core ACME client metrics and will require users to update their monitoring dashboards and alerting rules if using those metrics. ([#8109](cert-manager/cert-manager#8109), [@mladen-rusev-cyberark](https://github.com/mladen-rusev-cyberark))
- Use the latest version of `ingress-nginx` in E2E tests to ensure compatibility ([#7792](cert-manager/cert-manager#7792), [@wallrj](https://github.com/wallrj))

#### Other (Cleanup or Flake)

- Helm: Fix naming template of `tokenrequest` RoleBinding resource to improve consistency ([#7761](cert-manager/cert-manager#7761), [@lunarwhite](https://github.com/lunarwhite))
- Improve error messages when certificates, CRLs or private keys fail admission due to malformed or missing PEM data ([#7928](cert-manager/cert-manager#7928), [@SgtCoDFish](https://github.com/SgtCoDFish))
- Major upgrade of Akamai SDK. NOTE: The new version has not been fully tested end-to-end due to the lack of cloud infrastructure. ([#8003](cert-manager/cert-manager#8003), [@hjoshi123](https://github.com/hjoshi123))
- Update kind images to include the Kubernetes 1.33 node image ([#7786](cert-manager/cert-manager#7786), [@wallrj](https://github.com/wallrj))
- Use `maps.Copy` for cleaner map handling ([#8092](cert-manager/cert-manager#8092), [@quantpoet](https://github.com/quantpoet))
- Vault: Migrate Vault E2E add-on tests from deprecated `vault-client-go` to the new `vault/api` client. ([#8059](cert-manager/cert-manager#8059), [@armagankaratosun](https://github.com/armagankaratosun))

</details>

---


This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Reviewed-on: https://gitea.alexlebens.dev/alexlebens/infrastructure/pulls/1711
Co-authored-by: Renovate Bot <renovate-bot@alexlebens.net>
Co-committed-by: Renovate Bot <renovate-bot@alexlebens.net>
@wallrj-cyberark
Member

@mladen-rusev-cyberark We have released this. Please test and give feedback: https://github.com/cert-manager/cert-manager/releases/tag/v1.19.1


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/acme Indicates a PR directly modifies the ACME Issuer code area/monitoring Indicates a PR or issue relates to monitoring area/testing Issues relating to testing cybr Used by CyberArk-employed maintainers to report to line management what's being worked on. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. ok-to-test release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Uncontrolled cardinality for Prometheus metrics certmanager_http_acme_client_request_*

6 participants