[7883] - Fix uncontrolled cardinality for Prometheus metrics #8109
…add tests Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
Hi @mladen-rusev-cyberark. Thanks for your PR.
I'm waiting for a cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`.
Once the patch is verified, the new status will be reflected by the `ok-to-test` label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
Pull Request Overview
This PR addresses a critical issue where ACME HTTP metrics had unbounded cardinality due to using dynamic URL paths as Prometheus labels. The fix replaces the high-cardinality path label with a bounded action label derived from logical ACME operations, significantly improving Prometheus performance and memory efficiency.
Key changes include:
- Replace `path` label with `action` label in ACME metrics to prevent cardinality explosion
- Implement context-based action propagation from ACME client methods to HTTP instrumentation
- Add comprehensive test coverage for the new metric labeling behavior
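To illustrate why a URL-derived label can explode in cardinality, here is a minimal sketch. It assumes a hypothetical normaliser, `firstTwoSegments`, that keeps at most the first two path segments; it is an approximation for illustration, not the code removed by this PR. The path prefixes and the helper `distinctLabels` are likewise made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// firstTwoSegments is a hypothetical stand-in for a path-normalising
// heuristic: it keeps at most the first two segments of a URL path.
func firstTwoSegments(path string) string {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) > 2 {
		parts = parts[:2]
	}
	return "/" + strings.Join(parts, "/")
}

// distinctLabels counts the unique label values produced when n different
// order IDs are requested under the given path prefix.
func distinctLabels(prefix string, n int) int {
	seen := map[string]struct{}{}
	for i := 0; i < n; i++ {
		seen[firstTwoSegments(fmt.Sprintf("%s/%d", prefix, i))] = struct{}{}
	}
	return len(seen)
}

func main() {
	// A three-segment layout collapses to a single label value.
	fmt.Println(distinctLabels("/acme/orders", 1000)) // prints 1
	// A two-segment layout keeps the ID, so every order is a new series.
	fmt.Println(distinctLabels("/orders", 1000)) // prints 1000
}
```

With a fixed set of action names, the number of label values no longer depends on how a particular CA lays out its URLs.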
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/metrics/metrics.go | Updates ACME metric definitions to use action instead of path labels and adds public accessors for testing |
| pkg/acme/client/middleware/logger.go | Injects bounded action names into request contexts for all ACME operations |
| pkg/acme/client/http_test.go | Adds comprehensive tests validating metric labeling and accumulation across different scenarios |
| pkg/acme/client/http.go | Modifies HTTP transport to extract action from context instead of processing URL paths |
Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
@erikgb Hi Erik, we discussed today how we would like this to be tested. Do you think that the added unit tests are exhaustive enough?
wallrj-cyberark
left a comment
After digging into this, I wonder if we should just drop the path label or set its value to "deprecated". What use is it? And what use will the action label be?
Suggest writing a paragraph explaining how you expect users will make use of the new action label. Will it actually be useful to know the rate and average request duration of get_order requests?
Here's the original motivation for adding the metrics:
And I've linked elsewhere to the original discussion about the path label when these metrics were first introduced.
wallrj-cyberark
left a comment
I forgot to link to this which I found while reading the background to this PR.
The prometheus client_golang module has various helpers for instrumenting the http client,
including a mechanism for capturing labels from the context of the request:
I'm not necessarily suggesting using that here in this PR, but perhaps in future we could adopt the promhttp helper functions in place of our own metrics round tripper.
Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
How to proceed with this PR was discussed during the open source standup today. I will try to write a short summary of the discussion and PR comments so as not to lose the added context and resources which @wallrj-cyberark provided.
Some things found which are out of scope:
Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
Was that just a test flake? It seems suspicious that the metrics-related tests should fail. Triggering a retest anyway to see if it was a flake.
/retest
…nitely otherwise) Signed-off-by: Mladen Rusev <mladen.rusev@venafi.com>
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: wallrj, wallrj-cyberark
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [cert-manager](https://cert-manager.io) ([source](https://github.com/cert-manager/cert-manager)) | minor | `v1.18.2` -> `v1.19.0` |

---

### Release Notes

<details>
<summary>cert-manager/cert-manager (cert-manager)</summary>

### [`v1.19.0`](https://github.com/cert-manager/cert-manager/releases/tag/v1.19.0)

[Compare Source](cert-manager/cert-manager@v1.18.2...v1.19.0)

cert-manager is the easiest way to automatically manage certificates in Kubernetes and OpenShift clusters.

This release focuses on expanding platform compatibility, improving deployment flexibility, enhancing observability, and addressing key reliability issues.

> 📖 Read the full release notes at cert-manager.io: <https://cert-manager.io/docs/releases/release-notes/release-notes-1.19>

Changes since `v1.18.0`:

#### Feature

- Add IPv6 rules to the default network policy ([#7726](cert-manager/cert-manager#7726), [@jcpunk](https://github.com/jcpunk))
- Add `global.nodeSelector` to helm chart to allow for a single `nodeSelector` to be set across all services. ([#7818](cert-manager/cert-manager#7818), [@StingRayZA](https://github.com/StingRayZA))
- Add a feature gate to default to Ingress `pathType` `Exact` in ACME HTTP01 Ingress challenge solvers. ([#7795](cert-manager/cert-manager#7795), [@sspreitzer](https://github.com/sspreitzer))
- Add generated `applyconfigurations` allowing clients to make type-safe server-side apply requests for cert-manager resources. ([#7866](cert-manager/cert-manager#7866), [@erikgb](https://github.com/erikgb))
- Added API defaults to issuer references group (cert-manager.io) and kind (Issuer). ([#7414](cert-manager/cert-manager#7414), [@erikgb](https://github.com/erikgb))
- Added `certmanager_certificate_challenge_status` Prometheus metric. ([#7736](cert-manager/cert-manager#7736), [@hjoshi123](https://github.com/hjoshi123))
- Added `protocol` field for `rfc2136` DNS01 provider ([#7881](cert-manager/cert-manager#7881), [@hjoshi123](https://github.com/hjoshi123))
- Added experimental field `hostUsers` flag to all pods. Not set by default. ([#7973](cert-manager/cert-manager#7973), [@hjoshi123](https://github.com/hjoshi123))
- Support configurable resource requests and limits for ACME HTTP01 solver pods through ClusterIssuer and Issuer specifications, allowing granular resource management that overrides global `--acme-http01-solver-resource-*` settings. ([#7972](cert-manager/cert-manager#7972), [@lunarwhite](https://github.com/lunarwhite))
- The `CAInjectorMerging` feature has been promoted to BETA and is now enabled by default ([#8017](cert-manager/cert-manager#8017), [@ThatsMrTalbot](https://github.com/ThatsMrTalbot))
- The controller, webhook and ca-injector now log their version and git commit on startup for easier debugging and support. ([#8072](cert-manager/cert-manager#8072), [@prasad89](https://github.com/prasad89))
- Updated `certificate` metrics to the collector approach. ([#7856](cert-manager/cert-manager#7856), [@hjoshi123](https://github.com/hjoshi123))

#### Bug or Regression

- ACME: Increased challenge authorization timeout to 2 minutes to fix `error waiting for authorization` ([#7796](cert-manager/cert-manager#7796), [@hjoshi123](https://github.com/hjoshi123))
- BUGFIX: permitted URI domains were incorrectly used to set the excluded URI domains in the CSR's name constraints ([#7816](cert-manager/cert-manager#7816), [@kinolaev](https://github.com/kinolaev))
- Enforced ACME HTTP-01 solver validation to properly reject configurations when multiple ingress options (`class`, `ingressClassName`, `name`) are specified simultaneously ([#8021](cert-manager/cert-manager#8021), [@lunarwhite](https://github.com/lunarwhite))
- Increase maximum sizes of PEM certificates and chains which can be parsed in cert-manager, to handle leaf certificates with large numbers of DNS names or other identities ([#7961](cert-manager/cert-manager#7961), [@SgtCoDFish](https://github.com/SgtCoDFish))
- Reverted adding the `global.rbac.disableHTTPChallengesRole` Helm option. ([#7836](cert-manager/cert-manager#7836), [@inteon](https://github.com/inteon))
- This change removes the `path` label of core ACME client metrics and will require users to update their monitoring dashboards and alerting rules if using those metrics. ([#8109](cert-manager/cert-manager#8109), [@mladen-rusev-cyberark](https://github.com/mladen-rusev-cyberark))
- Use the latest version of `ingress-nginx` in E2E tests to ensure compatibility ([#7792](cert-manager/cert-manager#7792), [@wallrj](https://github.com/wallrj))

#### Other (Cleanup or Flake)

- Helm: Fix naming template of `tokenrequest` RoleBinding resource to improve consistency ([#7761](cert-manager/cert-manager#7761), [@lunarwhite](https://github.com/lunarwhite))
- Improve error messages when certificates, CRLs or private keys fail admission due to malformed or missing PEM data ([#7928](cert-manager/cert-manager#7928), [@SgtCoDFish](https://github.com/SgtCoDFish))
- Major upgrade of Akamai SDK. NOTE: The new version has not been fully tested end-to-end due to the lack of cloud infrastructure. ([#8003](cert-manager/cert-manager#8003), [@hjoshi123](https://github.com/hjoshi123))
- Update kind images to include the Kubernetes 1.33 node image ([#7786](cert-manager/cert-manager#7786), [@wallrj](https://github.com/wallrj))
- Use `maps.Copy` for cleaner map handling ([#8092](cert-manager/cert-manager#8092), [@quantpoet](https://github.com/quantpoet))
- Vault: Migrate Vault E2E add-on tests from deprecated `vault-client-go` to the new `vault/api` client. ([#8059](cert-manager/cert-manager#8059), [@armagankaratosun](https://github.com/armagankaratosun))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Reviewed-on: https://gitea.alexlebens.dev/alexlebens/infrastructure/pulls/1711
Co-authored-by: Renovate Bot <renovate-bot@alexlebens.net>
Co-committed-by: Renovate Bot <renovate-bot@alexlebens.net>
@mladen-rusev-cyberark We have released this. Please test and give feedback: https://github.com/cert-manager/cert-manager/releases/tag/v1.19.1
This PR addresses issue #7883, where the ACME client's Prometheus metrics could grow with unbounded cardinality, leading to performance degradation and high memory usage in Prometheus.
Rationale
The existing instrumentation for ACME client metrics used the request's URL `path` as a label. A `pathProcessor` function attempted to normalize this by taking the first two URL segments. While this heuristic worked for ACME CAs like Let's Encrypt (e.g., `/acme/orders/<id>`), it failed for others like Google Public CA, whose URL structures are different (e.g., `/orders/<id>`).
This divergence meant that for some CAs, unique order/challenge identifiers were still included in the `path` label, creating a new time series for nearly every step of the certificate issuance process.
This change fixes the issue by replacing the brittle, high-cardinality `path` label with a new, low-cardinality `action` label. The `action` is a string representing the logical ACME operation being performed (e.g., `get_challenge`, `authorize_order`), ensuring a stable and predictable number of metric series regardless of the CA's URL structure.
Implementation Details
- The action name is set on the `context.Context` within the `Logger` middleware (`pkg/acme/client/middleware/logger.go`). This cleanly separates the instrumentation concern from the underlying ACME client logic.
- The `RoundTripper` in `pkg/acme/client/http.go` now reads the `action` value from the context using the `AcmeActionLabel` key.
- If an `action` is present, it's used as a label. A fallback of `unnamed_action` is used for any requests where the context value is not set, ensuring robustness. The brittle `pathProcessor` function has been removed.
- Tests were added to `pkg/acme/client/http_test.go` to validate the new labeling scheme, including context handling, the fallback mechanism, and counter accumulation, using the `prometheus/testutil` package for precise validation.
BREAKING CHANGE
This change modifies the labels of core ACME client metrics and will require users to update their monitoring dashboards and alerting rules if using those metrics.
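For readers who want a feel for the mechanism described under Implementation Details, the following is a minimal, standard-library-only sketch and not the code from this PR. It assumes a hypothetical context key type `actionKey` (the PR's key is named `AcmeActionLabel`), a hypothetical `countingTransport` that tallies requests in a map where the real round tripper would update Prometheus metrics, and a `stubTransport` so that no network access is needed.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"sync"
)

// actionKey is a hypothetical private context key type for this sketch.
type actionKey struct{}

// withAction returns a context carrying the logical ACME action name.
func withAction(ctx context.Context, action string) context.Context {
	return context.WithValue(ctx, actionKey{}, action)
}

// actionFrom reads the action from ctx, falling back to "unnamed_action".
func actionFrom(ctx context.Context) string {
	if a, ok := ctx.Value(actionKey{}).(string); ok && a != "" {
		return a
	}
	return "unnamed_action"
}

// countingTransport counts requests per action rather than per URL path.
type countingTransport struct {
	next   http.RoundTripper
	mu     sync.Mutex
	counts map[string]int
}

func newCountingTransport(next http.RoundTripper) *countingTransport {
	return &countingTransport{next: next, counts: map[string]int{}}
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	action := actionFrom(req.Context())
	t.mu.Lock()
	t.counts[action]++
	t.mu.Unlock()
	return t.next.RoundTrip(req)
}

// stubTransport answers every request with an empty 200 response.
type stubTransport struct{}

func (stubTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: http.StatusOK,
		Header:     make(http.Header),
		Body:       http.NoBody,
		Request:    req,
	}, nil
}

// doRequests issues one GET per URL; an empty action leaves the context unset.
func doRequests(t *countingTransport, action string, urls ...string) error {
	client := &http.Client{Transport: t}
	for _, u := range urls {
		ctx := context.Background()
		if action != "" {
			ctx = withAction(ctx, action)
		}
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
		if err != nil {
			return err
		}
		resp, err := client.Do(req)
		if err != nil {
			return err
		}
		resp.Body.Close()
	}
	return nil
}

func main() {
	tr := newCountingTransport(stubTransport{})
	if err := doRequests(tr, "get_order", "https://ca.example/orders/1", "https://ca.example/orders/2"); err != nil {
		panic(err)
	}
	if err := doRequests(tr, "", "https://ca.example/directory"); err != nil {
		panic(err)
	}
	fmt.Println(tr.counts) // prints map[get_order:2 unnamed_action:1]
}
```

Keeping the label lookup in the transport and the label assignment in the caller means the set of label values is bounded by the set of call sites, whatever URLs the CA serves.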
Output from changed metrics test
`pkg/acme/client/http_test.go` from `github.com/prometheus/client_golang@v1.23.2/prometheus/testutil/testutil.go:303`
Pull Request Motivation
Kind
/kind bug
Release Note
CyberArk tracker: VC-45587