Reduce load on the Kubernetes API server and reduce the peak memory use of the cert-manager components by enabling the use of the WatchList (Streaming Lists) feature #7175

Merged
cert-manager-prow[bot] merged 1 commit into cert-manager:master from wallrj:3748-enable-watchlist-streaming-lists
Sep 24, 2024

Conversation

@wallrj
Member

@wallrj wallrj commented Jul 12, 2024

Allow cert-manager users to enable the WatchListClient feature in client-go, so that they can experiment with the feature and evaluate its effect on the memory usage of the components.
They will also have to enable the WatchList feature of their Kubernetes API server.

  • This PR merges the client-go feature flags into the feature flags of each of the cert-manager components.
    It copies the technique, and some of the code, used by the kube-controller-manager. See:
  • When the client-go WatchListClient feature is eventually promoted to GA, the feature flag will be hidden. The flag will continue to work if enabled, but will error if disabled, because once the feature is GA it will no longer be possible to turn the implementation off.
  • I expect that new client-go feature flags will be introduced in future, which may or may not be applicable to cert-manager, but these will be disabled by default and should not affect cert-manager users.
  • We will need to be aware that the client-go features are being enabled in our E2E tests.
  • And we should update the documentation to explain which of the features are client-go features and link to the documentation for those features.
  • The ALPHA WatchList feature of the kube-apiserver has been available since v1.27.
  • The ALPHA WatchListClient feature of client-go was introduced in v0.28.0, enabled with a non-standard ENABLE_CLIENT_GO_WATCH_LIST_ALPHA environment variable.
  • The WatchListClient feature was promoted to BETA in v0.31.0, with a new WatchListClient feature flag.
  • It is not clear when the WatchList and WatchListClient features will become GA (enabled by default).
    The following comment from a K8S contributor says:

    ...but only after this feature graduates to GA, which most likely be not sooner than 1.33, although looking at 3157 it's still TBD.

  • I've measured a significant memory reduction in both the controller and the kube-apiserver when the WatchList feature is enabled (see the Testing section below).
  • There is no memory reduction in cainjector, because it only caches the metadata of Secrets.
  • There is no memory reduction in webhook, because it doesn't cache any resources.
  • I did not find any examples of other projects that have enabled this feature:

Fixes: #3748
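
The adapter idea in the list above can be sketched roughly as follows. This is a minimal, standard-library-only model and not the code from this PR: `FeatureSpec`, `ComponentGate` and this `clientAdapter` are hypothetical stand-ins for the real client-go and component-base types. It shows a client library registering its features into a component's own feature gate, and a GA (locked) feature that still accepts `true` but rejects `false`.

```go
package main

import "fmt"

// FeatureSpec is a hypothetical stand-in for a feature specification.
type FeatureSpec struct {
	Default       bool
	LockToDefault bool // assumed to be set once a feature is GA
}

// ComponentGate is a hypothetical stand-in for a component's mutable feature gate.
type ComponentGate struct {
	specs   map[string]FeatureSpec
	enabled map[string]bool
}

func NewComponentGate() *ComponentGate {
	return &ComponentGate{specs: map[string]FeatureSpec{}, enabled: map[string]bool{}}
}

// Add registers new features with their default values and rejects duplicates.
func (g *ComponentGate) Add(in map[string]FeatureSpec) error {
	for name := range in {
		if _, exists := g.specs[name]; exists {
			return fmt.Errorf("feature %q is already registered", name)
		}
	}
	for name, spec := range in {
		g.specs[name] = spec
		g.enabled[name] = spec.Default
	}
	return nil
}

// Set applies a user-supplied value, as a --feature-gates flag would.
func (g *ComponentGate) Set(name string, value bool) error {
	spec, ok := g.specs[name]
	if !ok {
		return fmt.Errorf("unrecognized feature gate %q", name)
	}
	if spec.LockToDefault && value != spec.Default {
		return fmt.Errorf("cannot set feature gate %q to %v: locked to %v", name, value, spec.Default)
	}
	g.enabled[name] = value
	return nil
}

func (g *ComponentGate) Enabled(name string) bool { return g.enabled[name] }

// clientAdapter exposes the component's gate to the client library, so that
// client features are registered and queried through the same gate.
type clientAdapter struct{ gate *ComponentGate }

func (a *clientAdapter) Add(in map[string]FeatureSpec) error { return a.gate.Add(in) }
func (a *clientAdapter) Enabled(name string) bool             { return a.gate.Enabled(name) }

func main() {
	gate := NewComponentGate()
	ca := &clientAdapter{gate}
	// The client library registers its features into the component's gate.
	if err := ca.Add(map[string]FeatureSpec{"WatchListClient": {Default: false}}); err != nil {
		panic(err)
	}
	// The user opts in through the component's existing flag.
	if err := gate.Set("WatchListClient", true); err != nil {
		panic(err)
	}
	fmt.Println("WatchListClient enabled:", ca.Enabled("WatchListClient"))
}
```

The point of the design is that the component keeps a single source of truth for feature state, so client features appear in the same `--feature-gates` help output as the component's own features.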

Background

Release Note

Feature: Add a new `WatchListClient` feature flag to the cert-manager controller, cainjector, and webhook, which allows the components to use the ALPHA `WatchList` / Streaming list feature of the Kubernetes API server. This reduces the load on the Kubernetes API server when cert-manager starts up and reduces the peak memory usage of the cert-manager components.

Testing

Memory reduction in a Kind cluster

I deployed cert-manager from master and from this branch in a Kind cluster with 100MiB of Secret resources.
I measured the peak memory usage of the controller, cainjector, and webhook using the VmHWM field in /proc/*/status.

| component | before | after |
|---|---|---|
| controller | 362 | 229 |
| cainjector | 46 | 44 |
| webhook | 48 | 49 |
| etcd | 366 | 358 |
| kube-apiserver | 1134 | 824 |

See https://gist.github.com/wallrj/f15ad450f1b3effb107db5e6a01bf03f
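
For readers who want to reproduce the measurement, here is a minimal sketch of reading VmHWM (the peak resident set size reported by the Linux kernel) from the text of a `/proc/<pid>/status` file. The function name `peakRSSKiB` and the sample input are my own illustration, not taken from the gist; the kernel reports the value in kB.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// peakRSSKiB extracts the VmHWM value from the contents of a
// /proc/<pid>/status file. The expected line shape is "VmHWM:  <number> kB".
func peakRSSKiB(status string) (int64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "VmHWM:" {
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("VmHWM not found")
}

func main() {
	// Illustrative sample only; this is not a measurement from this PR.
	sample := "Name:\texample\nVmPeak:\t 1300000 kB\nVmHWM:\t  204800 kB\nVmRSS:\t  150000 kB\n"
	kib, err := peakRSSKiB(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("peak RSS: %d kB (%.0f MiB)\n", kib, float64(kib)/1024)
}
```

In a real run the input would come from reading `/proc/<pid>/status` for each component's process after the caches have been populated.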

Log messages

With --feature-gates=WatchListClient=true and --v 6 you'll see WATCH requests with the following query string parameters:

  • resourceVersionMatch=NotOlderThan
  • sendInitialEvents=true

kubectl logs -n cert-manager deploy/cert-manager | fgrep -v -e "starting worker" -e "leaderelection"

I0712 16:40:35.025446 1 round_trippers.go:553] GET https://10.0.0.1:443/api/v1/secrets?allowWatchBookmarks=true&labelSelector=controller.cert-manager.io%2Ffao%3Dtrue&resourceVersionMatch=NotOlderThan&sendInitialEvents=true&timeout=7m37s&timeoutSeconds=457&watch=true 200 OK in 2502 milliseconds
I0712 16:40:35.025528 1 round_trippers.go:553] GET https://10.0.0.1:443/api/v1/secrets?allowWatchBookmarks=true&labelSelector=%21controller.cert-manager.io%2Ffao&resourceVersionMatch=NotOlderThan&sendInitialEvents=true&timeout=8m7s&timeoutSeconds=487&watch=true 200 OK in 2501 milliseconds
I0712 16:40:35.655423 1 reflector.go:798] exiting k8s.io/client-go@v0.30.2/tools/cache/reflector.go:232 Watch because received the bookmark that marks the end of initial events stream, total 6 items received in 3.13200872s
I0712 16:40:35.655514 1 reflector.go:359] Caches populated for *v1.PartialObjectMetadata from k8s.io/client-go@v0.30.2/tools/cache/reflector.go:232
I0712 16:40:35.655697 1 reflector.go:798] exiting k8s.io/client-go@v0.30.2/tools/cache/reflector.go:232 Watch because received
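
When scanning logs like these, a small helper can confirm that a request is a streaming list rather than a classic LIST followed by a WATCH. The sketch below checks a logged URL for the query parameters listed above; `isStreamingWatch` is a hypothetical helper name, and treating `watch=true` as a third required parameter is my assumption based on the logged URLs.

```go
package main

import (
	"fmt"
	"net/url"
)

// isStreamingWatch reports whether a request URL carries the query
// parameters seen on the WatchList requests in the log output.
func isStreamingWatch(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	q := u.Query()
	return q.Get("watch") == "true" &&
		q.Get("sendInitialEvents") == "true" &&
		q.Get("resourceVersionMatch") == "NotOlderThan", nil
}

func main() {
	// URL copied from the log output above.
	logged := "https://10.0.0.1:443/api/v1/secrets?allowWatchBookmarks=true&labelSelector=%21controller.cert-manager.io%2Ffao&resourceVersionMatch=NotOlderThan&sendInitialEvents=true&timeout=8m7s&timeoutSeconds=487&watch=true"
	ok, err := isStreamingWatch(logged)
	if err != nil {
		panic(err)
	}
	fmt.Println("streaming list request:", ok)
}
```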

Command Line Flags

You'll see the new WatchListClient feature in the --feature-gates help output of each component.

$ go run ./cmd/controller --help | fgrep -A 20 -- --feature-gates
      --feature-gates mapStringBool                          A set of key=value pairs that describe feature gates for alpha/experimental features. Options are:
                                                             AdditionalCertificateOutputFormats=true|false (BETA - default=true)
                                                             AllAlpha=true|false (ALPHA - default=false)
                                                             AllBeta=true|false (BETA - default=false)
                                                             ExperimentalCertificateSigningRequestControllers=true|false (ALPHA - default=false)
                                                             ExperimentalGatewayAPISupport=true|false (BETA - default=true)
                                                             InformerResourceVersion=true|false (ALPHA - default=false)
                                                             LiteralCertificateSubject=true|false (BETA - default=true)
                                                             NameConstraints=true|false (ALPHA - default=false)
                                                             OtherNames=true|false (ALPHA - default=false)
                                                             SecretsFilteredCaching=true|false (BETA - default=true)
                                                             ServerSideApply=true|false (ALPHA - default=false)
                                                             StableCertificateRequestName=true|false (BETA - default=true)
                                                             UseCertificateRequestBasicConstraints=true|false (ALPHA - default=false)
                                                             UseDomainQualifiedFinalizer=true|false (ALPHA - default=false)
                                                             ValidateCAA=true|false (ALPHA - default=false)
                                                             WatchListClient=true|false (BETA - default=false)
...
$ go run ./cmd/cainjector --help

      --feature-gates mapStringBool                          A set of key=value pairs that describe feature gates for alpha/experimental features. Options are:
                                                             AllAlpha=true|false (ALPHA - default=false)
                                                             AllBeta=true|false (BETA - default=false)
                                                             InformerResourceVersion=true|false (ALPHA - default=false)
                                                             ServerSideApply=true|false (ALPHA - default=false)
                                                             WatchListClient=true|false (BETA - default=false)
$ go run ./cmd/webhook --help | fgrep -A 20 -- --feature-gates
      --feature-gates mapStringBool                          A set of key=value pairs that describe feature gates for alpha/experimental features. Options are:
                                                             AdditionalCertificateOutputFormats=true|false (BETA - default=true)
                                                             AllAlpha=true|false (ALPHA - default=false)
                                                             AllBeta=true|false (BETA - default=false)
                                                             InformerResourceVersion=true|false (ALPHA - default=false)
                                                             LiteralCertificateSubject=true|false (BETA - default=true)
                                                             NameConstraints=true|false (ALPHA - default=false)
                                                             OtherNames=true|false (ALPHA - default=false)
                                                             WatchListClient=true|false (BETA - default=false)
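
The help text above describes `--feature-gates` as a set of key=value pairs. As a rough illustration of how such a value maps to booleans, here is a standard-library-only sketch that parses a comma-separated string. `parseFeatureGates` is a hypothetical function written for this example; the real flag parsing is done by the flag library the components already use, and its exact behaviour may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFeatureGates turns "A=true,B=false" into a map of feature name to
// boolean. It rejects pairs without "=" and values that are not booleans.
func parseFeatureGates(s string) (map[string]bool, error) {
	out := map[string]bool{}
	for _, pair := range strings.Split(s, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue
		}
		key, value, found := strings.Cut(pair, "=")
		if !found {
			return nil, fmt.Errorf("malformed pair %q, expected key=value", pair)
		}
		enabled, err := strconv.ParseBool(strings.TrimSpace(value))
		if err != nil {
			return nil, fmt.Errorf("invalid value for %q: %w", key, err)
		}
		out[strings.TrimSpace(key)] = enabled
	}
	return out, nil
}

func main() {
	gates, err := parseFeatureGates("WatchListClient=true,ServerSideApply=false")
	if err != nil {
		panic(err)
	}
	fmt.Println("WatchListClient:", gates["WatchListClient"])
	fmt.Println("ServerSideApply:", gates["ServerSideApply"])
}
```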

Testing cert-manager 1.15

I took the 1.15 binaries and enabled the client watch list feature using the old ENABLE_CLIENT_GO_WATCH_LIST_ALPHA environment variable.
With 100MiB of Secrets:

  • controller peak memory (VmHWM) dropped from 436 to 212 MB.
  • kube-apiserver peak memory dropped from 1134 to 772 MB.

@cert-manager-prow cert-manager-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. area/deploy Indicates a PR modifies deployment configuration needs-kind Indicates a PR lacks a `kind/foo` label and requires one. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 12, 2024
@wallrj wallrj changed the title WIP: Enable the WatchList (Streaming Lists) feature WIP: Avoid overloading the Kubernetes API server by enabling the WatchList (Streaming Lists) feature Jul 17, 2024
@inteon inteon changed the title WIP: Avoid overloading the Kubernetes API server by enabling the WatchList (Streaming Lists) feature WIP: Reduce load on the Kubernetes API server by enabling the WatchList (Streaming Lists) feature Aug 14, 2024
@inteon
Member

inteon commented Aug 26, 2024

@wallrj we are now using the v0.31.0 client-go libraries.

@jsoref
Contributor

jsoref commented Sep 17, 2024

/kind bug

@cert-manager-prow cert-manager-prow bot added kind/bug Categorizes issue or PR as related to a bug. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Sep 17, 2024
@inteon inteon self-assigned this Sep 19, 2024
@inteon inteon force-pushed the 3748-enable-watchlist-streaming-lists branch 2 times, most recently from cd44551 to 75f28b0 Compare September 19, 2024 10:12
// features are wired to the existing --feature-gates flag just as all other features
// are. Further, client-go features automatically support the existing mechanisms for
// feature enablement metrics and test overrides.
ca := &clientAdapter{utilfeature.DefaultMutableFeatureGate}
Member

Maybe we should add this to cainjector and the webhook too?

Signed-off-by: Richard Wall <richard.wall@venafi.com>
@wallrj wallrj force-pushed the 3748-enable-watchlist-streaming-lists branch from 75f28b0 to 9ed80cf Compare September 24, 2024 08:35
Member Author

@wallrj wallrj left a comment

There are currently two client-go features:

@wallrj wallrj changed the title WIP: Reduce load on the Kubernetes API server by enabling the WatchList (Streaming Lists) feature Reduce load on the Kubernetes API server and reduce the peak memory use of the cert-manager components by enabling the use of the WatchList (Streaming Lists) feature Sep 24, 2024
@cert-manager-prow cert-manager-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 24, 2024
@wallrj wallrj requested a review from inteon September 24, 2024 13:17
Member

@inteon inteon left a comment

/approve
/lgtm

@cert-manager-prow cert-manager-prow bot added the lgtm Indicates that a PR is ready to be merged. label Sep 24, 2024
@cert-manager-prow
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: inteon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cert-manager-prow cert-manager-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 24, 2024
@cert-manager-prow cert-manager-prow bot merged commit 81bd1c5 into cert-manager:master Sep 24, 2024
@wallrj wallrj deleted the 3748-enable-watchlist-streaming-lists branch September 24, 2024 13:20
#
# - https://kind.sigs.k8s.io/docs/user/configuration/#feature-gates
# - https://kubernetes.io/docs/reference/using-api/api-concepts/#streaming-lists
WatchList: true
Member Author

This broke the K8S 1.25 and 1.26 E2E tests because the WatchList feature was added in 1.27.
We've decided to remove support for 1.25 and 1.26 in cert-manager 1.16.
See:

Member

*not test, we might still support those versions

@wallrj
Member Author

wallrj commented Sep 26, 2024

/kind feature

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/deploy Indicates a PR modifies deployment configuration dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cert-manager causes API server panic on clusters with more than 20000 secrets.

3 participants