
Cert-manager causes API server panic on clusters with more than 20000 secrets #3748

@mvukadinoff

Description


📢 UPDATE: 2024-07-17

  1. In cert-manager 1.12 we introduced an experimental feature that allows cert-manager controller to only list the Secrets that it is concerned with. In cert-manager 1.13 this SecretsFilteredCaching feature was enabled by default. This feature is primarily intended to reduce the memory usage of the cert-manager controller, but it is relevant to this issue because it also stops cert-manager from requesting the data of all Secrets when it starts up. That query is partly responsible for overloading the Kubernetes API server.
  2. In Reduce memory usage of cainjector by only caching the metadata of Secret resources #7161 we've updated cainjector to use a metadata-only Secret cache. This is relevant to this issue because it stops cainjector from requesting the data of all Secrets when it starts up. This improvement will be released in cert-manager 1.16.
  3. We've documented why, when and how to configure cainjector to only list the Secrets from the cert-manager namespace, which will also avoid it requesting all the data of all Secrets on start up.
  4. The WatchList feature will be the ultimate solution to this problem, but it is still an alpha/beta feature and not enabled by default in the Kubernetes API server. Meanwhile, you can experiment with the feature and tell us whether it helps.
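For item 3, the namespace restriction can be applied through the Helm chart; a minimal sketch, assuming the standard jetstack chart and its cainjector.extraArgs value (check the cert-manager docs linked above before applying, since scoping cainjector to one namespace disables CA injection for resources elsewhere):

```shell
# Scope cainjector to the cert-manager namespace so it no longer
# lists all Secrets cluster-wide on start-up.
helm upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --reuse-values \
  --set "cainjector.extraArgs={--namespace=cert-manager}"
```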

Describe the bug:
On clusters with more than 20000 Secrets this becomes a problem: the LIST query that cert-manager issues is not optimal.
/api/v1/secrets?limit=500&resourceVersion=0

resourceVersion=0 causes the API server to always return all Secrets, and limit=500 is not taken into account. This makes cert-manager unscalable for large deployments, since Secrets are used for much more than certificates.
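These semantics can be observed directly with kubectl's raw API access; a sketch, assuming cluster-admin credentials. With resourceVersion=0 the list is served from the API server's watch cache and the limit parameter may be ignored, whereas omitting resourceVersion allows proper pagination from etcd:

```shell
# Served from the watch cache: limit may be ignored and the full
# Secret list returned in one response.
kubectl get --raw '/api/v1/secrets?limit=500&resourceVersion=0'

# Paginated list: returns at most 500 items plus a continue token
# for fetching the next page.
kubectl get --raw '/api/v1/secrets?limit=500'
```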

As mentioned in kubernetes/kubernetes#56278 and https://kubernetes.io/docs/reference/using-api/api-concepts/

I suggest removing resourceVersion=0 from the query, which should make it much faster.

Furthermore, cert-manager retries those queries without waiting for them to complete; they pile up and cause significant load, and even crashes, on the API server. cert-manager effectively DDoSes the API server.
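The initial LIST comes from client-go's reflector, which lists with ResourceVersion "0" by default. A minimal sketch of the equivalent request, assuming a standard client-go clientset (this is an illustration, not cert-manager's actual code):

```go
// Sketch: the kind of list request a reflector issues on start-up.
// With ResourceVersion "0" the API server serves the full list from
// its watch cache and may ignore Limit entirely.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Listing Secrets across all namespaces, as cert-manager's
	// informers do on start-up.
	secrets, err := clientset.CoreV1().Secrets(metav1.NamespaceAll).List(
		context.TODO(),
		metav1.ListOptions{Limit: 500, ResourceVersion: "0"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Printf("got %d secrets in the first response\n", len(secrets.Items))
}
```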

We're hitting the same issue with:
quay.io/jetstack/cert-manager-cainjector:v0.11.0
quay.io/jetstack/cert-manager-controller:v0.11.0
and
quay.io/jetstack/cert-manager-controller:v1.1.0
quay.io/jetstack/cert-manager-cainjector:v1.1.0

Logs from the API server:


E0115 18:27:27.893242       1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)
goroutine 79221267 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x3b1fda0, 0xc0001c6650)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc0feb65c90, 0x1, 0x1)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x3b1fda0, 0xc0001c6650)
        /usr/local/go/src/runtime/panic.go:679 +0x1b2
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc08dc08740, 0xc09ea59b80)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:257 +0x1cf
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc019eb1960, 0x4edf040, 0xc0a1206af0, 0xc07749d900)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:141 +0x310
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.WithWaitGroup.func1(0x4edf040, 0xc0a1206af0, 0xc07749d800)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/waitgroup.go:47 +0x10f
net/http.HandlerFunc.ServeHTTP(0xc0434bf3e0, 0x4edf040, 0xc0a1206af0, 0xc07749d800)
        /usr/local/go/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1(0x4edf040, 0xc0a1206af0, 0xc07749d600)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/requestinfo.go:39 +0x274
net/http.HandlerFunc.ServeHTTP(0xc0434bf470, 0x4edf040, 0xc0a1206af0, 0xc07749d600)
        /usr/local/go/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithCacheControl.func1(0x4edf040, 0xc0a1206af0, 0xc07749d600)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/cachecontrol.go:31 +0xa8
net/http.HandlerFunc.ServeHTTP(0xc019eb1a20, 0x4edf040, 0xc0a1206af0, 0xc07749d600)
        /usr/local/go/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.WithLogging.func1(0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:89 +0x2ca
net/http.HandlerFunc.ServeHTTP(0xc019eb1a40, 0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /usr/local/go/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1(0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/wrap.go:51 +0x13e
net/http.HandlerFunc.ServeHTTP(0xc019eb1a60, 0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /usr/local/go/src/net/http/server.go:2007 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc0434bf4a0, 0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/handler.go:189 +0x51
net/http.serverHandler.ServeHTTP(0xc009896a80, 0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /usr/local/go/src/net/http/server.go:2802 +0xa4
net/http.initNPNRequest.ServeHTTP(0x4eeb300, 0xc06df08a50, 0xc07a0df180, 0xc009896a80, 0x4ed2980, 0xc0c4b6b240, 0xc07749d500)
        /usr/local/go/src/net/http/server.go:3366 +0x8d
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).runHandler(0xc094106480, 0xc0c4b6b240, 0xc07749d500, 0xc08dc08340)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:2149 +0x9f
created by k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).processHeaders
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:1883 +0x4eb
E0115 18:27:27.893364       1 wrap.go:39] apiserver panic'd on GET /api/v1/secrets?limit=500&resourceVersion=0
I0115 18:27:27.893567       1 log.go:172] http2: panic serving 10.148.0.16:53202: killing connection/stream because serving request timed out and response had been started
goroutine 79221267 [running]:
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).runHandler.func1(0xc0c4b6b240, 0xc0feb65f67, 0xc094106480)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:2142 +0x16b
panic(0x3b1fda0, 0xc0001c6650)
        /usr/local/go/src/runtime/panic.go:679 +0x1b2
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc0feb65c90, 0x1, 0x1)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x105
panic(0x3b1fda0, 0xc0001c6650)
        /usr/local/go/src/runtime/panic.go:679 +0x1b2
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc08dc08740, 0xc09ea59b80)
        /workspace/anago-v1.16.13-rc.0.25+dda9914de448ab/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:257 +0x1cf
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*time

Logs from ETCD:

...
Dec 11 09:21:08 ow-prod-k8s-master01 etcd[6830]: 2020-12-11 08:21:08.106948 W | etcdserver: failed to send out heartbeat on time (exceeded the 250ms timeout for 5.348150525s)
Dec 11 09:21:08 ow-prod-k8s-master01 etcd[6830]: 2020-12-11 08:21:08.106954 W | etcdserver: server is likely overloaded
afterwards slow, but succeeding ...
Dec 11 09:23:26 ow-prod-k8s-master01 etcd[6830]: 2020-12-11 08:23:26.433315 W | etcdserver: read-only range request "key:\"/registry/persistentvolumes/pvc-f31decea-7a39-4d11-bbbf-8eb45f433239\" " with result "range_response_count:1 size:1017" took too long (13.750148565s) to execute

Logs from cert-manager:

E0203 15:18:34.063192       1 wrap.go:39] apiserver panic'd on GET /api/v1/secrets?limit=500&resourceVersion=0

E0203 15:18:33.969252       1 reflector.go:123] external/io_k8s_client_go/tools/cache/reflector.go:96: Failed to list *v1.Secret: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 37511; INTERNAL_ERROR

Expected behaviour:
cert-manager should not make heavy queries that list all Secrets across all namespaces; instead it should work per namespace.

Steps to reproduce the bug:

Generate 15000 secrets - no need for them to be for TLS certificates, any secret will do.
Look at the API server load and Cert-manager logs
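The Secrets for the first step can be generated with a simple loop; a sketch, assuming kubectl access and a disposable test namespace:

```shell
# Create 15000 dummy Secrets in a throwaway namespace; any Secret
# type triggers the issue, they need not be TLS certificates.
kubectl create namespace secret-load-test
for i in $(seq 1 15000); do
  kubectl create secret generic "dummy-$i" \
    --namespace secret-load-test \
    --from-literal=key=value
done
```

Delete the namespace afterwards to remove all the Secrets in one step.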

Anything else we need to know?:

Environment details:

  • Kubernetes version: Kubernetes v1.16.13
  • Cloud-provider/provisioner: Vanilla K8s
  • cert-manager version: v1.1.0
  • Install method: helm (with CRDs applied before that)

/kind bug

    Labels

    kind/bug: Categorizes issue or PR as related to a bug.
    lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
    triage/needs-information: Indicates an issue needs more information in order to work on it.
