Consolidation of CRD Scaling Issues #2895

@ulucinar

Description

What problem are you facing?

Previously, in crossplane/terrajet#47, #2649 and kubernetes/kubernetes#105932, we observed high resource consumption in the Kubernetes API server when over 500 CRDs are installed. Especially with local kind clusters, installing provider-jet-aws with over 700 CRDs could render the cluster unresponsive on machines with relatively low CPU resources, due to the high CPU consumption. With managed Kubernetes offerings such as GKE, EKS and AKS, to varying degrees depending on the cluster configuration, we observed API service disruptions as we installed 100s or 1000s of CRDs. While discussing those issues as the Crossplane community, we tried a couple of different workarounds detailed in #2649, such as installing CRDs in smaller batches and giving the API server some time to finish aggregating their OpenAPI specs.

Meanwhile, after getting in touch with the upstream Kubernetes maintainers, we learned that they were already working on kubernetes/kube-openapi#251, which would introduce lazy marshalling for the aggregated OpenAPI specs of the installed CRDs served from the /openapi/v2 endpoint. Collecting profiling data, building kind images with that PR and testing them in local kind clusters, we observed substantial improvements in peak resource consumption. However, as discussed in #2649, some open points remained:

  • The discovery client throttles itself (client-side throttling) while discovering the exposed API if its cache is not yet established (first-time discovery) or has been invalidated (subsequent discoveries)
  • Metrics collected from the API server reveal a sustained memory consumption of ~2.75 GiB for ~600 CRDs, and memory consumption increases with the number of CRDs
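The throttling in the first bullet comes from client-go's token-bucket rate limiter: once the initial burst of tokens is spent, each further discovery request waits for the bucket to refill. The sketch below is a rough model of the resulting wall-clock cost; the QPS/burst numbers and the request counts are illustrative assumptions (real defaults vary by client and version), not measurements:

```python
# Simplified token-bucket model of client-go's client-side rate limiting.
# QPS/burst values here are illustrative; real defaults differ per client
# and Kubernetes version.

def discovery_wall_time(requests: int, qps: float, burst: int) -> float:
    """Seconds spent waiting to issue `requests` calls: the first `burst`
    are free, the rest wait for tokens refilled at `qps` per second."""
    throttled = max(0, requests - burst)
    return throttled / qps

# Hypothetical numbers: a provider with ~750 CRDs spread over ~150
# group-versions costs roughly one discovery request per group-version
# on a cold cache.
print(discovery_wall_time(150, qps=5.0, burst=100))  # 10.0 s of pure throttling
print(discovery_wall_time(750, qps=5.0, burst=100))  # 130.0 s if every CRD costs a request
```

This is why the delay shows up on first discovery and again whenever the cache is invalidated: each cold pass pays the full token-bucket cost.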

Although we could not directly observe the effects of the aggregated OpenAPI spec lazy-marshalling optimization in a managed Kubernetes offering [1], we expected it to alleviate control-plane service disruptions. The fix is available starting with Kubernetes v1.20.13, v1.21.7, v1.22.4 and v1.23.0, and as of this writing:

  • v1.23.1 and v1.22.4 are available for most GKE regions via the rapid GKE release channel, and
  • v1.22.4 and v1.21.7 are available for most AKS regions.
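For intuition, the lazy-marshalling idea in kubernetes/kube-openapi#251 can be thought of as caching each contributing spec's serialized bytes and re-marshalling only what changed since the last /openapi/v2 request. The following is a simplified, hypothetical illustration of that idea, not the actual kube-openapi implementation:

```python
import json

class LazySpecCache:
    """Caches the serialized form of each CRD's OpenAPI spec and
    re-marshals only the specs that changed since the last request."""

    def __init__(self):
        self._specs = {}        # name -> spec dict
        self._serialized = {}   # name -> cached serialized bytes
        self.marshal_calls = 0  # counts actual marshalling work, for illustration

    def put(self, name, spec):
        self._specs[name] = spec
        self._serialized.pop(name, None)  # invalidate only this entry

    def aggregate(self) -> bytes:
        parts = []
        for name, spec in sorted(self._specs.items()):
            if name not in self._serialized:  # marshal lazily, once per change
                self._serialized[name] = json.dumps(spec).encode()
                self.marshal_calls += 1
            parts.append(self._serialized[name])
        return b"[" + b",".join(parts) + b"]"

cache = LazySpecCache()
for i in range(700):
    cache.put(f"crd-{i}", {"kind": f"Kind{i}"})
cache.aggregate()           # first request marshals all 700 specs
cache.put("crd-0", {"kind": "Kind0v2"})
cache.aggregate()           # second request re-marshals only the changed spec
print(cache.marshal_calls)  # 701, instead of 1400 with eager re-marshalling
```

Under this scheme, installing one more CRD no longer forces the API server to re-serialize the specs of every CRD already present, which is where the peak CPU savings come from.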

This allowed us to test the hypothesis with the lazy-marshalling optimization, and we recently ran a number of tests with the managed Kubernetes offerings. Please note that the following are reports from single experiments, and only server-side issues are reported. For all clusters, even when there is no API service disruption, we always experience client-side throttling.

GKE Zonal

On a GKE v1.23.1 zonal cluster with three worker nodes of type e2-medium, it took 56 min for the ProviderRevision to acquire the Healthy condition for a provider-jet-aws installation with 763 CRDs. During this period, the cluster needed at least one repair, and there was over 50 min of service disruption. After some initial confusion between worker-node HA and control-plane HA, this thread suggests using regional clusters for control-plane high availability, and states that control-plane service disruptions are expected for zonal GKE clusters while GKE scales the managed control plane under load. However, as discussed next, the situation with a regional cluster was not good either.


GKE Regional

A v1.23.1 cluster was provisioned in the us-east1 region with one e2-medium node in each availability zone, for a total of 3 worker nodes across 3 zones. Although it took only ~150 s for the ProviderRevision to acquire the Healthy condition for the provider-jet-aws@v0.4.0-preview installation, the regional GKE cluster went through repairing mode at least three times afterwards. In between the cluster's "RUNNING" and "RECONCILING" states, we observed various kinds of errors in response to kubectl commands, notably connection errors and I/O timeouts while reaching the API server. It took over an hour for the cluster to stabilize, though the control plane was intermittently available for short periods during this time.


EKS

When we performed these tests, none of the Kubernetes versions carrying the lazy-marshalling optimization were available for EKS clusters. However, we used a v1.21.5-eks-bc4871b cluster with 3 worker nodes to test a provider-jet-aws@v0.4.0-preview installation. Grafana dashboards indicated a control plane with two members from the very beginning, before the provider-jet-aws package was installed into the system. It took the ProviderRevision about 2 min to acquire the Healthy condition, and the cluster stayed stable after the provider installation. A reexamination of the Grafana dashboards after 3 hours revealed that the two-member control plane had been scaled up and down during this period. Apart from the client-side throttling issues, the control plane is stable with a provider-jet-aws installation.


AKS

A v1.22.4 cluster was provisioned with two worker nodes. It took ~112 s for the ProviderRevision to acquire the Healthy condition for a provider-jet-aws@v0.4.0-preview installation, and the cluster stayed stable afterwards. We also installed provider-jet-azure@v0.7.0-preview on this cluster, bringing the total CRD count to 1430. It took 91 s for provider-jet-azure's ProviderRevision to acquire the Healthy condition, and ~40 s for the discovery client to refresh its cache. We observed some pod restarts due to timeouts, but the cluster stayed stable. However, adding provider-jet-gcp@v0.2.0-preview as a third provider finally "broke" the cluster: Crossplane had a hard time installing the package.

We also repeated these experiments on a two-worker-node AKS v1.20.13 cluster with similar results: it took 107 s for the ProviderRevision to acquire the Healthy condition, and the cluster stayed stable after the provider installation. Reaching a total of 780 CRDs, the cluster had no severe issues; client-side throttling kicked in as expected.


We also need a deeper understanding of which metrics the cloud providers employ when deciding to scale up their control planes. The usual metrics are CPU/memory consumption and/or utilization of the control-plane components, but they could as well depend on metrics such as the number of installed CRDs, API server response-time SLIs, etc.

Another dimension where we need clarification is why we observe control-plane service disruptions while the managed control plane is scaling up. It is perfectly expected that, at some point, as we continue adding more CRDs to the cluster, the control plane will need to scale up; but if it has high-availability features, why are we not isolated from the effects?

We are now at a point where there are several upstream Kubernetes issues and PRs, Crossplane issues and discussions happening in different places, which makes it hard to pin down the exact problems and root causes. Discussion is also still ongoing under #2649. This motivates us to open this issue so that we can decide how to move on.

[1]: Currently, we are not aware of a configuration option or another mechanism that would allow us to replace the kube-openapi with a custom build for an AKS/EKS/GKE cluster.

How could Crossplane help solve your problem?

We can have a one-pager that includes:

  • The set of problems we are still observing, like the continued memory consumption and the client-side throttling issues with the discovery client. We also need a deeper understanding of these issues and their mechanics.
  • A deeper understanding of how cloud providers scale their Kubernetes control planes and which metrics they depend on, and of why we experience service disruptions even with HA control planes when installing 1000s of CRDs. We can achieve this understanding by opening issues with the cloud providers for them to investigate and by having discussions with them.
  • A definition of the ideal state, such as reasonable response times for certain kubectl requests (cached and uncached) when there are 100s of CRDs, and memory usage per 100s of CRDs.
  • Scalability goal(s) for Crossplane in terms of the number of (active) managed resource types (CRDs) to support in a cluster of an appropriate size. This can be input for a Crossplane scalability guide and/or for advocating Crossplane use cases when CRD scalability thresholds are being discussed upstream.
  • The tooling to quickly reproduce these cases and to test progress against the ideal state: Go/Bash tools that run the tests against a given cluster and report the comparison, together with references to all known issues and PRs on the topic.
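As a starting point for that tooling, a minimal timing harness around kubectl could look like the sketch below. The probe names and commands are placeholders chosen for illustration, and access to a cluster under test is assumed:

```python
import shlex
import subprocess
import time

def timed(cmd: str) -> float:
    """Run a command and return its wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.run(shlex.split(cmd), check=True, capture_output=True)
    return time.monotonic() - start

# Hypothetical probes against a cluster under test; any kubectl invocation
# that exercises discovery or the CRD path would do. --cache-dir points the
# discovery cache at a fresh directory to force an uncached pass.
probes = {
    "crd-list": "kubectl get crds",
    "uncached-discovery": "kubectl api-resources --cache-dir=/tmp/fresh-cache",
}

if __name__ == "__main__":
    for name, cmd in probes.items():
        try:
            print(f"{name}: {timed(cmd):.1f} s")
        except (FileNotFoundError, subprocess.CalledProcessError):
            print(f"{name}: skipped (kubectl or cluster unavailable)")
```

Running the same probes before and after installing a provider package, and across API server versions with and without the lazy-marshalling fix, would give directly comparable numbers for the report.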

This would help all stakeholders see all the data we have, reproduce the problems quickly and make more informed decisions about how to move forward. The tooling could become part of the Kubernetes conformance tests at some point, but that is not in the scope of this issue.

Another goal we should work towards is to get CRDs as a scalability dimension in the Kubernetes thresholds document as discussed here.
