Consolidation of CRD Scaling Issues #2895

@ulucinar

Description

What problem are you facing?

Previously, in crossplane/terrajet#47, #2649 and kubernetes/kubernetes#105932, we observed high resource consumption in the Kubernetes API server when over 500 CRDs are installed. Especially with local kind clusters, installing provider-jet-aws with over 700 CRDs could render the cluster unresponsive on machines with relatively low CPU resources, due to the high CPU consumption. With managed Kubernetes offerings such as GKE, EKS and AKS, to varying degrees depending on the cluster configuration, we observed API service disruptions as we installed 100s or 1000s of CRDs. While discussing those issues as the Crossplane community, we tried a couple of different workarounds detailed in #2649, such as installing CRDs in smaller batches and giving the API server some time to finish aggregating their OpenAPI specs.

Meanwhile, after getting in touch with the upstream Kubernetes maintainers, we learned that they were already working on kubernetes/kube-openapi#251, which would introduce lazy marshalling for the aggregated OpenAPI specs of the installed CRDs served from the /openapi/v2 endpoint. Collecting profiling data, building kind images with that PR and testing them in local kind clusters, we observed substantial improvements in peak resource consumption. However, as discussed in #2649, some open points remained:

  • The discovery client throttles itself (client-side throttling) while discovering the exposed API if its cache is not yet established (first-time discovery) or has been invalidated (subsequent discoveries)
  • Metrics collected from the API server reveal a sustained memory consumption of ~2.75 GiB for ~600 CRDs, and memory consumption increases with the number of CRDs
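The throttling in the first bullet comes from client-go's token-bucket rate limiter: once the initial burst of tokens is spent, each further discovery request waits for the bucket to refill. The sketch below is a rough model of the resulting wall-clock cost; the QPS/burst numbers and the request counts are illustrative assumptions (real defaults vary by client and version), not measurements:

```python
# Simplified token-bucket model of client-go's client-side rate limiting.
# QPS/burst values here are illustrative; real defaults differ per client
# and Kubernetes version.

def discovery_wall_time(requests: int, qps: float, burst: int) -> float:
    """Seconds spent waiting to issue `requests` calls: the first `burst`
    are free, the rest wait for tokens refilled at `qps` per second."""
    throttled = max(0, requests - burst)
    return throttled / qps

# Hypothetical numbers: a provider with ~750 CRDs spread over ~150
# group-versions costs roughly one discovery request per group-version
# on a cold cache.
print(discovery_wall_time(150, qps=5.0, burst=100))  # 10.0 s of pure throttling
print(discovery_wall_time(750, qps=5.0, burst=100))  # 130.0 s if every CRD costs a request
```

This is why the delay shows up on first discovery and again whenever the cache is invalidated: each cold pass pays the full token-bucket cost.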

Although we could not directly observe the effects of the aggregated OpenAPI spec lazy-marshalling optimization in a managed Kubernetes offering [1], we expected it to alleviate control-plane service disruptions. The fix is available starting with Kubernetes v1.20.13, v1.21.7, v1.22.4 and v1.23.0, and as of this writing:

  • v1.23.1 and v1.22.4 are available for most GKE regions via the rapid GKE release channel, and
  • v1.22.4 and v1.21.7 are available for most AKS regions.
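For intuition, the lazy-marshalling idea in kubernetes/kube-openapi#251 can be thought of as caching each contributing spec's serialized bytes and re-marshalling only what changed since the last /openapi/v2 request. The following is a simplified, hypothetical illustration of that idea, not the actual kube-openapi implementation:

```python
import json

class LazySpecCache:
    """Caches the serialized form of each CRD's OpenAPI spec and
    re-marshals only the specs that changed since the last request."""

    def __init__(self):
        self._specs = {}        # name -> spec dict
        self._serialized = {}   # name -> cached serialized bytes
        self.marshal_calls = 0  # counts actual marshalling work, for illustration

    def put(self, name, spec):
        self._specs[name] = spec
        self._serialized.pop(name, None)  # invalidate only this entry

    def aggregate(self) -> bytes:
        parts = []
        for name, spec in sorted(self._specs.items()):
            if name not in self._serialized:  # marshal lazily, once per change
                self._serialized[name] = json.dumps(spec).encode()
                self.marshal_calls += 1
            parts.append(self._serialized[name])
        return b"[" + b",".join(parts) + b"]"

cache = LazySpecCache()
for i in range(700):
    cache.put(f"crd-{i}", {"kind": f"Kind{i}"})
cache.aggregate()           # first request marshals all 700 specs
cache.put("crd-0", {"kind": "Kind0v2"})
cache.aggregate()           # second request re-marshals only the changed spec
print(cache.marshal_calls)  # 701, instead of 1400 with eager re-marshalling
```

Under this scheme, installing one more CRD no longer forces the API server to re-serialize the specs of every CRD already present, which is where the peak CPU savings come from.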

This allowed us to test the hypothesis with the lazy-marshalling optimization, and we recently ran a number of tests with the managed Kubernetes offerings. Please note that the following are reports from single experiments, and only server-side issues are reported. For all clusters, even when there is no API service disruption, we always experience client-side throttling.

GKE Zonal

On a GKE v1.23.1 zonal cluster with three worker nodes of type e2-medium, it took 56 min for the ProviderRevision to acquire the Healthy condition for a provider-jet-aws installation with 763 CRDs. During this period, the cluster needed at least one repair, and there was over 50 min of service disruption. After some initial confusion between worker-node HA and control-plane HA, this thread suggests using regional clusters for control-plane high availability, and states that control-plane service disruptions are expected for zonal GKE clusters while GKE scales the managed control plane under load. However, as discussed next, the situation with a regional cluster was not good either.


GKE Regional

A v1.23.1 cluster was provisioned in the us-east1 region with one e2-medium node in each availability zone, for a total of 3 worker nodes across 3 zones. Although it took only ~150 s for the ProviderRevision to acquire the Healthy condition for the provider-jet-aws@v0.4.0-preview installation, the regional GKE cluster went through repairing mode at least three times afterwards. In between the cluster's "RUNNING" and "RECONCILING" states, we observed various kinds of errors in response to kubectl commands, notably connection errors and I/O timeouts while reaching the API server. It took over an hour for the cluster to stabilize, though the control plane was intermittently available for short periods during this time.


EKS

When we performed these tests, none of the Kubernetes versions carrying the lazy-marshalling optimization were available for EKS clusters. However, we used a v1.21.5-eks-bc4871b cluster with 3 worker nodes to test a provider-jet-aws@v0.4.0-preview installation. Grafana dashboards indicated a control plane with two members from the very beginning, before the provider-jet-aws package was installed into the system. It took the ProviderRevision about 2 min to acquire the Healthy condition, and the cluster stayed stable after the provider installation. A reexamination of the Grafana dashboards after 3 hours revealed that the two-member control plane had been scaled up and down during this period. Apart from the client-side throttling issues, the control plane is stable with a provider-jet-aws installation.


AKS

A v1.22.4 cluster was provisioned with two worker nodes. It took ~112 s for the ProviderRevision to acquire the Healthy condition for a provider-jet-aws@v0.4.0-preview installation, and the cluster stayed stable afterwards. We also installed provider-jet-azure@v0.7.0-preview on this cluster, bringing the total CRD count to 1430. It took 91 s for provider-jet-azure's ProviderRevision to acquire the Healthy condition, and ~40 s for the discovery client to refresh its cache. We observed some pod restarts due to timeouts, but the cluster stayed stable. However, adding provider-jet-gcp@v0.2.0-preview as a third provider finally "broke" the cluster: Crossplane had a hard time installing the package.

We also repeated these experiments on a two-worker-node AKS v1.20.13 cluster with similar results: it took 107 s for the ProviderRevision to acquire the Healthy condition, and the cluster stayed stable after the provider installation. Reaching a total of 780 CRDs, the cluster had no severe issues; client-side throttling kicked in as expected.


We also need a deeper understanding of which metrics the cloud providers employ when deciding to scale up their control planes. The usual metrics are CPU/memory consumption and/or utilization of the control-plane components, but they could as well depend on metrics such as the number of installed CRDs, API server response-time SLIs, etc.

Another dimension where we need clarification is why we observe control-plane service disruptions while the managed control plane is scaling up. It is perfectly expected that, at some point, as we continue adding more CRDs to the cluster, the control plane will need to scale up; but if it has high-availability features, why are we not isolated from the effects?

We are now at a point where there are several upstream Kubernetes issues and PRs, Crossplane issues and discussions happening in different places, which makes it hard to pin down the exact problems and root causes. Discussion is also still ongoing under #2649. This motivates us to open this issue so that we can decide how to move on.

[1]: Currently, we are not aware of a configuration option or another mechanism that would allow us to replace the kube-openapi with a custom build for an AKS/EKS/GKE cluster.

How could Crossplane help solve your problem?

We can have a one-pager that includes:

  • The set of problems we are still observing, like the continued memory consumption and the client-side throttling issues with the discovery client. We also need a deeper understanding of these issues and their mechanics.
  • A deeper understanding of how cloud providers scale their Kubernetes control planes and which metrics they depend on, and of why we experience service disruptions even with HA control planes when installing 1000s of CRDs. We can achieve this understanding by opening issues with the cloud providers for them to investigate and by having discussions with them.
  • A definition of the ideal state, such as reasonable response times for certain kubectl requests (cached and uncached) when there are 100s of CRDs, and memory usage per 100s of CRDs.
  • Scalability goal(s) for Crossplane in terms of the number of (active) managed resource types (CRDs) to support in a cluster of an appropriate size. This can be input for a Crossplane scalability guide and/or for advocating Crossplane use cases when CRD scalability thresholds are being discussed upstream.
  • The tooling to quickly reproduce these cases and to test progress against the ideal state: Go/Bash tools that run the tests against a given cluster and report the comparison, together with references to all known issues and PRs on the topic.
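As a starting point for that tooling, a minimal timing harness around kubectl could look like the sketch below. The probe names and commands are placeholders chosen for illustration, and access to a cluster under test is assumed:

```python
import shlex
import subprocess
import time

def timed(cmd: str) -> float:
    """Run a command and return its wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.run(shlex.split(cmd), check=True, capture_output=True)
    return time.monotonic() - start

# Hypothetical probes against a cluster under test; any kubectl invocation
# that exercises discovery or the CRD path would do. --cache-dir points the
# discovery cache at a fresh directory to force an uncached pass.
probes = {
    "crd-list": "kubectl get crds",
    "uncached-discovery": "kubectl api-resources --cache-dir=/tmp/fresh-cache",
}

if __name__ == "__main__":
    for name, cmd in probes.items():
        try:
            print(f"{name}: {timed(cmd):.1f} s")
        except (FileNotFoundError, subprocess.CalledProcessError):
            print(f"{name}: skipped (kubectl or cluster unavailable)")
```

Running the same probes before and after installing a provider package, and across API server versions with and without the lazy-marshalling fix, would give directly comparable numbers for the report.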

This would help all stakeholders see all the data we have, reproduce the problems quickly and make more informed decisions about how to move forward. The tooling could become part of the Kubernetes conformance tests at some point, but that is not in the scope of this issue.

Another goal we should work towards is to get CRDs as a scalability dimension in the Kubernetes thresholds document as discussed here.
