Proposal: Break Up Large Providers by Service#3939

Merged
negz merged 18 commits into crossplane:master from negz:thats-the-breaks
Apr 25, 2023

Conversation

@negz
Member

@negz negz commented Apr 3, 2023

Description of your changes

Fixes #3754

This design document proposes that the 6-7 largest Crossplane providers be broken down into smaller, service-scoped ones. This would help folks install fewer CRDs, improving the ratio of installed-to-used Crossplane CRDs. Installing fewer CRDs is necessary to work around performance issues in the Kubernetes API server and Kubernetes clients.

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

I proof-read it. 😄

@negz negz requested a review from a team as a code owner April 3, 2023 20:08
@negz negz requested a review from hasheddan April 3, 2023 20:08
@negz negz force-pushed the thats-the-breaks branch from 6233b60 to 4082065 April 3, 2023 22:04
@negz negz requested review from turkenh and ulucinar April 3, 2023 22:05
@negz negz added the proposal label Apr 3, 2023
Comment on lines +88 to +112
* `upbound/provider-aws` - Becomes ~150 smaller providers.
* `upbound/provider-azure` - Becomes ~100 smaller providers.
* `upbound/provider-gcp` - Becomes ~75 smaller providers.
Member Author


It's outside the scope of this design, but for the record we have commitment from folks at Upbound to update:

  • Any large official providers we maintain (i.e. AWS, GCP, and Azure).
  • The upjet tooling.
  • The Upbound marketplace

@blakebarnett

I'm happy to see a solution for the short-term. It seems like filtering is the ideal solution to shoot for, but one for the long-term (2.0?). It'd be great if it could be targeted for later though. For a multi-cloud or operator-heavy cluster I could see them hitting 500 CRDs pretty frequently, and I assume the number of supported resources per-provider is only going to continue growing in a lot of cases.

@negz
Member Author

negz commented Apr 3, 2023

I'm happy to see a solution for the short-term. It seems like filtering is the ideal solution to shoot for, but one for the long-term (2.0?)

Just to be explicit, filtering is not something I propose we support long term. Or rather, while I don't want to rule anything out long term, I'm not proposing we break providers up as a stop-gap. I'm proposing it because I think it's the best option.

For a multi-cloud or operator-heavy cluster I could see them hitting 500 CRDs pretty frequently, and I assume the number of supported resources per-provider is only going to continue growing in a lot of cases.

Based on my projections and surveys of the community it doesn't seem that likely folks would hit 500 CRDs. FWIW the 500 number is also quite conservative (i.e. it's not like you hit 500 CRDs and everything starts to degrade immediately).

@blakebarnett

Just to be explicit, filtering is not something I propose we support long term. Or rather, while I don't want to rule anything out long term, I'm not proposing we break providers up as a stop-gap. I'm proposing it because I think it's the best option.

Understood. And I agree it's the most pragmatic solution, but we'll have to see how painful the break-up turns out to be later on. Perhaps a way to make it more user-friendly would be to include a "recommended" meta-providerconfig with the most commonly used set of providers per cloud provider? Or maybe some other way for a user to understand which of the provider configs will be required for a given configuration? I'm imagining it being pretty confusing for someone new to Crossplane.

@negz
Member Author

negz commented Apr 3, 2023

Perhaps a way to make it more user-friendly would be to include a "recommended" meta-providerconfig with the most commonly used set of providers per cloud-provider?

Yeah, we considered something like that. I think the tricky part would be landing on what the most commonly used bits are (given a broad audience) and evolving that over time without it getting too bloated.

One similar thing I am really excited about though is the fact that each org or team could put together their own meta-provider that has just the services they use. I could imagine starting small with 1-2 services and adding more over time as you learn what services you need. I'm pretty sure that if you're just doing this for your own org you'll almost certainly end up with more tightly scoped providers (i.e. fewer CRDs) than if we as a community tried to curate "one-size-fits-most" providers.
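For illustration, an org-specific meta-provider could be a plain Crossplane Configuration package that declares just the service-scoped providers a team needs as dependencies. This is only a sketch: the registry, package names, and versions below are hypothetical, since no service-scoped packages existed at the time of this discussion.

```yaml
# crossplane.yaml -- hypothetical team-curated "meta-provider" package.
# All names and versions below are illustrative, not real packages.
apiVersion: meta.pkg.crossplane.io/v1
kind: Configuration
metadata:
  name: acme-aws-services
spec:
  dependsOn:
    # Start small; add more service-scoped providers as needs grow.
    - provider: registry.example.org/acme/provider-aws-ec2
      version: ">=v0.1.0"
    - provider: registry.example.org/acme/provider-aws-iam
      version: ">=v0.1.0"
```

Installing this one Configuration would pull in only the listed providers (and their CRDs), rather than a monolithic provider's full set.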

@haarchri
Member

haarchri commented Apr 5, 2023

If we take platform-ref-aws from https://github.com/upbound/platform-ref-aws and apply this proposal to it, we have the following CRDs in the cluster:

  provider-split        CRDs
  crossplane              13
  eks-composition          2
  network-composition      2
  service-composition      2
  app-composition          2
  provider-helm            3
  provider-aws-ec2        98
  provider-aws-iam        22
  provider-aws-eks         6
  total                  150

So we see that 126 CRDs are installed directly by the 3 split providers (ec2, iam, eks), but we only use 15 of those CRDs.

If we instead used a CRD filter mechanism like @blakebarnett mentioned, we would end up with:

  crd-filter            CRDs
  crossplane              13
  eks-composition          2
  network-composition      2
  service-composition      2
  app-composition          2
  provider-helm            3
  provider-aws            15
  total                   39
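The two scenarios above can be cross-checked with a few lines of arithmetic (numbers taken directly from the tables; this is just a sanity check, not part of the proposal itself):

```python
# CRD counts per component, as reported in the tables above.
provider_split = {
    "crossplane": 13, "eks-composition": 2, "network-composition": 2,
    "service-composition": 2, "app-composition": 2, "provider-helm": 3,
    "provider-aws-ec2": 98, "provider-aws-iam": 22, "provider-aws-eks": 6,
}
# The crd-filter scenario replaces the three AWS providers with 15 filtered CRDs.
crd_filter = {**{k: v for k, v in provider_split.items()
                 if not k.startswith("provider-aws-")},
              "provider-aws": 15}

split_total = sum(provider_split.values())          # total CRDs after the split
aws_split = sum(v for k, v in provider_split.items()
                if k.startswith("provider-aws-"))   # CRDs from the 3 split providers
used = 15                                           # CRDs platform-ref-aws actually uses

print(split_total, aws_split, aws_split - used)     # 150 126 111
print(sum(crd_filter.values()))                     # 39
```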

So by splitting the providers we would still install 111 CRDs that are not needed. This simple platform-ref-aws setup already breaks the following assumption that there won't be more than 50-100 CRDs in a cluster:

Installing too many CRDs also affects the performance of Kubernetes clients like
kubectl, Helm, and ArgoCD. These clients are often built under the assumption
that there won’t be more than 50-100 CRDs installed on a cluster.

Let's also have a look at our company's Crossplane installation. We would have the following AWS providers installed after the split:

  provider                          CRDs
  provider-aws-acm                     2
  provider-aws-acmpca                  5
  provider-aws-appmesh                 7
  provider-aws-cloudwatchevents        8
  provider-aws-cloudwatchlogs          8
  provider-aws-directconnect          16
  provider-aws-docdb                   7
  provider-aws-dynamodb                7
  provider-aws-ec2                    98
  provider-aws-efs                     6
  provider-aws-eks                     6
  provider-aws-opensearch              3
  provider-aws-elb                     9
  provider-aws-elbv2                   5
  provider-aws-grafana                 5
  provider-aws-iam                    22
  provider-aws-kafka                   2
  provider-aws-kms                     7
  provider-aws-networkfirewall         4
  provider-aws-rds                    22
  provider-aws-route53                 9
  provider-aws-route53resolver         3
  provider-aws-s3                     23
  provider-aws-ses                    13
  provider-aws-sesv2                   6
  sum                                303

So we see that 303 CRDs are installed directly by the 25 split providers, but we only use 97 of those CRDs. By splitting the providers we would still install 206 CRDs that are not needed.

So at the moment I don't really see that the effort to split the providers will help decrease the number of installed CRDs this much in a real-world scenario. Also, the effort needed by operations teams to manage more providers (in our case 25) just to manage AWS resources, and the ability of Crossplane to run smoothly with this number of providers, is not even part of this discussion. Is anyone in our community running Crossplane installations with more than 10 or 20 providers?

@akesser

akesser commented Apr 5, 2023

In my opinion the numbers shown by @haarchri clearly support the filtering solution proposed by @blakebarnett. Only that approach would decrease the number of installed CRDs by an order of magnitude. Additionally, there would be no need to update the mechanism for resource cross-references in the short term, as all referenced resources would live in the same provider.

In the short term, the filtering could happen in crossplane-core, by not applying CRDs that the user excluded when installing a provider. In the long term, if a more generic approach for resource cross-references were established, a provider could be updated to also ignore these resources, even if it knows them. That way, several pods could also be used to spread the work, by installing a provider several times with different configurations, e.g. one pod only running reconcilers for *.ec2 resources, one only for *.rds resources, and one for all other resources of the provider.
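As an illustration of that sharding idea only: no such filter flag exists, but the then-alpha ControllerConfig API could in principle carry one, letting the same provider package run twice with different scopes. The `--enabled-api-groups` argument below is entirely hypothetical.

```yaml
# HYPOTHETICAL sketch: '--enabled-api-groups' is NOT a real provider flag.
# It only illustrates running one provider as multiple scoped pods.
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: aws-ec2-only
spec:
  args:
    - --enabled-api-groups=ec2.aws.upbound.io   # hypothetical filter
---
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: aws-rds-only
spec:
  args:
    - --enabled-api-groups=rds.aws.upbound.io   # hypothetical filter
```

Each Provider installation would reference one of these ControllerConfigs, so each pod reconciles only its slice of the provider's resources.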

I would not count the following in favor of splitting the providers:

  • Requires no changes to Crossplane core - no waiting months for core changes to reach
       beta and be on by default.

The proposed splitting sounds well suited to the clusters one shows in live demos at talks, with perhaps three resources and therefore three (sub-)providers installed. In a real-world cluster where the goal is to really move to IaC, this only adds the burden of managing more providers and keeping them in sync. I would favor waiting some months to implement a change in a reliable and correct way that eases the use of Crossplane, instead of complicating the path to a functioning installation by breaking providers up into pieces, most of which would now have to work as standalone providers.

@negz
Member Author

negz commented Apr 5, 2023

@haarchri In the numbers you've crunched it seems to me that we're seeing quite a big improvement in the ratio of installed-to-used CRDs. This is also what I saw when I ran through similar scenarios. Today for example the official provider-aws has 900 CRDs, so:

  • For platform-ref-aws we go from 900:15 to 126:15.
  • For your company we go from 900:97 to 303:97.
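Stated as ratios, a quick back-of-the-envelope check of the numbers above (illustrative only, not part of the original analysis):

```python
# Installed-to-used CRD ratios, before (monolith) and after (split).
scenarios = {
    "platform-ref-aws": (900, 126, 15),  # (monolith CRDs, split CRDs, used CRDs)
    "company install":  (900, 303, 97),
}
for name, (monolith, split, used) in scenarios.items():
    print(f"{name}: {monolith / used:.1f}:1 -> {split / used:.1f}:1")
# platform-ref-aws: 60.0:1 -> 8.4:1
# company install: 9.3:1 -> 3.1:1
```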

These numbers seem well within the capabilities of Kubernetes. I would not expect to be seeing the API server struggle with excess compute resource consumption, and I would not expect kubectl to experience discovery or CRD category query issues.

Is the only acceptable result as far as you're concerned to completely optimize the ratio of installed-to-used CRDs? That is, to make sure Crossplane never installs a CRD that you don't intend to use? If so, I would ask why? Also, do you use other CRD-based tools? Do those tools allow you to leave out the CRDs you don't need? What about built-in Kubernetes types?

I think based on #2869 you've previously answered my first question:

we have the same requirement from security department to not deploy all CRDs - only approved CRDs - which matches our IAM Policies

I'd love to hear more about it. Perhaps we should set up a call? My guess is you have fairly extreme defense-in-depth requirements? My thinking is:

  • Even when CRDs are installed Crossplane must still be granted IAM permissions at the cloud level, and API server users must still be granted RBAC access to work with the relevant CRs. This seems like quite a lot of defense alone.
  • Even when you don't install the CRDs, the underlying code is still there in the provider. Sure you can't trigger it, so the attack surface is smaller, but doesn't RBAC also achieve this?
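To make the first point concrete, here is a sketch of the kind of RBAC that limits users to a handful of managed resource kinds even when many more CRDs are installed. The API group and resource names assume the Upbound AWS provider's conventions and are illustrative.

```yaml
# Grants access to EC2 VPCs and Subnets only; all other installed
# AWS CRDs remain unusable by subjects bound to this role.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aws-networking-only
rules:
  - apiGroups: ["ec2.aws.upbound.io"]   # assumed provider API group
    resources: ["vpcs", "subnets"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
```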

At the end of the day if it's a hard requirement that Crossplane never installs a CRD that you don't want then yes the only tenable option is down-to-the-type level filtering (or compiling your own providers as you currently do). I wonder how many folks have this requirement. I believe "just adding filtering" is not as simple as it first sounds to most folks and in fact increases cognitive complexity for folks learning Crossplane quite a lot, so I think there should be a pretty high bar for doing it.

Also, the effort needed by operations teams to manage more providers (in our case 25) just to manage AWS resources, and the ability of Crossplane to run smoothly with this number of providers, is not even part of this discussion

I had intended to cover this in the design - see for example discussion of ProviderConfigs, compute resources, etc. Are there other areas you'd like me to cover? If so, let me know.

@negz
Member Author

negz commented Apr 5, 2023

Only that approach would decrease the number of installed CRDs by an order of magnitude.

@akesser Can you help me understand why you need the number of installed CRDs to be so dramatically reduced?

In the short term, the filtering could happen in crossplane-core, by not applying CRDs that the user excluded when installing a provider.

It is unfortunately not that simple. I could expand on this in the existing "Alternatives Considered" section if that would help.

The proposed splitting sounds well suited to the clusters one shows in live demos at talks, with perhaps three resources and therefore three (sub-)providers installed. In a real-world cluster where the goal is to really move to IaC, this only adds the burden of managing more providers and keeping them in sync.

The design doc includes a survey of ~40 community members and their real usage - I would not be proposing something if I didn't think it would work in the real-world. It would be helpful if you could elaborate on what concerns you have around management burden.

@akesser

akesser commented Apr 6, 2023

@negz I talked about decreasing the number of installed CRDs because you explicitly mentioned that most tools are designed to work well with clusters containing 50 to 100 CRDs, and the suggested approach of splitting the providers does not decrease the number of CRDs enough to reach that:

Installing too many CRDs also affects the performance of Kubernetes clients like
kubectl, Helm, and ArgoCD. These clients are often built under the assumption
that there won’t be more than 50-100 CRDs installed on a cluster.

By having ten to a hundred times more providers, you have to keep them updated, you have to keep them running, and you need more effort before the code reaches the cluster, e.g. because of the four-eyes principle for merging into your codebase.

@negz
Member Author

negz commented Apr 6, 2023

you explicitly mentioned that most tools are designed to work well with clusters containing 50 to 100 CRDs

Got it. A lot of work has been done to the ecosystem since then, so per the design the very conservative number is more like 500. (So probably more like 600, 700 in reality).

By having ten to a hundred times more providers, you have to keep them updated, you have to keep them running, and you need more effort before the code reaches the cluster, e.g. because of the four-eyes principle for merging into your codebase.

FWIW by my estimates a hundred times more providers would be quite a rare edge case.

Isn't keeping pods running Kubernetes's bread and butter? I'm not arguing it's zero additional operational burden, but I'm not sure it's meaningful.

@nabuskey

nabuskey commented Apr 7, 2023

Putting technical details aside, how does the community feel about the risk? What if this approach does not ultimately solve the problem for one reason or another? If Kubernetes improves its CRD handling drastically and can scale CRDs to 100000, do these providers need to be kept broken down?

Breaking up providers or going back to the single provider model in isolation is fine. However, taking recent changes to providers into consideration, I can't say with confidence it won't be a problem. For context, recent changes I am referring to are:

  1. Introduction of jet providers.
  2. Deprecation of jet providers.
  3. Introduction of upjet.
  4. Introduction of providers with same names.

In my opinion these changes caused fragmentation and confusion in reference docs, blog posts, and examples. I understand providers themselves do not define what Crossplane is, but I worry what kind of impression people get about Crossplane. I'd love to hear others' thoughts here.

@negz
Member Author

negz commented Apr 10, 2023

What if this approach does not ultimately solve the problem for one reason or another?

As far as I'm aware there are only two feasible solutions to the problem at hand: breaking up providers, or adding support for filtering the types providers install. On either path we should consider a way to ship the feature/fix and get feedback before we fully commit.

With filtering it would be the typical alpha feature lifecycle we use for Crossplane (complicated by the fact that the functionality needs to be added at both the core and provider level). So the feature would be behind a feature flag and off by default for some time to allow folks to try it out before we commit. If we found it didn't work well, anyone using it would be disrupted when we removed it or changed it in a breaking way.

If we instead break up providers I could imagine publishing some 'alpha' service-scoped providers to let folks try them out and give us feedback before we fully committed to the approach. If we found it didn't work well, anyone using it would be disrupted when we stopped building the service-scoped providers and they needed to switch back to the monoliths.

I suspect, given how many folks are impacted by the problem this proposal solves, that a lot of folks will opt in to the alpha (whether it be filtering or smaller providers), and thus a lot of folks will be disrupted if we found we didn't get it right up-front. It doesn't seem to me like either option is significantly less disruptive to "roll back".

This topic is probably worth a section in the proposal - I'll add one.

If Kubernetes improves its CRD handling drastically and can scale CRDs to 100000, do these providers need to be kept broken down?

This is an interesting thought experiment. I will say that given the multi-faceted nature of the issues and the slow, incremental pace of improvements I don't expect to see this be the case for at least 24 months. Keep in mind that we have pursued the "fix Kubernetes" approach for well over a year with limited success. I don't see any reason to believe things will change in the foreseeable future.

@akesser

akesser commented Apr 14, 2023

Our team created two PRs to show filtering of CRDs in crossplane-core and crossplane-contrib/provider-aws:

#3987
crossplane-contrib/provider-aws#1727

Feedback is welcome

negz added 12 commits April 25, 2023 13:55
In particular that addressing any security concern is a non-goal.

Signed-off-by: Nic Cope <nicc@rk0n.org>
I plan to expand on a few of these

Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
Adds a lot more detail on the constraints and nuances that may not be
obvious to folks. I'm also explicit about why I don't recommend it; not
because it's hard but because it makes reasoning about Crossplane harder.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
We're going to hold off on this at first - it's a further optimization
we can make (at the expense of more complexity) if/when needed.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
@negz negz force-pushed the thats-the-breaks branch from fed6131 to 6f7fd8a April 25, 2023 20:55
Signed-off-by: Nic Cope <nicc@rk0n.org>
@negz negz force-pushed the thats-the-breaks branch from 6f7fd8a to d5d68f3 April 25, 2023 20:58
@negz negz merged commit 1a68208 into crossplane:master Apr 25, 2023
@negz negz deleted the thats-the-breaks branch April 25, 2023 21:05
ulucinar added a commit to ulucinar/upbound-provider-gcp that referenced this pull request Apr 26, 2023
- This PR implements the upstream proposal at:
  crossplane/crossplane#3939

- Subpackages belonging to each API group are produced. An example is:
  provider-gcp-cloudplatform.
- ProviderConfig, ProviderConfigUsage and StoreConfig are part of a
  config package named provider-gcp-config.
- The monolith package (containing all the CRDs and associated controllers)
  is still produced.
- Each produced package except for the monolith package has the
  `pkg.crossplane.io/provider-family` label in its package metadata.
- Each service package except for the config package declares a
  dependency to the config package.

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
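Based on the commit description above, a service package's metadata might look roughly like the following sketch. The names, label value, and version are inferred from the text, not copied from the actual implementation.

```yaml
# Sketch of a service-scoped package's crossplane.yaml, per the
# commit description above. Names and versions are illustrative.
apiVersion: meta.pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-gcp-cloudplatform
  labels:
    pkg.crossplane.io/provider-family: provider-gcp  # family grouping label
spec:
  dependsOn:
    # Every service package depends on the shared config package,
    # which holds ProviderConfig, ProviderConfigUsage, and StoreConfig.
    - provider: xpkg.upbound.io/upbound/provider-gcp-config  # assumed path
      version: ">=v0.1.0"
```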
turkenh pushed a commit to turkenh/crossplane that referenced this pull request Apr 28, 2023
crossplane#3939

The above design proposes we break up larger providers like provider-aws
by service. We'd group these providers into a 'family' for two reasons:

1. So they could all share one ProviderConfig (and StoreConfig etc).
2. So they could cross-resource-reference each other.

We believe 1 will always be true. Hopefully 2 will eventually go away
with crossplane#1770.

During testing we realised providers in the same family would need RBAC
access to read all types in their family, i.e. to reference an MR or
read a ProviderConfig.

Signed-off-by: Nic Cope <nicc@rk0n.org>
(cherry picked from commit 68a81f7)
AndrewChubatiuk pushed a commit to AndrewChubatiuk/crossplane that referenced this pull request May 4, 2023
negz added a commit to negz/crossplane that referenced this pull request May 24, 2023