Match claims to classes using label selectors #870

@negz

Description

What problem are you facing?

In Crossplane 0.3 we introduced support for "provider specific" (aka "non-portable", aka "strongly typed") resource classes. During the development cycle we identified that if a claim must explicitly reference a provider specific resource class in order to enable dynamic provisioning, then the claim itself is now provider specific and no longer portable across providers. We solved this issue by introducing portable resource classes.

A portable resource class is effectively an indirection to a non-portable resource class. Portable resource classes can be set as the default for a particular namespace, and will be used by any resource claims of their corresponding kind that do not specify a resource class for dynamic provisioning, or a managed resource for static provisioning. This can be thought of as "publishing" non-portable resource classes (which may exist in a distinct namespace used to logically group infrastructure) to other namespaces, where applications may use them to satisfy claims for their infrastructure requirements.

This pattern is flexible and powerful, but Crossplane maintainers and community members have observed that it's verbose; three distinct Kubernetes resources (a claim, portable class, and non-portable class) must exist for dynamic provisioning to occur.

How could Crossplane help solve your problem?

Crossplane could remove the need for portable resource classes by using label selectors at the resource claim level, rather than the concept of a default class.

User experience

Resource claim authors who have a specific, non-portable resource class in mind set the claim's classRef to that resource class (including its type metadata), just as they did before #723. Resource claim authors who have a specific managed resource in mind set the claim's resourceRef, as they do today for static provisioning. Resource claim authors who don't know exactly which resource class they want omit both the classRef and the resourceRef, and supply zero or more classSelector labels to match.

For example:

---
# Imagine this resource claim is created by a hypothetical GitLab stack
# controller to satisfy a request for a new GitLab installation. It does not
# specifically name a resource class, but declares that it wants an
# experimental-grade PostgreSQLInstance compatible with the GitLab stack,
# in region us-east-1.
apiVersion: database.crossplane.io/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: gitlab-database
  namespace: gitlab-experiments
spec:
  classSelector:
    matchLabels:
      stack: gitlab
      grade: experimental
      region: us-east-1
  writeConnectionSecretToRef:
    name: app-postgresql-connection

Note that under this proposal the concept of a "portable class" is removed, so any reference to "class" or "resource class" means "non-portable resource class", for example CloudSQLInstanceClass.

From the claim author's perspective Crossplane would handle dynamic provisioning of the above PostgreSQLInstance claim by:

  1. "Finding" all classes that match its .spec.classSelector.matchLabels.
  2. "Picking one at random".
  3. Setting the PostgreSQLInstance .spec.classRef to the randomly selected resource class.
  4. Dynamically provisioning and binding as we do today.
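From the claim author's perspective, the first three steps amount to a label-subset match followed by a random pick. A minimal sketch, using plain dicts in place of Crossplane API types (all names here are illustrative, not real Crossplane APIs):

```python
import random

def matches(selector, labels):
    """True if every matchLabels entry in the selector is present on the class."""
    return all(labels.get(k) == v for k, v in selector.items())

def select_class(selector, classes):
    """Steps 1-2: find all classes matching the selector, pick one at random.

    Returns None when nothing matches, leaving the claim pending.
    """
    candidates = [c for c in classes if matches(selector, c["labels"])]
    return random.choice(candidates) if candidates else None

classes = [
    {"name": "prod-db", "labels": {"grade": "production"}},
    {"name": "dev-db", "labels": {"stack": "gitlab", "grade": "experimental", "region": "us-east-1"}},
]
selector = {"stack": "gitlab", "grade": "experimental", "region": "us-east-1"}
chosen = select_class(selector, classes)  # only dev-db carries all three labels
```

Note that an empty selector matches every class, which is consistent with "zero or more classSelector labels" above.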

This proposal has the following properties:

  • It supports portable, strongly typed, dynamic provisioning using only two concepts: resource claims and resource classes.
  • It works the same for a hypothetical non-portable CloudSQLInstanceClaim as it does for a portable PostgreSQLInstanceClaim.
  • It approximates support for "default resource classes". Label a resource class {default: "true"} to make it the global default for compatible resource claims (that select that label), or label it {default: "true", namespace: app-project1-dev} to make it the "default for a namespace" (really, the default for any claim that selects those labels).
  • It makes resource class selection non-deterministic. If multiple resource classes match the labels, one is chosen at random.
  • It removes the concept of "publishing" a class to a namespace. This means a cluster operator can set a class as the "default" (kind of - see above) for app namespaces that don't exist yet, but also makes it harder for app operators to determine what resource classes they could choose from, presuming the resource classes lived in a distinct infrastructure namespace.
  • It enables a rudimentary form of "matching" (dare I say scheduling?) of resource claims by decoupling the language used to match claims to classes from the (strongly typed) language used to describe classes.

Elaborating on that last bullet, it's hard to match the requirements of a resource claim to the capabilities of a resource class on spec alone due to the "lowest common denominator" problem of multi-cloud. The RedisCluster claim, for example, exposes a field named engineVersion, which specifies its desired version of Redis (e.g. "3.2"). Ideally we'd use this snippet of information to intelligently match a RedisCluster claim to a resource class for dynamic provisioning, but that is hard(TM). The following resource classes could satisfy a RedisCluster claim today:

  • GCP CloudMemorystoreInstanceClass has an equivalent field to engineVersion - it's named redisVersion and expects the version to be written as "REDIS_3_2", not "3.2".
  • AWS ReplicationGroupClass has an engineVersion field that expects the patch version to be included, i.e. "3.2.0".
  • Azure RedisClass has no field for engine version. It's just always version 3.2.

We could try to solve this by having resource claims expose the intersection (or the union?) of every possible resource class's set of fields in some standardised fashion, then teach each resource claim controller how to map those standardised claim fields to the provider specific resource class fields. Or we could just use labels.
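To make the field-mapping alternative concrete, here is the kind of per-provider translation a claim controller would have to maintain for this one field. The function is hypothetical; the field names and formats come from the three classes listed above:

```python
def translate_engine_version(provider, engine_version):
    """Map the portable engineVersion to a provider class (field, value) pair.

    Returns None when the provider class has no way to express it.
    """
    if provider == "gcp":
        # CloudMemorystoreInstanceClass writes "3.2" as redisVersion: REDIS_3_2.
        major, minor = engine_version.split(".")
        return ("redisVersion", "REDIS_{}_{}".format(major, minor))
    if provider == "aws":
        # ReplicationGroupClass expects the patch version included, i.e. "3.2.0".
        return ("engineVersion", engine_version + ".0")
    if provider == "azure":
        # RedisClass has no engine version field at all; it's always 3.2,
        # so there is nothing to match on spec alone.
        return None
    return None
```

Every new claim field multiplies this bookkeeping across providers, which is what the label-based approach sidesteps.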

For example this resource claim specifies that it needs Redis 3.2:

---
apiVersion: cache.crossplane.io/v1alpha1
kind: RedisCluster
metadata:
  name: gitlab-cache
  namespace: gitlab-experiments
spec:
  classSelector:
    matchLabels:
      stack: gitlab
      engineVersion: "3.2"
  writeConnectionSecretToRef:
    name: gitlab-cache-connection

Meanwhile, these resource classes declare that they can satisfy claims that need Redis 3.2, while still using documented, high fidelity, validated schemas when dynamically provisioning managed resources:

---
apiVersion: cache.gcp.crossplane.io/v1alpha2
kind: CloudMemorystoreInstanceClass
metadata:
  name: gitlab-development-default
  namespace: gcp-infra-dev
  labels:
    engineVersion: "3.2"
specTemplate:
  tier: STANDARD_HA
  region: us-west2
  memorySizeGb: 1
  redisVersion: REDIS_3_2
  providerRef:
    name: example
    namespace: gcp-infra-dev
  reclaimPolicy: Delete
---
apiVersion: cache.azure.crossplane.io/v1alpha2
kind: RedisClass
metadata:
  name: azure-redis-standard
  namespace: azure-infra-dev
  labels:
    engineVersion: "3.2"
specTemplate:
  resourceGroupName: group-westus-1
  location: West US
  sku:
    name: Basic
    family: C
    capacity: 0
  enableNonSslPort: true
  providerRef:
    name: example
    namespace: azure-infra-dev
  reclaimPolicy: Delete

Implementation Details

There are a lot of quotation marks in the above steps because what would actually be happening under the hood would be more like a race to satisfy the claim. In Crossplane there is not one controller for each resource claim kind, but rather one controller for each possible (resource claim, managed resource) tuple. This allows an infrastructure stack that adds support for a managed resource that could satisfy a resource claim to implement the dynamic provisioning and claim binding logic for said claim without having to touch Crossplane core. Put otherwise, crossplaneio/stack-gcp can add support for CloudSQLInstance resources to satisfy MySQLInstance claims in crossplaneio/stack-gcp, instead of teaching crossplaneio/crossplane how to dynamically provision a CloudSQLInstance. Hence the race. In reality the following would happen for the example above:

  • All resource claim controllers are updated to reconcile only claims that have a resourceRef or a classRef.
  • A new "class selector" controller is introduced for each (resource claim, resource class) tuple. Its job is to set the classRef of resource claims that omit it, using their classSelector.

Upon the creation of a PostgreSQLInstance without a classRef or resourceRef:

  1. Every class selector controller watching for PostgreSQLInstance claims has a reconcile queued. I'll use the (PostgreSQLInstance, CloudSQLInstanceClass) controller for this example, but (PostgreSQLInstance, RDSInstanceClass) and (PostgreSQLInstance, MysqlServerClass) controllers would run through the process in parallel.
  2. The controller would list all CloudSQLInstanceClass resources (in any namespace) that matched the PostgreSQLInstance claim's label selector.
  3. If no CloudSQLInstanceClass matched the labels, the reconcile is done. Otherwise, one of the matching CloudSQLInstanceClass resources is selected at random.
  4. The controller sleeps for a small, jittered amount of time then sets the classRef of the PostgreSQLInstance to the selected CloudSQLInstanceClass. The jitter reduces the chance of two controllers trying to set the classRef at the same time. If two controllers do try to set the classRef at the same time one will fail to commit the change due to the PostgreSQLInstance claim's resource version having changed since it was read.
  5. The reconcile is done. With the classRef set the PostgreSQLInstance now passes the watch predicates of the (PostgreSQLInstance, CloudSQLInstance) resource claim reconciler, which dynamically provisions and binds it.
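The conflict resolution in step 4 can be sketched as a compare-and-set keyed on the claim's resource version. This toy model stands in for the API server's optimistic concurrency; the class names in it are illustrative:

```python
class Claim:
    """Simulates a resource claim under optimistic concurrency control."""

    def __init__(self):
        self.resource_version = 1
        self.class_ref = None

    def set_class_ref(self, read_version, ref):
        """Commit the classRef only if the claim is unchanged since it was read.

        Exactly one of the racing class selector controllers succeeds; the
        loser sees a version conflict and gives up, because the claim no
        longer lacks a classRef.
        """
        if self.resource_version != read_version:
            return False  # conflict: another controller committed first
        self.class_ref = ref
        self.resource_version += 1
        return True

claim = Claim()
read = claim.resource_version  # both controllers read the same version
# Two controllers race after their jittered sleeps; the second loses.
won_gcp = claim.set_class_ref(read, "CloudSQLInstanceClass/gitlab-development-default")
won_aws = claim.set_class_ref(read, "RDSInstanceClass/some-class")
```

The jitter only reduces how often this conflict path is exercised; correctness comes from the version check.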
