
Add DynamicProvisioningScheduling support for EBS #65730

Merged
k8s-github-robot merged 1 commit into kubernetes:master from ddebroy:ebs-affinity1 on Aug 1, 2018

Conversation

@ddebroy ddebroy (Contributor) commented Jul 2, 2018:

What this PR does / why we need it:

This PR adds support for the DynamicProvisioningScheduling feature in EBS. With this in place, if VolumeBindingMode: WaitForFirstConsumer is specified in an EBS StorageClass and DynamicProvisioningScheduling is enabled, the EBS provisioner will use the selected node's LabelZoneFailureDomain as the zone in which to provision the EBS volume.
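The selected-node path can be sketched roughly as follows; this is an illustrative outline only, not the PR's actual implementation (the constant and function names here are hypothetical):

```go
package main

import "fmt"

// The zone label the provisioner reads off the scheduler-selected node.
const labelZoneFailureDomain = "failure-domain.beta.kubernetes.io/zone"

// zoneForSelectedNode returns the zone to provision the EBS volume in,
// taken from the selected node's failure-domain label.
func zoneForSelectedNode(nodeLabels map[string]string) (string, error) {
	zone, ok := nodeLabels[labelZoneFailureDomain]
	if !ok {
		return "", fmt.Errorf("selected node has no %s label", labelZoneFailureDomain)
	}
	return zone, nil
}

func main() {
	zone, _ := zoneForSelectedNode(map[string]string{labelZoneFailureDomain: "us-west-2a"})
	fmt.Println(zone) // us-west-2a
}
```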

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:
Related to #63232

Sample describe pv output with NodeAffinity populated:

~$ kubectl describe pv pvc-f9d2138b-7e3e-11e8-a4ea-064124617820
Name:              pvc-f9d2138b-7e3e-11e8-a4ea-064124617820
Labels:            failure-domain.beta.kubernetes.io/region=us-west-2
                   failure-domain.beta.kubernetes.io/zone=us-west-2a
Annotations:       kubernetes.io/createdby=aws-ebs-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller=yes
                   pv.kubernetes.io/provisioned-by=kubernetes.io/aws-ebs
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      slow3
Status:            Bound
Claim:             default/pvc3
Reclaim Policy:    Delete
Access Modes:      RWO
Capacity:          6Gi
Node Affinity:     
  Required Terms:  
    Term 0:        failure-domain.beta.kubernetes.io/zone in [us-west-2a]
                   failure-domain.beta.kubernetes.io/region in [us-west-2]
Message:           
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://us-west-2a/vol-0fc1cdae7d10860f6
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>

Release note:

none

/sig storage
/assign @msau42 @jsafrane

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Jul 2, 2018
@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 2, 2018
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 2, 2018
@msau42 msau42 (Member) left a comment:

Can you also add an e2e test for this? Under test/e2e/storage. You can probably reuse parts of the volume_provisioning.go tests, but you'll need to modify the StorageClass and create a pod too.

New multi-zonal e2es can be considered as a separate PR too.

Member:

Maybe you can suggest to use allowedTopologies instead

Member:

or zone/zones cannot be specified with VolumeBindingMode = WaitForFirstConsumer.

@ddebroy ddebroy (Contributor Author) commented Jul 3, 2018:

@msau42, from what I understand, with DynamicProvisioningScheduling enabled, AllowedTopologies will only be relevant for scenarios like GCE where multi-zonal volumes are supported, correct? In the case of EBS, I am ignoring AllowedTopologies. I will update the error message to what Jan suggested above.

Should we return error in case of EBS if AllowedTopologies is specified and the selected node is not in list of zones specified in AllowedTopologies?

Member:

No, AllowedTopologies is relevant to single-zone volumes too. It basically replaces the "zone" and "zones" parameters, and allows an admin to restrict the zones where volumes can be provisioned. The difference is that the scheduler will respect it when choosing a node.

Member:

Here's what I think the provision implementation of single-zone volumes should do:

  1. node not set, allowed topologies not set: old behavior
  2. node not set, allowed topologies set: choose from allowed topologies
  3. node set, zone/zones set: error always?
  4. node set, allowed topologies set or not set: use node's zone
  5. zone/zones set + allowedtopologies set: error
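The five cases above could be collapsed into a single decision function. A hypothetical sketch (the types and names are illustrative, not the PR's code):

```go
package main

import "errors"

// provisionInput captures the three signals discussed above.
type provisionInput struct {
	nodeZone     string   // zone of the scheduler-selected node; "" when no node
	zoneParams   bool     // legacy zone/zones StorageClass parameters present
	allowedZones []string // zones from allowedTopologies; nil when unset
}

// chooseZones applies the decision table: zone/zones conflicts with a
// selected node (case 3) or with allowedTopologies (case 5); a selected
// node's zone wins otherwise (case 4); then allowedTopologies (case 2);
// a nil result means the old pick-from-all-cluster-zones behavior (case 1).
func chooseZones(in provisionInput) ([]string, error) {
	if in.zoneParams && in.nodeZone != "" {
		return nil, errors.New("zone/zones cannot be combined with a selected node")
	}
	if in.zoneParams && in.allowedZones != nil {
		return nil, errors.New("zone/zones cannot be combined with allowedTopologies")
	}
	if in.nodeZone != "" {
		return []string{in.nodeZone}, nil
	}
	if in.allowedZones != nil {
		return in.allowedZones, nil
	}
	return nil, nil
}

func main() {}
```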

@msau42 msau42 (Member) commented Jul 2, 2018:

/ok-to-test

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 2, 2018
Member:

This could be a volume util that gce pd could use too

Member:

Actually @verult may have some comments here since he is looking at doing a similar allowedTopologies conversion for CSI

Contributor Author:

I think I will refactor this area a bit to create a common populateVolumeZones that we can share between GCE (single-zone) PD, EBS (and maybe Azure Disk when it gets zone support). The provider-specific volume options like kmskeyid / encryption should probably stay in the above provider-specific files' CreateVolume.

For Multi-zonal PDs, the logic here is going to be quite different.

Member:

I'm not sure if multi-zonal PDs will be different. This function is just creating a list of allowed zones. The only difference is that for multi-zonal PDs, it will pick 2 zones from the list instead of 1.

Contributor:

I don't think we'll be able to reuse the logic for CSI conversion as it'll be more complicated, for example there isn't a one-to-one mapping between allowedTopology and CSI topology representation.

Member:

I don't think we'll be able to reuse the logic for CSI conversion as it'll be more complicated, for example there isn't a one-to-one mapping between allowedTopology and CSI topology representation.

Can you give an example of what you mean? If there is no one-to-one mapping, then how will you be able to convert it?

@verult verult (Contributor) commented Jul 13, 2018:

This could be a volume util that gce pd could use too

Should we pull it out into volume util when we get to the GCE plugin? It shouldn't be too hard to do it later on. It's easy to make the wrong assumptions about commonalities if this is done prematurely. I think this is more of a problem when creating abstraction boundaries so I don't have a strong preference either way for this case.

  1. Flatten allowed topologies to a list of maps (csi topology format)

I don't think we should complicate the logic in in-tree volume plugins for the sake of aligning with CSI implementation details.

Can you give an example of what you mean? If there is no one-to-one mapping, then how will you be able to convert it?

AllowedTopologies can have ORs of ANDs of ORs, so there are multiple possible representations of the same topology constraint. CSI only has ORs of ANDs. To convert from CSI representation to AllowedTopologies, we could pick a canonical representation, similar to how we deal with the same issue with volume node affinity.
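The conversion described here (ORs of ANDs of ORs down to CSI's ORs of ANDs) amounts to taking a cross product of the value lists inside each term. A minimal sketch, with hypothetical types standing in for the real API structs:

```go
package main

// requirement is an illustrative stand-in for one MatchLabelExpression:
// a key with an OR over its values.
type requirement struct {
	Key    string
	Values []string
}

// flattenTerm expands one AND-of-requirements term into CSI-style
// OR-of-AND segments, e.g. [{zone,[a,b]},{rack,[r1]}] becomes
// [{zone:a,rack:r1},{zone:b,rack:r1}].
func flattenTerm(reqs []requirement) []map[string]string {
	segments := []map[string]string{{}}
	for _, r := range reqs {
		next := make([]map[string]string, 0, len(segments)*len(r.Values))
		for _, seg := range segments {
			for _, v := range r.Values {
				m := map[string]string{}
				for k, val := range seg {
					m[k] = val
				}
				m[r.Key] = v
				next = append(next, m)
			}
		}
		segments = next
	}
	return segments
}

func main() {}
```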

Member:

Ok, let's consider refactoring this when we get to the CSI implementation. But I feel the in-tree plugins use of allowed topologies is a subset of the CSI functionality, so it should be possible to get the list of zones out of the CSI representation. Plus, here we're only talking about converting from AllowedTopologies -> CSI, not the other way around.

So for now, I think a common util that only handles zones should be fine

Contributor Author:

CSI only has ORs of ANDs

@verult Why does CSI not support ORs of ANDs of ORs? For multi-zonal implementations like GCE PD, the "last" ORs bit corresponding to the values is where the possible set of zones is specified:

allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - us-central1-a
    - us-central1-b
    - us-central1-c

Will the above convention change for multi-zonal volumes in CSI?

Contributor:

AFAIU we could similarly have

allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - us-central1-a
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - us-central1-b
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - us-central1-c

to represent the same thing (in both cases we are saying: these are the zones the volume plugin is free to choose from), and an equivalent form of this is how CSI represents it.
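Both spellings above collapse to the same set of zones once the plugin gathers every value under the zone key into a set. A sketch with illustrative types, not the PR's code:

```go
package main

// The zone label key discussed in this thread.
const zoneKey = "failure-domain.beta.kubernetes.io/zone"

// matchLabelExpression is an illustrative stand-in for the API type.
type matchLabelExpression struct {
	Key    string
	Values []string
}

// zoneSet collects every zone value across all allowedTopologies terms.
// Set semantics make the two equivalent representations indistinguishable
// and silently deduplicate a zone listed more than once.
func zoneSet(terms [][]matchLabelExpression) map[string]bool {
	zones := map[string]bool{}
	for _, term := range terms {
		for _, expr := range term {
			if expr.Key != zoneKey {
				continue
			}
			for _, z := range expr.Values {
				zones[z] = true
			}
		}
	}
	return zones
}

func main() {}
```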

Member:

Should this be a set? The same zone could have mistakenly been specified multiple times.

Contributor:

I think API validation should enforce the following semantics, if it doesn't already:

  • There are no repeats in requirement Values
  • Each TopologySelectorTerm is unique (regardless of the ordering of any array fields)
  • Map semantics for MatchLabelExpressions

In other words, the type of AllowedTopologies is essentially set( set ( map(string, set[string]))), and the API should preserve these set and map semantics.

If the above is satisfied then we don't need a set here, given that zones are unique per region.

Member:

We can test with some examples. But if it doesn't validate, we'll have to think about if we can strengthen it; the validation reuses much of the existing selector validation, and we can't tighten that.

Contributor:

the validation reuses much of the existing selector validation, and we can't tighten that.

If it does turn out that current validation isn't sufficient, then I'd argue either the existing selector validation is somehow not strong enough, or maybe we shouldn't use these selectors

Member:

Can you open a bug for that? We can look into fixing that separately

Contributor:

Looked at the current validation code briefly and it doesn't seem like we have any structural validation today.

I'm happy to take this task if folks agree this is a good idea: #66184

Member:

I wonder if this should error if anything other than zone is specified. Because that is information the user expects us to use, but we didn't. cc @verult

Contributor Author:

I guess they could be arbitrary labels. The example in https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md#multi-label-volume refers to key: foo.io/rack which should be okay?

@ddebroy ddebroy (Contributor Author) commented Jul 12, 2018:

Please ignore the above; that example is not relevant. The one here does refer to key: rack in addition to key: zone. So should the user be allowed to specify arbitrary keys, in case some other code down the line decides to act on other keys?

Member:

Topology keys are a little different than label keys in Kubernetes, in that the topology keys are defined by the storage plugin and not the user, and the topology labels define the inherent restrictions of the storage provider. So I think we shouldn't mix plugin and user-defined keys in allowedTopologies. I think the safest course of action now is to error if there's a key we don't recognize.

Member:

one more test case for single zone param + allowed topologies

Contributor:

What about nodeWithLabels + some allowedTopologies?

Member:

you probably need to disable the feature gate in the else case, to handle the times when you call continue

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 11, 2018
Contributor:

Is this check necessary? the !ok case below should be sufficient

Contributor:

Shouldn't this be in API validation instead, if it's not already there? @msau42


Contributor:

nit: Comment on what "reject" means


Contributor:

"failed since zones parameter and multiple zones specified...

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 21, 2018
@ddebroy ddebroy (Contributor Author) commented Jul 23, 2018:

@msau42 @verult @jsafrane I have addressed the CR comments above and moved the common logic around zone population to PopulateZoneConfig in utils (so that GCE PD as well as Azure Disk can use it). PTAL.

Member:

I don't think that this PR should touch block volume code. Is it a result of an unsuccessful merge? VolumeMode is already set in VolumeMode: volumeMode ~35 lines above.

Contributor Author:

Thanks for catching this - will fix. Indeed a case of a bad merge.

Member:

We should document somewhere that only LabelZoneFailureDomain is supported on AWS (in a separate PR).

Member:

Should it return an error if allowedZones.Len() == 0? Such allowedTopologies are not very useful.

Contributor Author:

Good point. I was not sure if #66184 will check for and reject allowedZones.Len() == 0. @verult let me know if we need to check for and reject allowedZones.Len() == 0 here.

Member:

I'm not sure if API validation could handle this case.

You can have allowed topologies > 0 but not actually specify the zone label.

Member:

errors start with lowercase

Member:

errors start in lowercase

Member:

field names start in lowercase

Member:

errors start with lowercase

Member:

field / object names start with lowercase

Member:

The number of replicas is hardcoded to 1, so a lot of the code below is unused. I'd suggest focusing on topology in this PR and then sending a subsequent one for multi-zonal volumes.

Member:

The aws cloud provider interface abstraction seems a bit off to me. I feel like the aws cloud provider layer should not be processing the zone and zones arguments, which are StorageClass parameters. It seems better suited for the Kubernetes plugin layer to process, and pass in the selected zone to the aws cloud provider layer. Right now, both layers have to handle all the permutations of zone and zones.

If the aws cloud layer is refactored to push the zone argument processing up to the aws plugin, then I think this PopulateZoneConfig could be greatly simplified, and only needs to return a list of zones, either from allowedTopologies or from old zone/zones. @jsafrane wdyt?
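The refactor suggested here reduces the cloud-layer contract to "one zone in". A hypothetical sketch of the plugin-layer side, with spread-by-hashing-the-PVC-name standing in for whatever selection policy the plugin actually uses (the function name is illustrative):

```go
package main

import "hash/fnv"

// pickZoneForVolume chooses one zone from the candidate list resolved by
// the plugin layer (from allowedTopologies or the legacy zone/zones
// parameters), so the cloud layer only ever sees a single zone. Hashing
// the PVC name spreads volumes across the candidates deterministically.
func pickZoneForVolume(candidates []string, pvcName string) string {
	if len(candidates) == 0 {
		return "" // caller falls back to the old all-cluster-zones behavior
	}
	h := fnv.New32a()
	h.Write([]byte(pvcName))
	return candidates[int(h.Sum32()%uint32(len(candidates)))]
}

func main() {}
```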

Contributor Author:

With an extra function in the CloudProvider interface to expose something like GetAllCurrentZones/getCandidateZonesForDynamicVolume (which PopulateZoneConfig can invoke for the no-zone/zones-specified scenarios), we should be able to refactor all the zone/zones processing in CreateDisk here into PopulateZoneConfig.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 26, 2018
@ddebroy ddebroy (Contributor Author) commented Jul 28, 2018:

Addressed CR comments so far with the following major changes:

  1. Refactored the zone/zones parameter processing completely into the volume layer, so that the cloud layer can simply pass a single zone through to the EC2 APIs.
  2. Removed the extra logic (and tests) around handling of multiple replicas. Will put that logic in a separate PR for GCE PD.

The Azure zone-awareness PRs are also coming in now, and those can use the common util in this PR for zonal volume creation.

@ddebroy ddebroy (Contributor Author) commented Jul 30, 2018:

@jsafrane @msau42 @verult PTAL when you get a chance.

@msau42 msau42 (Member) left a comment:

/lgtm
just a minor nit

Member:

nit: add a comment when test args start, and when test expected results start

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jul 30, 2018
@ddebroy ddebroy (Contributor Author) commented Jul 31, 2018:

/test pull-kubernetes-e2e-gce

@msau42 msau42 (Member) commented Jul 31, 2018:

/lgtm
@jsafrane can you also review, especially from aws perspective?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 31, 2018
@jsafrane jsafrane (Member) commented Aug 1, 2018:

AWS code is fine. Please squash the commits before it gets merged.

/approve
/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 1, 2018
@jsafrane jsafrane (Member) commented Aug 1, 2018:

/approve no-issue

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 1, 2018
Signed-off-by: Deep Debroy <ddebroy@docker.com>
@ddebroy ddebroy (Contributor Author) commented Aug 1, 2018:

/test pull-kubernetes-e2e-gce-100-performance

@msau42 msau42 (Member) commented Aug 1, 2018:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 1, 2018
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: ddebroy, jsafrane, msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot:

/test all [submit-queue is verifying that this PR is safe to merge]

@ddebroy ddebroy (Contributor Author) commented Aug 1, 2018:
ddebroy commented Aug 1, 2018

/test pull-kubernetes-e2e-kops-aws

@k8s-github-robot:

Automatic merge from submit-queue (batch tested with PRs 65730, 66615, 66684, 66519, 66510). If you want to cherry-pick this change to another branch, please follow the instructions here.


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • sig/storage: Categorizes an issue or PR as relevant to SIG Storage.
  • size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.
