Skip to content

Populate NodeAffinity on top of labels for cloud based PersistentVolumes#64226

Merged
k8s-github-robot merged 1 commit intokubernetes:masterfrom
ddebroy:ddebroy-affinity1
Jul 9, 2018
Merged

Populate NodeAffinity on top of labels for cloud based PersistentVolumes#64226
k8s-github-robot merged 1 commit intokubernetes:masterfrom
ddebroy:ddebroy-affinity1

Conversation

@ddebroy
Copy link
Copy Markdown
Contributor

@ddebroy ddebroy commented May 23, 2018

What this PR does / why we need it:

This PR populates the NodeAffinity field (on top of the existing labels) for PVs backed by cloud providers like EC2 EBS and GCE PD.

Special notes for your reviewer:
Related to #63232

Sample describe pv output for EBS with node affinity field populated:

kubectl describe pv pv0001
Name:              pv0001
Labels:            failure-domain.beta.kubernetes.io/region=us-west-2
                   failure-domain.beta.kubernetes.io/zone=us-west-2a
Annotations:       <none>
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      
Status:            Available
Claim:             
Reclaim Policy:    Retain
Access Modes:      RWO
Capacity:          5Gi
Node Affinity:     
  Required Terms:  
    Term 0:        failure-domain.beta.kubernetes.io/zone in [us-west-2a]
                   failure-domain.beta.kubernetes.io/region in [us-west-2]
Message:           
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   vol-00cf03a068c62cbe6
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>

/sig storage
/assign @msau42

Release note:
NONE

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 23, 2018
@msau42
Copy link
Copy Markdown
Member

msau42 commented May 23, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 23, 2018
@msau42
Copy link
Copy Markdown
Member

msau42 commented May 23, 2018

/assign @verult

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So on GCE, we support Regional Disks, which are replicated in two zones. To represent that, we actually stuffed both zones in a single label.

You'll have to treat the zone label specially. And you'll need to use volumeutil.LabelZonesToSet in order to parse this label format. And then, each zone should be included as a separate element in the Values. See the example pv node affinity here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drawing parallels with the find*Labels() methods, we could have something like translate*Labels() that implement plugin-specific translation logic like this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use append on requirements instead of keeping track of i

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should only process/set NodeAffinity if VolumeScheduling feature is enabled.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how are we verifying this doesnt conflict with user provided labels and overwrites them ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make sure these labels passed in are the ones given by the cloud provider.

You also bring up a good point. If the user already specified a NodeAffinity in the PV, we should probably not override it.

Copy link
Copy Markdown
Contributor Author

@ddebroy ddebroy May 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point indeed. I did think about potential conflicts between user-provided labels and those applied by the system but based on this unit-test I thought overriding is(/was) preferred way: https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/admission/storage/persistentvolume/label/admission_test.go#L161

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed this briefly with @saad-ali and @verult. We think it's better if there's any conflict between user provided and cloud provider provided values, then we should error instead of blindly overwriting. Users should know that they made a configuration mistake because it could indicate that they are accidentally using the wrong volume than intended.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Thinking about how to detect conflicts and "merge" with an arbitrary pre-specified affinity in a safe way, the following scheme should work. Thoughts?

Scan through each NodeSelectorTerm in NodeAffinity.Required:
  Create a map <NodeSelectorRequirement, bool> populated with NodeSelectorRequirement s corresponding to cloud specific labels to record if a matching NodeSelectorRequirement has been found in current NodeSelectorTerm
    Scan through each NodeSelectorRequirement in the NodeSelectorTerm:
        If NodeSelectorRequirement.Key matches but operator or values do not match, return error due to conflict.
        If key, operator, value of NodeSelectorRequirement matches entry in map, mark it in map.
    Append NodeSelectorRequirement entries from map that have not yet been marked/matched with NodeSelectorRequirements (corresponding to cloud specific label) in current NodeSelectorTerm

This should result in the NodeSelectorRequirement entries corresponding to cloud specific labels being "and"ed to each NodeSelectorTerm entry of an arbitrarily specified NodeAffinity.Required block.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could work; alternatively maybe we just want to leave the NodeAffinity alone if the user already specified something?

Copy link
Copy Markdown
Contributor

@verult verult left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need for a way to convert from PV NodeAffinity back to zone labels? For example, when a new volume is created with VolumeScheduling enabled, AFAIU only has PV NodeAffinity, no labels. If the feature is then turned off, the cluster still needs labels to schedule correctly.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drawing parallels with the find*Labels() methods, we could have something like translate*Labels() that implement plugin-specific translation logic like this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could work; alternatively maybe we just want to leave the NodeAffinity alone if the user already specified something?

@msau42
Copy link
Copy Markdown
Member

msau42 commented May 24, 2018

For example, when a new volume is created with VolumeScheduling enabled, AFAIU only has PV NodeAffinity, no labels. If the feature is then turned off, the cluster still needs labels to schedule correctly.

We should still keep the old behavior with labels for backwards compatibility exactly for that reason

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 30, 2018
@ddebroy
Copy link
Copy Markdown
Contributor Author

ddebroy commented May 30, 2018

alternatively maybe we just want to leave the NodeAffinity alone if the user already specified something

@verult I will update the code with the above logic - it is certainly simple. I did want to confirm a couple of things around this:

  1. If the user specifies a NodeAffinity with any NodeSelectorRequirement (even completely unrelated to the cloud zones/regions) for a PV, we will completely skip appending/merging any NodeSelectorRequirement related to the clouds into the PV.NodeAffinity. The user in this case is therefore expected to manually specify NodeSelectorRequirements appropriately for the cloud.

  2. Is it possible, that just like the PV labelling controllers, there maybe another component/controller that tries to assign certain NodeSelectorRequirements to PV.NodeAffinity? If so, without any coordination between the controllers, there may be inconsistent behavior as the controller that gets to process the PVs first will append it's own NodeSelectorRequirements while the other ones back off upon detecting a NodeAffinity already populated. One way to resolve this would be to have a common "merging-of-NodeSelectorRequirements" mechanism for affinity that all controllers use so that the user will end up with consistent results. Let me know if this is too advanced of a scenario right now and okay to postpone for later.

@verult
Copy link
Copy Markdown
Contributor

verult commented Jun 4, 2018

Thanks for the great points

If the user specifies a NodeAffinity with any NodeSelectorRequirement (even completely unrelated to the cloud zones/regions) for a PV, we will completely skip appending/merging any NodeSelectorRequirement related to the clouds into the PV.NodeAffinity. The user in this case is therefore expected to manually specify NodeSelectorRequirements appropriately for the cloud.

Right, an argument could be made that if a user is already aware of PV NodeAffinity and is working with it, it's reasonable to expect them specify it correctly and completely.

As for (2), regardless of whether a controller exists today, it'd be nice to be future-proof here and avoid headaches if another controller writing to PV NodeAffinity is created in the future. Now I'm leaning towards your original proposed merging method for this reason. As long as the process is only additive to the field we should be OK. One thing to consider, as @msau42 pointed out offline, is there are multiple ways to express the same topology using NodeAffinity. See here for an example with CSI, but this issue could come up with Regional PDs.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check nil ptrs when dereferencing each field. Same for all tests

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could sort the two lists and do a DeepEqual

@ddebroy
Copy link
Copy Markdown
Contributor Author

ddebroy commented Jun 5, 2018

I will go ahead and implement the additive/merge logic for NodeSelectorRequirement elements corresponding to the zone and regions labels as discussed above. Will update this PR.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 12, 2018
@ddebroy
Copy link
Copy Markdown
Contributor Author

ddebroy commented Jun 12, 2018

I have implemented the merge mechanism for NodeSelectorRequirements corresponding to the cloud labels as discussed earlier.

@msau42
Copy link
Copy Markdown
Member

msau42 commented Jun 18, 2018

lgtm, go ahead and squash the commits

@verult could you test this out with repd?

@ddebroy
Copy link
Copy Markdown
Contributor Author

ddebroy commented Jun 18, 2018

commits squashed

@verult
Copy link
Copy Markdown
Contributor

verult commented Jun 21, 2018

Manually tested using regional PDs for the following cases:

  • Dynamically provisioned regional PD
  • Pre-provisioned PV without NodeAffinity
  • Pre-provisioned PV with only zone in NodeAffinity; works as expected - region information is not filled in
  • Pre-provisioned PV with both zone and region info.

Everything looks good!

@msau42
Copy link
Copy Markdown
Member

msau42 commented Jun 21, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 21, 2018
@ddebroy
Copy link
Copy Markdown
Contributor Author

ddebroy commented Jun 21, 2018

/assign @thockin @derekwaynecarr

Copy link
Copy Markdown
Contributor

@verult verult left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, LGTM otherwise

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This block can be moved to line 249

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: populateAffinity := utilfeature.DefaultFeatureGate.Enabled(features.VolumeScheduling) && len(volLabels) != 0, so you don't need the populateAffinity = true later

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: What about if reqs is empty?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added case to return false for empty reqs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

if tc.affinityMerged {
        if obj.Spec.NodeAffinity == nil {
                ....
        }
        if obj.Spec.NodeAffinity.Required == nil {
...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

block removed as no longer necessary

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd avoid this logic here - it's part of the same logic we want to test. Instead of affinityMerged, what if we had expectedNodeAffinity? We should be able to directly verify the result without having to analyze implementation details in the test like merging.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded expectations and matching with target using deepequal.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 24, 2018
@ddebroy
Copy link
Copy Markdown
Contributor Author

ddebroy commented Jun 26, 2018

Addressed CR comments from @verult, rebased and squashed commits

Signed-off-by: Deep Debroy <ddebroy@docker.com>
@verult
Copy link
Copy Markdown
Contributor

verult commented Jun 29, 2018

/lgtm

Thanks Deep!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 29, 2018
@ddebroy
Copy link
Copy Markdown
Contributor Author

ddebroy commented Jun 29, 2018

/assign @thockin

@ddebroy
Copy link
Copy Markdown
Contributor Author

ddebroy commented Jul 2, 2018

/assign @erictune @smarterclayton @luxas @wlan0 @deads2k

Adding some more folks for review/approval based on the OWNERS files mentioned above.

@thockin
Copy link
Copy Markdown
Member

thockin commented Jul 9, 2018

/approve

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ddebroy, msau42, thockin, verult

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 9, 2018
@k8s-github-robot
Copy link
Copy Markdown

Automatic merge from submit-queue (batch tested with PRs 64226, 65880). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 349d6a6 into kubernetes:master Jul 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.