Balanced resource allocation priority to include volume count on nodes #60525

Merged
k8s-github-robot merged 2 commits into kubernetes:master from ravisantoshgudimetla:scheduler-pvc
Mar 31, 2018

Conversation

@ravisantoshgudimetla
Contributor

@ravisantoshgudimetla ravisantoshgudimetla commented Feb 27, 2018

Scheduler balanced resource allocation priority to include volume count on nodes.

/cc @aveshagarwal @abhgupta

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #58232

Release note:

Balanced resource allocation priority in scheduler to include volume count on node 

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 27, 2018
@ravisantoshgudimetla
Contributor Author

/sig scheduling

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Mar 1, 2018
Contributor

The variance's value may be zero, which could cause a crash; you'd better verify it.

Contributor Author

The only way for that to happen is if the requested memory, CPU, etc. are 0 or < 0; in either case we won't reach any of the priority functions. But I understand your point, I will add a check.

@ravisantoshgudimetla
Contributor Author

/cc @k82cn @bsalamat

@k8s-ci-robot k8s-ci-robot requested review from bsalamat and k82cn March 2, 2018 16:51
@ravisantoshgudimetla ravisantoshgudimetla force-pushed the scheduler-pvc branch 2 times, most recently from 901f801 to 6b7b83b Compare March 3, 2018 23:31
@ravisantoshgudimetla ravisantoshgudimetla changed the title [WIP] Balanced resource allocation priority to include volume count on nodes. Balanced resource allocation priority to include volume count on nodes. Mar 3, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 3, 2018
@ravisantoshgudimetla
Contributor Author

/retest

Member

nit: I would rename to BalanceAttachedNodeVolumes and would update the comment to say that this impacts scheduling decisions and the scheduler tries to place a Pod that requires an attached volume on a Node with fewer volumes.

Contributor Author

Sure.

Member

Please remove the blank line.

Member

Can't we have shared volumes?

cc/ @msau42

Member

Do you care about counting bind mounts? There is always a bind-mount (done by the container runtime) for all the volumes referenced in the container spec.

At the pod level, you can share volumes between pods, but the number of mounts depends on the volume type. Sharing between pods on the same node is not that common; typically, the scheduler tries to spread pods across nodes.

  • In the case of NFS, each pod has its own NFS mount.
  • For something like GCE PD (sharing should be a rare use case), you will have only one global node mount, but a bind mount per pod.
  • Volume types like EmptyDir, ConfigMap, Secret, and Downward API are not mounted.
  • HostPath volumes may or may not be mounted; we can't tell from the spec.

Contributor Author

This is a function used in testing; that's why I did not bother checking that. If you think it's important, I can add a check.

Member

Thanks Ravi for pointing out. I missed the fact that it is a test only function. We don't need to change it.

Member

NodeInfo is a part of the scheduler cache, but we update the information in this predicate. This is not a good idea: if the predicate is not executed, the information in the cache gets stale, which could lead to subtle bugs. Why don't we update the values in NodeInfo.AddPod() and .RemovePod()?

Contributor Author

Actually, we are not caching anything now. For example, say we chose node n1 during pod p1's scheduling. When we are scheduling the next pod p2, we are going to recompute the attached volumes and the max number of volumes again. As I mentioned in the comment, for lack of a proper data structure through which we can share information across predicates and priorities, I am using it. I don't claim this is clear or better; I am open to suggestions.

Member

Actually, we are not caching anything now.

I understand you don't need to cache anything and this is just a way of passing information around, but the nodeInfo pointer is a part of the scheduler's cache. When you update the fields of the struct, you are effectively keeping the information in the scheduler's cache. The existence of such information in the scheduler cache would be confusing for someone who reads the code later.
Ideally, we should keep this in some temporary state that is valid within a single schedule cycle and is destroyed at the end of the cycle, but we don't have such a structure and would need to add it if needed.

Contributor Author

Ok let me think of some alternate ways to do this and get back to you. Thanks again for the review.

Contributor

Agree with Bobby, prefer not to update NodeInfo outside of cache.

Member

variance could be larger than 1, causing 1 - variance (below) to be negative. I guess you have forgotten to divide by 3.

Contributor Author

I guess you have forgotten to divide by 3.

🤣 Lol, thanks for catching it :)

variance could be larger than 1, causing 1 - variance (below) to be negative.

I thought about that, but it seems our sample set always consists of fractions, and

  • fraction − mean (which is a fraction) would be a fraction again.
  • the square of (fraction − mean) will be even lower than the original fraction.

Not sure if I missed any edge case.

Member

You are right, but given that you have not divided by 3, it could be larger than 1. For example, if each component is 0.9, then variance will be 0.81 + 0.81 + 0.81. If you divide by 3, variance will be less than 1.
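The fix being discussed can be sketched as follows. This is a minimal illustration of the (1 − variance) scoring scheme with the divide-by-3 fix applied, not the upstream implementation; the function name and the ×10 scaling are assumptions for the example. Dividing the sum of squared deviations by the number of components keeps the variance in [0, 1] when each fraction is in [0, 1], so the score never goes negative:

```go
package main

import "fmt"

// balancedScore computes (1 - variance) * 10 over the three utilization
// fractions (CPU, memory, volumes), each assumed to be in [0, 1].
// Without the division by 3, three components of 0.9 each would give a
// "variance" of 0.81 * 3 = 2.43 and a negative score.
func balancedScore(cpuFraction, memFraction, volumeFraction float64) float64 {
	mean := (cpuFraction + memFraction + volumeFraction) / 3
	variance := ((cpuFraction-mean)*(cpuFraction-mean) +
		(memFraction-mean)*(memFraction-mean) +
		(volumeFraction-mean)*(volumeFraction-mean)) / 3
	return (1 - variance) * 10
}

func main() {
	// Perfectly balanced utilization: variance is 0, score is maximal.
	fmt.Printf("%.3f\n", balancedScore(0.5, 0.5, 0.5)) // 10.000
	// Imbalanced utilization: variance grows, score drops.
	fmt.Printf("%.3f\n", balancedScore(0.9, 0.1, 0.5)) // 8.933
}
```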

Contributor Author

So, you were commenting on my version of variance (without the division by 3). I thought you were commenting on variance in general, that it could be greater than 1 (which is true as well). Sure, I will make that change.

Member

Yes, I was referring to this implementation.

Contributor

Here's the definition of variance; I'm not sure why we use it:

In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean.

Contributor Author

So, we wanted to know how close the CPU utilization, memory utilization, and number of volumes on a node are to each other. The closer they are, the better balanced the node is. I tried explaining the same here: #58232 (comment)

I could have used standard deviation as well, but I thought of stopping at variance.

Contributor

Great, got that.

Contributor

Maybe standard deviation is better; it'll also address Bobby's comments above :)

Member

Just dividing by 3 addresses my comment, and since we want to compare one node against others, standard deviation and variance both work equally well here. Given that taking the square root is computationally expensive, I would just use variance.

Contributor Author

Yes, computational cost was the reason that I stopped going towards standard deviation.

Contributor

ok to me :)

Member

This volume predicate only counts 3 volume types, and only one of them can run in a single cluster:

  • GCE PD
  • AWS EBS
  • Azure disk

Did you also want to count mounts for other volume types not included here, like NFS, Gluster, Ceph?

Also, the limits here are more about how many disks can be attached to a single node, not how many could be mounted. So if you had multiple pods sharing the same volume on the same node, this predicate would only count 1 of them, even though there will be a mount for each pod. But multiple pods sharing the same volume on a single node is not the common use case.
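The distinction drawn here between attached volumes and per-pod mounts can be illustrated with a small sketch. The `pod` type and function name below are hypothetical stand-ins, not the upstream predicate code; the point is that a volume shared by several pods on a node counts once toward the attach limit, even though each pod gets its own mount:

```go
package main

import "fmt"

// pod is a minimal stand-in for the parts of the Pod spec this sketch
// needs (illustrative, not the upstream API type).
type pod struct {
	pvcNames []string // attachable volumes referenced by the pod
}

// countAttachedVolumes counts unique attachable volumes across the pods
// on a node. A volume shared by several pods is attached to the node
// once, so it is counted once, mirroring how an attach-limit check would
// count it; the number of mounts would be higher.
func countAttachedVolumes(pods []pod) int {
	seen := map[string]bool{}
	for _, p := range pods {
		for _, name := range p.pvcNames {
			seen[name] = true
		}
	}
	return len(seen)
}

func main() {
	pods := []pod{
		{pvcNames: []string{"data-0"}},
		{pvcNames: []string{"data-0", "logs"}}, // shares "data-0" with the first pod
	}
	// 2 unique attachments, even though the pods reference 3 volumes total.
	fmt.Println(countAttachedVolumes(pods)) // 2
}
```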

Member

In the future, it may be possible that you can have multiple volume types in the same cluster that each have their own max limit as well.

Contributor Author

Thanks for the information @msau42. I thought we were going in the direction of using a ConfigMap that contains the max number of volumes per node based on cloud provider and CSI plugin, after which we could use the ConfigMap to get the information.

Member

I think the main challenge is that this predicate is only meant to restrict the max number of attached volumes, not the overall number of mounts. If you want to spread the number of attached volumes, then this predicate will work.

But it sounds like you want to spread out the number of mounts, including for non-attachable plugins.

If that's the case, then you may need to consider a different method of counting mounts, such as just counting the total number of volumes in assigned pods. But that still won't be completely accurate because each volume type may create a different number of mounts, like I mentioned earlier.

Contributor Author

But it sounds like you want to spread out the number of mounts, including for non-attachable plugins.

So, we wanted to find nodes that have a balanced number of volumes here; I'm not sure where this PR gave the impression of spreading mounts. While finding a node that has balanced mounts would be even more useful, I think it would have to depend on a lower-level API talking to the cloud provider, or listing /etc/mtab on Linux, etc., and is beyond the scope of what we want to do here.

Contributor Author

Maybe improving the wording in the comments here to explicitly say we're balancing attachable volume counts would make it more clear.

Apologies if the comments are causing confusion. I thought I made it clear at least from the start of the PR; if you think otherwise, I can edit the existing comments. Please let me know where the confusion is. Even if it is in code, I am happy to modify it.

Member

Maybe we can be more explicit in some of the variable names. Like, instead of AllocableVolumes, something like AttachableVolumesLimit.

Contributor Author

I don't think AttachableVolumesLimit actually reflects the allocatable volumes on the node. My intention there was to have a variable which holds the number of allocatable volumes currently on the node. AttachableVolumesLimit may be better suited for c.maxVolumes. I don't have a strong opinion on it though.

Member

Oh sorry, yes, what about something like "CurrentAttachedVolumesCount"? I want to be explicit that we are talking about attachable counts here. There is another completely unrelated volume feature that uses the term "allocatable", so it would be better to align with the terminology that we use in the documentation and API fields.

Contributor Author

Sure. I can do that.

Contributor

is it possible to also merge volumes into Resource? so no interface changes.

Contributor Author

I thought about it, but it comes mostly under scalar resources, and I think there is no definitive way on the kubelet side to introspect the node and get that information. It would be ideal to have that kind of information there, though. @bsalamat WDYT?

Member

is it possible to also merge volumes into Resource? so no interface changes.

That would require updating the scheduler cache with the volume information, and that in turn would require kubelets to report volume counts, etc. as a part of allocatable resources. This would be a larger project requiring involvement of the node team. I wouldn't go down this path if we can solve the problem without changes to the kubelet.

Member

@gnufied is working on a design for dynamic node reporting of attachable volume limits per plugin. So I think when that is implemented, the limits will actually be in the Node object.

Member

I think when that is implemented, the limits will actually be in the Node object.

That will be great and will make our logic cleaner.

Contributor Author

@ravisantoshgudimetla ravisantoshgudimetla Mar 6, 2018

So I think when that is implemented, the limits will actually be in the Node object.

Thanks @msau42. That will be awesome. Oh, by the way, then we could actually cache it in nodeinfo :).

Contributor

That may relate to our previous comments: do not update nodeInfo outside of the cache :)

I think when that is implemented, the limits will actually be in the Node object.

That will be great and will make our logic cleaner.

Great, @gnufied any related doc/issue?

Member

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 8, 2018
@ravisantoshgudimetla
Contributor Author

@bsalamat @k82cn I have attempted to create a transient data structure which should live only for the duration of a single scheduling cycle. Commit 7359fe2794051fdabab8abaa359a343d2fefe74c has it.

Sample usage of structure could be found at https://github.com/ravisantoshgudimetla/kubernetes/blob/c6b9f133fe381f91394df02b6cf0f4c41dd5d6fe/pkg/scheduler/algorithm/priorities/resource_allocation.go. Please let me know your thoughts on the approach.
cc @aveshagarwal

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 12, 2018
@ravisantoshgudimetla
Contributor Author

ravisantoshgudimetla commented Mar 16, 2018

ping @bsalamat @k82cn. I will fix the test cases, but before that I want your opinions on the approach taken and whether it makes sense. This could wait till 1.10 is out.

@bsalamat
Member

@ravisantoshgudimetla
Two issues that I see with the current approach:

  1. It needs passing the transient node info to predicate and priority functions. This needs changes to many places including modules other than the scheduler. Unfortunately, I can't think of a clean option at the moment. One not-so-clean option is to add the new structure under NodeInfo with "Transient" in the name and a clear comment that the information in this struct is valid only during one schedule cycle.
  2. Predicate and priority functions run in parallel for different nodes. So, reading and writing the transient node info requires synchronization.

@ravisantoshgudimetla
Contributor Author

Thanks @bsalamat

It needs passing the transient node info to predicate and priority functions. This needs changes to many places including modules other than the scheduler.

That's why the PR has grown to 39 files and counting :(.

One not-so-clean option is to add the new structure under NodeInfo with "Transient" in the name and a clear comment that the information in this struct is valid only during one schedule cycle.

I had started along similar lines, but putting this struct in the nodeinfo struct would be cleaner. I will use this approach.

Predicate and priority functions run in parallel for different nodes. So, reading and writing the transient node info requires synchronization.

Good point. Till now, most of the fields in the struct are updated only at one location. I will make sure that we have locking in place for the transient node info.
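A minimal sketch of what such a synchronized transient structure could look like. The type, field, and method names here are illustrative assumptions, not the exact ones that landed in the PR; the point is that per-cycle state written by parallel predicate/priority goroutines needs a lock, and the reset keeps the locked path as the only entry point:

```go
package main

import (
	"fmt"
	"sync"
)

// transientSchedulerInfo holds per-node state that is valid only within a
// single scheduling cycle and must be reset before the next one.
type transientSchedulerInfo struct {
	mu sync.Mutex
	// requestedVolumes is accumulated by predicates during this cycle;
	// priorities read it later in the same cycle.
	requestedVolumes   int
	allocatableVolumes int
}

// addVolumes is safe to call from parallel per-node goroutines.
func (t *transientSchedulerInfo) addVolumes(n int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.requestedVolumes += n
}

// reset clears the per-cycle fields. Inlining the reset body here, rather
// than exposing a separate unlocked helper, prevents callers from bypassing
// the lock by mistake.
func (t *transientSchedulerInfo) reset() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.requestedVolumes = 0
	t.allocatableVolumes = 0
}

func main() {
	info := &transientSchedulerInfo{}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // predicates run in parallel across nodes
		wg.Add(1)
		go func() { defer wg.Done(); info.addVolumes(1) }()
	}
	wg.Wait()
	fmt.Println(info.requestedVolumes) // 10
	info.reset()                       // end of the scheduling cycle
	fmt.Println(info.requestedVolumes) // 0
}
```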

@k8s-ci-robot k8s-ci-robot removed the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 22, 2018
@ravisantoshgudimetla
Contributor Author

/retest

1 similar comment
@ravisantoshgudimetla
Contributor Author

/retest

@ravisantoshgudimetla
Contributor Author

@bsalamat I addressed your comments. PTAL.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Mar 29, 2018
Member

You do not need the "if" part.

In fact, given that TransientInfo is initialized with nodeInfo, please remove the whole else part and do not initialize it here.

Contributor Author

Done Bobby. PTAL.

Member

One last thing. This method is not needed. I would remove it and inline the body in resetTransientSchedulerInfo to prevent future mistakes of using it directly (without locking).

Contributor Author

Updated, thanks. I did not want to have a big function which resets all the transient values; that was the reason I separated that block into a different function. But I understand your point about calling this function directly without acquiring the lock.

Member

@bsalamat bsalamat left a comment

/lgtm

Thanks, @ravisantoshgudimetla!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 30, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 30, 2018
@ravisantoshgudimetla
Contributor Author

/retest

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 54997, 61869, 61816, 61909, 60525). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 9847c8e into kubernetes:master Mar 31, 2018
@ravisantoshgudimetla ravisantoshgudimetla deleted the scheduler-pvc branch March 31, 2018 14:36
@gnufied
Member

gnufied commented May 3, 2018

@k82cn @msau42 @ravisantoshgudimetla @bsalamat I am sorry I did not get a chance to comment on this before, but one big problem with what we merged in this PR is - it will only work for Azure, AWS and GCE volume types. It will not work for any other volume type.

Also, it sounds like this proposal set out to fix "balance PVCs on node", but in reality, because it requires a maximum limit of volumes, it will only EVER work for volume types that have such upper limits. In a nutshell, this proposal will perhaps not work for something like glusterfs, iscsi, etc. for the foreseeable future.

@ravisantoshgudimetla
Contributor Author

@gnufied I think this works for any volume type as long as the max volume limit parameter is set.

it will only EVER work for volume types that have such upper limits.

I think the limit comes from the max number of volumes that can be attached to a node, irrespective of volume types like glusterfs, iscsi, etc. IOW, even if glusterfs provides unlimited volumes, there is a limit on the number of volumes that can be attached to a machine, depending on the operating system and filesystem used.

@gnufied
Member

gnufied commented May 4, 2018

The problem I was talking about is: since Kubernetes has no knowledge of the applicable limits for volume types like glusterfs etc., balanced volume allocation will not work for them. Even though there will indeed be a theoretical limit for glusterfs etc. (before it saturates the network), there is no way that information is exposed inside Kubernetes, and no way for an admin to set those limits.

Discussed this offline with @jsafrane and @childsb, and we think that it may be possible to return a "dummy" upper limit for those volume types (a high number such as 1000, let's say) so the fractions that this proposal requires can still be calculated. Do we really care about the real maximum volume count for the purpose of calculating the fractions? It appears to me that any high number should do the job and this design will still work. @ravisantoshgudimetla correct me if I am wrong.

ravisantoshgudimetla added a commit to ravisantoshgudimetla/kubernetes that referenced this pull request Jun 22, 2021
kubernetes#60525 introduced
Balanced attached node volumes feature gate to include volume
count for prioritizing nodes. The reason for introducing this
flag was its usefulness in Red Hat OpenShift Online environment
which is not being used any more. So, removing the flag
as it helps in maintainability of the scheduler code base
as mentioned at kubernetes#101489 (comment)

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

Considering PVC as a resource for balanced resource utilization in the scheduler

8 participants