Balanced resource allocation priority to include volume count on nodes. by ravisantoshgudimetla · Pull Request #60525 · kubernetes/kubernetes

ravisantoshgudimetla · 2018-02-27T21:34:38Z

Scheduler balanced resource allocation priority to include volume count on nodes.

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #58232

Release note:

Balanced resource allocation priority in scheduler to include volume count on node

ravisantoshgudimetla · 2018-03-01T22:08:56Z

/sig scheduling

wgliang · 2018-03-02T09:04:50Z

pkg/scheduler/algorithm/priorities/balanced_resource_allocation.go

Variance's value may be zero, there may crash, you'd better verify it.

The only way for it to happen is requested memory, cpu etc are 0's or < 0, in either case we won't be reaching any of the priority functions but I understand your point. I will add a check.

ravisantoshgudimetla · 2018-03-02T16:51:28Z

/cc @k82cn @bsalamat

ravisantoshgudimetla · 2018-03-04T04:14:15Z

/retest

bsalamat · 2018-03-05T19:22:50Z

pkg/features/kube_features.go

nit: I would rename to BalanceAttachedNodeVolumes and would update the comment to say that this impacts scheduling decisions and scheduler tries to place a Pod that requires an attached volume on a Node with fewer volumes.

bsalamat · 2018-03-05T22:00:19Z

pkg/scheduler/algorithm/priorities/balanced_resource_allocation.go

Please remove the blank line.

bsalamat · 2018-03-05T22:11:02Z

pkg/scheduler/algorithm/priorities/balanced_resource_allocation_test.go

Can't we have shared volumes?

cc/ @msau42

Do you care about counting bind mounts? There is always a bind-mount (done by the container runtime) for all the volumes referenced in the container spec.

At the pod level, you can share between pods, but the number of mounts depends on the volume type. Sharing between pods on the same node is not as common. Typically, scheduler tries to spread pods across nodes.

In the case of NFS, each pod has its own NFS mount.

For something like GCE PD (sharing should be a rare use case), you will have only one global node mount, but a bind mount per pod.

Volume types like EmptyDir, ConfigMap, Secrets, downward api are not mounted

Hostpath volumes can be mounted or not, we can't tell from the spec.

This is a function used in testing, thats why I did not bother checking that, if you think its important, I can add a check.

Thanks Ravi for pointing out. I missed the fact that it is a test only function. We don't need to change it.

bsalamat · 2018-03-05T22:27:01Z

pkg/scheduler/algorithm/predicates/predicates.go

NodeInfo is a part of scheduler cache, but we update the information in this predicate. This is not a good idea. If the prediate is not executed, the information in the cache gets stale. This could lead to subtle bugs. Why don't we update the values in NodeInfo.AddPod() and .RemovePod()?

Actually, we are not caching anything now. For example, say we chose node n1 during pod p1's scheduling. When we are scheduling next pod p2, we are going to recompute the volumes attached, max number of volumes again. As I mentioned in the comment, for lack of proper data structure through we can share information across predicates and priorities, I am using it. I don't say that this clear or better. I am open to suggestions.

Actually, we are not caching anything now.

I understand you don't need to cache anything and this is just a way of passing information around, but the nodeInfo pointer is a part of scheduler's cache. When you update the fields of the struct, you are effectively keeping the information in the scheduler's cache. The existance of such information in the scheduler cache would be confusing for someone who reads the code later.
Ideally, we should keep this in some temporary state that is valid in a single schedule cycle and is destructed at the end of the cycle, but we don't have such a structure and we need to add it if needed.

Ok let me think of some alternate ways to do this and get back to you. Thanks again for the review.

Agree with Bobby, prefer not to update NodeInfo outside of cache.

bsalamat · 2018-03-05T22:29:57Z

pkg/scheduler/algorithm/priorities/balanced_resource_allocation.go

variance could be larger than 1, causing 1 - variance (below) to be negative. I guess you have forgotten to divide by 3.

I guess you have forgotten to divide by 3.

🤣 Lol, thanks for catching it :)

variance could be larger than 1, causing 1 - variance (below) to be negative.

I thought about that, but it seems our sample set always has fractions and

fraction-mean(which is a fraction) would be fraction again.

square-root of (fraction-mean) will be further lower than original fraction

Not sure if I missed any edge case.

You are right, but given that you have not divided by 3, it could be larger than 1. For example, if each component is 0.9, then variance will be 0.81 + 0.81 + 0.81. If you divide by 3, variance will be less than 1.

So, you were commenting on my version of variance(without division by 3) I thought you were commenting on variance in general that it could be greater than 1(which is true as well). Sure, I will make that change.

Yes, I was referring to this implementation.

k82cn · 2018-03-06T03:46:23Z

pkg/scheduler/algorithm/priorities/balanced_resource_allocation.go

Here's the explanation of variance, not sure why we use it?

In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean.

So, we wanted to know how close CPU, memory utilization, no of volumes on nodes are to each other. The closer means the node is well balanced. I tried explaining the same here-#58232 (comment)

I could have used standard deviation as well but I thought of stopping at variance.

Great, got that.

Maybe standard deviation is better; it'll also address Bobby's comments above :)

Just deviding by 3 addresses my comment and since we want to compare one node against others, standard deviation and variance both work equally well here. Given that taking square root is computationally expensive, I would just use variance.

Yes, computational cost was the reason that I stopped going towards standard deviation.

ok to me :)

msau42 · 2018-03-06T05:21:30Z

pkg/scheduler/algorithm/predicates/predicates.go

This volume predicate only counts for 3 volume types, and only one of them can run in a single cluster.

GCE PD

AWS EBS

Azure disk

Did you also want to count mounts for other volume types not included here, like NFS, Gluster, Ceph?

Also, the limits here are more about how many disks can be attached to a single node, not how many could be mounted. So if you had multiple pods sharing the same volume on the same node, this predicate would only count 1 of them, even though there will be a mount for each pod. But multiple pods sharing the same volume on a single node is not the common use case.

In the future, it may be possible that you can have multiple volume types in the same cluster that each have their own max limit as well.

Thanks for the information @msau42. I thought we are going in the direction of using a configmap which consists of max number of volumes per node based on cloud provider and CSI plugin, after which we can use configmap to get the information.

I think the main challenge is that this predicate is only meant to restrict max number of attached volumes, not overall number of mounts. If you want to spread number of attached volumes, then this predicate will work.

But it sounds like you want to spread out number of mounts, including for non-attachable plugins.

If that's the case, then you may need to consider a different method of counting mounts, such as just counting the total number of volumes in assigned pods. But that still won't be completely accurate because each volume type may create a different number of mounts, like I mentioned earlier.

But it sounds like you want to spread out number of mounts, including for non-attachable plugins.

So, we wanted to find nodes that have balanced number of volumes here. Not sure where this PR gave the impression of spreading mounts. While finding node that has balanced mounts will be even more useful, I think it has to be dependent on lower level API talking to cloud provider or listing /etc/mtab in case of linux etc and is beyond the scope of what we want to do here.

Maybe improving the wording in the comments here to explicitly say we're balancing attachable volume cpunts would make it more clear.

Apologies if the comments are causing confusion. I thought I made it clear atleast from the start of PR, if you think otherwise, I can edit the existing comments. Please let me know, where the confusion is. Even if it is in code, I am happy to modify it.

Maybe we can be more explicit in some of the variable names. Like, instead of AllocableVolumes, something like AttachableVolumesLimit.

I don't think, AttachableVolumesLimit actually reflects the allocatable volumes on the node. My intention there was to have a variable which holds the value for number of allocatable volumes currently on the node. AttachableVolumesLimit may be more suited for c.maxVolumes. I don't have strong opinion on it though.

Oh sorry, yes, what about something like "CurrentAttachedVolumesCount"? I want to be explicit that we are talking about attachable counts here. There is another completely unrelated volume feature that uses the term "allocatable", so it would be better to align with the terminology that we use in the documentation and API fields.

Sure. I can do that.

k82cn · 2018-03-06T07:06:01Z

pkg/scheduler/algorithm/priorities/balanced_resource_allocation.go

is it possible to also merge volumes into Resource? so no interface changed.

I thought about it but it comes mostly under scalar resource and I think there is no definitive way on kubelet side to introspect node and get that information but it would be ideal to have that kind of information there. @bsalamat WDYT?

is it possible to also merge volumes into Resource? so no interface changed.

That would require updating the scheduler cache with the volume information and that in turn would require Kubelets to report volume counts, etc. as a part of allocatable resources. This would be a larger project requiring involvement of the node team. I won't go this path, if we can solve the problem without changes to the kubelet.

@gnufied is working on a design for dynamic node reporting of attachable volume limits per plugin. So I think when that is implemented, the limits will actually will be in the Node object.

I think when that is implemented, the limits will actually will be in the Node object.

That will be great and will make our logic cleaner.

So I think when that is implemented, the limits will actually will be in the Node object.

Thanks @msau42 . That will be awesome. Ohh btw, then we could actually cache in nodeinfo :).

That may related to our previous comments: did not update nodeInfo outside of cache :)

I think when that is implemented, the limits will actually will be in the Node object.

That will be great and will make our logic cleaner.

Great, @gnufied any related doc/issue?

@k82cn @bsalamat kubernetes/community#2051

ravisantoshgudimetla · 2018-03-08T06:06:10Z

@bsalamat @k82cn I have attempted to create a transient data structure which should live only for duration of a single scheduling cycle. Commit 7359fe2794051fdabab8abaa359a343d2fefe74c has it.

Sample usage of structure could be found at https://github.com/ravisantoshgudimetla/kubernetes/blob/c6b9f133fe381f91394df02b6cf0f4c41dd5d6fe/pkg/scheduler/algorithm/priorities/resource_allocation.go. Please let me know your thoughts on the approach.
cc @aveshagarwal

ravisantoshgudimetla · 2018-03-16T14:08:16Z

ping @bsalamat @k82cn. I will fix the test cases but before that want your opinions on the approach taken and if it makes sense. This could wait till 1.10 is out.

bsalamat · 2018-03-16T17:31:57Z

@ravisantoshgudimetla
Two issues that I see with the current approach:

It needs passing the transient node info to predicate and priority functions. This needs changes to many places including modules other than the scheduler. Unfortunately, I can't think of a clean option at the moment. One not-so-clean option is to add the new structure under NodeInfo with "Transient" in the name and a clear comment that the information in this struct is valid only during one schedule cycle.
Predicate and priority functions run in parallel for different nodes. So, reading and writing the transient node info requires synchronization.

ravisantoshgudimetla · 2018-03-16T19:54:39Z

Thanks @bsalamat

It needs passing the transient node info to predicate and priority functions. This needs changes to many places including modules other than the scheduler.

Thats why PR has grown to 39 files and counting :(.

One not-so-clean option is to add the new structure under NodeInfo with "Transient" in the name and a clear comment that the information in this struct is valid only during one schedule cycle.

I have started on similar lines but putting this struct in nodeinfo struct would be cleaner. I will use this approach.

Predicate and priority functions run in parallel for different nodes. So, reading and writing the transient node info requires synchronization.

Good point. Till now, most of fields in struct are being updated only at one location. I will make sure that we have locking in place for transient node info.

ravisantoshgudimetla · 2018-03-29T18:12:13Z

/retest

ravisantoshgudimetla · 2018-03-29T20:04:13Z

/retest

ravisantoshgudimetla · 2018-03-29T20:06:33Z

@bsalamat I addressed your comments. PTAL.

bsalamat · 2018-03-30T17:34:22Z

pkg/scheduler/schedulercache/cache.go

You do not need the "if" part.

In fact, given that TransientInfo is initialized with nodeInfo, please remove the whole else part and not initialize it here.

Done Bobby. PTAL.

bsalamat · 2018-03-30T21:30:28Z

pkg/scheduler/schedulercache/node_info.go

One last thing. This method is not needed. I would remove and inline the body in resetTransientSchedulerInfo to prevent future mistakes of using it directly (without locking).

Updated. Thanks. I did not want to have a big function which resets all the transient values, that was the reason, I separated that block into different function. But I understand your point of calling this function directly without acquiring lock.

bsalamat

/lgtm

Thanks, @ravisantoshgudimetla!

k8s-ci-robot · 2018-03-30T23:25:12Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~pkg/features/OWNERS~~ [bsalamat]
~~pkg/scheduler/OWNERS~~ [bsalamat]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ravisantoshgudimetla · 2018-03-31T00:03:26Z

/retest

k8s-github-robot · 2018-03-31T03:13:13Z

Automatic merge from submit-queue (batch tested with PRs 54997, 61869, 61816, 61909, 60525). If you want to cherry-pick this change to another branch, please follow the instructions here.

gnufied · 2018-05-03T16:48:01Z

@k82cn @msau42 @ravisantoshgudimetla @bsalamat I am sorry I did not get a chance to comment on this before, but one big problem with what we merged in this PR is - it will only work for Azure, AWS and GCE volume types. It will not work for any other volume type.

Also it sounds like this proposal set out to fix "balance PVCs on node", but in reality because it requires a maximum limit of volumes, it will only EVER work for volume types that have such upper limits. In a nutshell - this proposal will perhaps not work for something like glusterfs, iscsi etc for foreseeable future.

ravisantoshgudimetla · 2018-05-04T00:12:35Z

@gnufied I think this works for any volume type as long as max limit on volumes parameter is set.

it will only EVER work for volume types that have such upper limits.

I think the limit is coming from max number of volumes that could be attached to a node irrespective of volume types like glusterfs, iscsi etc. IOW, even if glusterfs provides unlimited volumes, there is a limit on the number of volumes that could be attached to a machine depending on Operating system running and filesystem used.

gnufied · 2018-05-04T15:12:21Z

The problem I was talking about is - since Kubernetes have no knowledge of applicable limits for volume types like glusterfs etc, the balanced volume allocation will not work for it. Even though, there indeed will be a theoretical limit for glusterfs etc (before it saturates the network), there is no way that information is exposed inside kubernetes. There is no way for admin to set those limits.

Discussed this offline with @jsafrane , @childsb and we think that - it may be possible to return a "dummy" upper limit for those volumes types (a high number such as 1000 lets say) - so as fractions that this proposal requires can still be calculated. Do we really care about real maximum volume count for purpose of calculating the fractions? It appears to me that - any high number should do the job and still this design will work? @ravisantoshgudimetla correct me if I am wrong.

kubernetes#60525 introduced Balanced attached node volumes feature gate to include volume count for prioritizing nodes. The reason for introducing this flag was its usefulness in Red Hat OpenShift Online environment which is not being used any more. So, removing the flag as it helps in maintainability of the scheduler code base as mentioned at kubernetes#101489 (comment)

k8s-ci-robot requested a review from aveshagarwal February 27, 2018 21:34

k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Mar 1, 2018

wgliang reviewed Mar 2, 2018

View reviewed changes

k8s-ci-robot requested review from bsalamat and k82cn March 2, 2018 16:51

ravisantoshgudimetla force-pushed the scheduler-pvc branch 2 times, most recently from 901f801 to 6b7b83b Compare March 3, 2018 23:31

ravisantoshgudimetla changed the title ~~[WIP] Balanced resource allocation priority to include volume count on nodes.~~ Balanced resource allocation priority to include volume count on nodes. Mar 3, 2018

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 3, 2018

ravisantoshgudimetla force-pushed the scheduler-pvc branch from 6b7b83b to e958943 Compare March 4, 2018 01:00

bsalamat reviewed Mar 5, 2018

View reviewed changes

k82cn reviewed Mar 6, 2018

View reviewed changes

msau42 reviewed Mar 6, 2018

View reviewed changes

k82cn reviewed Mar 6, 2018

View reviewed changes

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 8, 2018

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 12, 2018

ravisantoshgudimetla force-pushed the scheduler-pvc branch from f15e3f8 to 8aa1abb Compare March 22, 2018 19:50

k8s-ci-robot removed the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 22, 2018

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Mar 29, 2018

bsalamat reviewed Mar 30, 2018

View reviewed changes

ravisantoshgudimetla force-pushed the scheduler-pvc branch from 1026928 to bc3680e Compare March 30, 2018 18:21

bsalamat reviewed Mar 30, 2018

View reviewed changes

ravisantoshgudimetla added 2 commits March 30, 2018 19:11

Include volume count while doing balanced resource allocation

ba827f8

Build files generated

2aaf85d

ravisantoshgudimetla force-pushed the scheduler-pvc branch from bc3680e to 2aaf85d Compare March 30, 2018 23:11

k8s-ci-robot assigned bsalamat Mar 30, 2018

bsalamat reviewed Mar 30, 2018

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 30, 2018

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 30, 2018

k8s-github-robot merged commit 9847c8e into kubernetes:master Mar 31, 2018

ravisantoshgudimetla deleted the scheduler-pvc branch March 31, 2018 14:36

gnufied mentioned this pull request Sep 26, 2018

Add a design proposal for dynamic volume limits kubernetes/community#2051

Closed

ravisantoshgudimetla mentioned this pull request May 31, 2021

Remove balanced attached node volumes #102443

Merged

Conversation

ravisantoshgudimetla commented Feb 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ravisantoshgudimetla commented Mar 1, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ravisantoshgudimetla commented Mar 2, 2018

Uh oh!

ravisantoshgudimetla commented Mar 4, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ravisantoshgudimetla commented Feb 27, 2018 •

edited

Loading

ravisantoshgudimetla Mar 6, 2018 •

edited

Loading