Balanced resource allocation priority to include volume count on nodes. (#60525)
Conversation
/sig scheduling
Variance's value may be zero, which could cause a crash; you'd better verify it.
The only way for that to happen is if the requested memory, CPU, etc. are 0 or negative; in either case we won't reach any of the priority functions. But I understand your point, and I will add a check.
/retest
pkg/features/kube_features.go
nit: I would rename it to BalanceAttachedNodeVolumes and update the comment to say that this impacts scheduling decisions: the scheduler tries to place a Pod that requires an attached volume on a Node with fewer volumes.
Do you care about counting bind mounts? There is always a bind-mount (done by the container runtime) for all the volumes referenced in the container spec.
At the pod level, you can share between pods, but the number of mounts depends on the volume type. Sharing between pods on the same node is not as common. Typically, scheduler tries to spread pods across nodes.
- In the case of NFS, each pod has its own NFS mount.
- For something like GCE PD (sharing should be a rare use case), you will have only one global node mount, but a bind mount per pod.
- Volume types like EmptyDir, ConfigMap, Secret, and Downward API are not mounted.
- HostPath volumes can be mounted or not; we can't tell from the spec.
This is a function used only in testing; that's why I did not bother checking it. If you think it's important, I can add a check.
Thanks, Ravi, for pointing that out. I missed the fact that it is a test-only function. We don't need to change it.
NodeInfo is a part of the scheduler cache, but we update its information in this predicate. This is not a good idea: if the predicate is not executed, the information in the cache gets stale, which could lead to subtle bugs. Why don't we update the values in NodeInfo.AddPod() and .RemovePod()?
Actually, we are not caching anything now. For example, say we chose node n1 during pod p1's scheduling. When we schedule the next pod p2, we recompute the attached volumes and max number of volumes again. As I mentioned in the comment, for lack of a proper data structure through which we can share information across predicates and priorities, I am using NodeInfo. I don't claim this is clean or better; I am open to suggestions.
Actually, we are not caching anything now.
I understand you don't need to cache anything and this is just a way of passing information around, but the nodeInfo pointer is a part of the scheduler's cache. When you update the fields of the struct, you are effectively keeping the information in the scheduler's cache. The existence of such information in the scheduler cache would be confusing for someone who reads the code later.
Ideally, we should keep this in some temporary state that is valid for a single scheduling cycle and is destroyed at the end of the cycle, but we don't have such a structure and would need to add it if needed.
Ok let me think of some alternate ways to do this and get back to you. Thanks again for the review.
Agree with Bobby, prefer not to update NodeInfo outside of cache.
variance could be larger than 1, causing 1 - variance (below) to be negative. I guess you have forgotten to divide by 3.
I guess you have forgotten to divide by 3.
🤣 Lol, thanks for catching it :)
variance could be larger than 1, causing 1 - variance (below) to be negative.
I thought about that, but it seems our sample set always consists of fractions, and:
- fraction - mean (where the mean is also a fraction) would be a fraction again.
- the square of (fraction - mean) will be even lower than the original fraction.
Not sure if I missed any edge case.
You are right, but given that you have not divided by 3, it could be larger than 1. For example, if each squared deviation is 0.81 (a component of 0.9), the variance will be 0.81 + 0.81 + 0.81 = 2.43. If you divide by 3, the variance will be less than 1.
So, you were commenting on my version of the variance (without the division by 3); I thought you were commenting on variance in general being able to exceed 1 (which is true as well). Sure, I will make that change.
Yes, I was referring to this implementation.
Here's the definition of variance; I'm not sure why we use it:
In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean.
So, we wanted to know how close the CPU utilization, memory utilization, and number of volumes on a node are to each other. The closer they are, the better balanced the node is. I tried explaining the same here: #58232 (comment).
I could have used standard deviation as well, but I thought of stopping at variance.
Maybe standard deviation is better; it'll also address Bobby's comments above :)
Just dividing by 3 addresses my comment, and since we want to compare one node against others, standard deviation and variance work equally well here. Given that taking the square root is computationally expensive, I would just use variance.
Yes, computational cost was the reason I stopped short of going to standard deviation.
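To make the agreed fix concrete, here is a minimal Go sketch (function and constant names are illustrative, not the PR's actual code) of the balanced-resource score once the variance of the three usage fractions is divided by 3, keeping it in [0, 1] so that `1 - variance` stays non-negative:

```go
package main

import "fmt"

const maxPriority = 10

// balancedResourceScore computes the variance of the CPU, memory, and
// volume usage fractions (dividing by 3 so it stays within [0, 1]) and
// inverts it: a better-balanced node gets a higher score.
func balancedResourceScore(cpuFraction, memFraction, volumeFraction float64) float64 {
	mean := (cpuFraction + memFraction + volumeFraction) / 3
	variance := ((cpuFraction-mean)*(cpuFraction-mean) +
		(memFraction-mean)*(memFraction-mean) +
		(volumeFraction-mean)*(volumeFraction-mean)) / 3
	return (1 - variance) * maxPriority
}

func main() {
	// Perfectly balanced node: variance is 0, so it gets the full score.
	fmt.Println(balancedResourceScore(0.9, 0.9, 0.9)) // 10
	// A skewed node scores lower.
	fmt.Println(balancedResourceScore(0.9, 0.1, 0.5))
}
```

Without the division by 3, the first example would give a variance of 0 as well, but a heavily skewed set of fractions could push the variance above 1 and the score negative, which is the bug discussed above.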
This volume predicate only counts 3 volume types, and only one of them can run in a single cluster:
- GCE PD
- AWS EBS
- Azure disk
Did you also want to count mounts for other volume types not included here, like NFS, Gluster, Ceph?
Also, the limits here are about how many disks can be attached to a single node, not how many can be mounted. So if you had multiple pods sharing the same volume on the same node, this predicate would only count it once, even though there will be a mount for each pod. But multiple pods sharing the same volume on a single node is not the common use case.
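The attach-vs-mount distinction above can be sketched in Go (the `volume` type, the plugin names, and the helper are hypothetical stand-ins, not the predicate's real code): only the three attachable plugin types are counted, and a disk shared by several pods counts once.

```go
package main

import "fmt"

// volume is a minimal stand-in for a pod volume source; the real predicate
// inspects pod.Spec.Volumes and bound PersistentVolumes instead.
type volume struct {
	kind string // e.g. "gce-pd", "aws-ebs", "azure-disk", "nfs", "emptydir"
	id   string // unique disk identifier for attachable types
}

// countAttachedVolumes counts distinct attachable disks across a node's
// pods: non-attachable types (NFS, EmptyDir, ...) are ignored, and a disk
// referenced by multiple pods is counted only once.
func countAttachedVolumes(podVolumes [][]volume) int {
	attachable := map[string]bool{"gce-pd": true, "aws-ebs": true, "azure-disk": true}
	seen := map[string]bool{}
	for _, vols := range podVolumes {
		for _, v := range vols {
			if attachable[v.kind] {
				seen[v.kind+"/"+v.id] = true // dedupe shared disks
			}
		}
	}
	return len(seen)
}

func main() {
	pods := [][]volume{
		{{kind: "gce-pd", id: "disk-1"}, {kind: "emptydir", id: "scratch"}},
		{{kind: "gce-pd", id: "disk-1"}}, // same disk shared: counted once
		{{kind: "gce-pd", id: "disk-2"}},
	}
	fmt.Println(countAttachedVolumes(pods)) // 2
}
```

Counting mounts instead, as the comment notes, would require per-plugin knowledge of how many mounts each volume produces.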
In the future, it may be possible that you can have multiple volume types in the same cluster that each have their own max limit as well.
Thanks for the information @msau42. I thought we were going in the direction of using a ConfigMap containing the max number of volumes per node based on cloud provider and CSI plugin, after which we could use the ConfigMap to get the information.
I think the main challenge is that this predicate is only meant to restrict the max number of attached volumes, not the overall number of mounts. If you want to spread the number of attached volumes, then this predicate will work.
But it sounds like you want to spread out the number of mounts, including for non-attachable plugins.
If that's the case, then you may need to consider a different method of counting mounts, such as just counting the total number of volumes in assigned pods. But even that won't be completely accurate, because each volume type may create a different number of mounts, as I mentioned earlier.
But it sounds like you want to spread out number of mounts, including for non-attachable plugins.
So, we wanted to find nodes that have a balanced number of volumes here. I am not sure where this PR gave the impression of spreading mounts. While finding a node with balanced mounts would be even more useful, I think it would have to depend on a lower-level API talking to the cloud provider, or listing /etc/mtab on Linux, etc., and is beyond the scope of what we want to do here.
Maybe improving the wording in the comments here to explicitly say we're balancing attachable volume counts would make it more clear.
Apologies if the comments are causing confusion. I thought I made it clear at least from the start of the PR; if you think otherwise, I can edit the existing comments. Please let me know where the confusion is. Even if it is in the code, I am happy to modify it.
Maybe we can be more explicit in some of the variable names. Like, instead of AllocableVolumes, something like AttachableVolumesLimit.
I don't think AttachableVolumesLimit actually reflects the allocatable volumes on the node. My intention was to have a variable that holds the number of allocatable volumes currently on the node. AttachableVolumesLimit may be better suited for c.maxVolumes. I don't have a strong opinion on it, though.
Oh sorry, yes. What about something like "CurrentAttachedVolumesCount"? I want to be explicit that we are talking about attachable counts here. There is another, completely unrelated volume feature that uses the term "allocatable", so it would be better to align with the terminology we use in the documentation and API fields.
Sure. I can do that.
Is it possible to also merge volumes into Resource, so no interface is changed?
I thought about it, but this falls mostly under scalar resources, and I think there is no definitive way on the kubelet side to introspect the node and get that information, though it would be ideal to have that kind of information there. @bsalamat WDYT?
Is it possible to also merge volumes into Resource, so no interface is changed?
That would require updating the scheduler cache with the volume information, and that in turn would require kubelets to report volume counts, etc. as part of allocatable resources. This would be a larger project requiring involvement of the node team. I wouldn't go down this path if we can solve the problem without changes to the kubelet.
@gnufied is working on a design for dynamic node reporting of attachable volume limits per plugin. So I think when that is implemented, the limits will actually be in the Node object.
I think when that is implemented, the limits will actually be in the Node object.
That will be great and will make our logic cleaner.
So I think when that is implemented, the limits will actually be in the Node object.
Thanks @msau42. That will be awesome. Oh, btw, then we could actually cache it in NodeInfo :).
That may be related to our previous comments: do not update nodeInfo outside of the cache :)
I think when that is implemented, the limits will actually be in the Node object.
That will be great and will make our logic cleaner.
Great, @gnufied any related doc/issue?
@bsalamat @k82cn I have attempted to create a transient data structure which lives only for the duration of a single scheduling cycle. Commit 7359fe2794051fdabab8abaa359a343d2fefe74c has it. A sample usage of the structure can be found at https://github.com/ravisantoshgudimetla/kubernetes/blob/c6b9f133fe381f91394df02b6cf0f4c41dd5d6fe/pkg/scheduler/algorithm/priorities/resource_allocation.go. Please let me know your thoughts on the approach.
@ravisantoshgudimetla
Thanks @bsalamat.
That's why the PR has grown to 39 files and counting :(.
I had started along similar lines, but putting this struct in the NodeInfo struct would be cleaner. I will use this approach.
Good point. Until now, most of the fields in the struct are updated only at one location. I will make sure that we have locking in place for the transient node info.
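The transient, per-cycle structure discussed above can be sketched in Go (struct and field names are hypothetical, not the PR's actual code): scratch values written by predicates are read by priorities within one scheduling cycle, guarded by a mutex, and reset before the next pod is scheduled.

```go
package main

import (
	"fmt"
	"sync"
)

// transientSchedulerInfo holds per-cycle scratch state hung off the
// scheduler's per-node info. It is not part of the durable cache: it only
// carries values from predicates to priorities within one scheduling cycle.
type transientSchedulerInfo struct {
	sync.Mutex
	requestedVolumes   int // written by the volume predicate
	allocatableVolumes int // read by the balanced-resource priority
}

// resetTransientSchedulerInfo zeroes the scratch fields at the end of a
// scheduling cycle. The reset is done inline under the lock so callers
// cannot accidentally bypass locking.
func (t *transientSchedulerInfo) resetTransientSchedulerInfo() {
	t.Lock()
	t.requestedVolumes = 0
	t.allocatableVolumes = 0
	t.Unlock()
}

func main() {
	info := &transientSchedulerInfo{}

	// A predicate records volume information during the cycle.
	info.Lock()
	info.requestedVolumes = 3
	info.allocatableVolumes = 39
	info.Unlock()
	fmt.Println(info.requestedVolumes, info.allocatableVolumes) // 3 39

	// At the end of the cycle, the scratch state is cleared.
	info.resetTransientSchedulerInfo()
	fmt.Println(info.requestedVolumes, info.allocatableVolumes) // 0 0
}
```

Embedding the mutex and inlining the reset under the lock matches the locking concerns raised in the review.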
/retest
/retest
@bsalamat I addressed your comments. PTAL.
You do not need the "if" part.
In fact, given that TransientInfo is initialized with nodeInfo, please remove the whole else part and do not initialize it here.
Done Bobby. PTAL.
One last thing: this method is not needed. I would remove it and inline the body in resetTransientSchedulerInfo to prevent future mistakes of using it directly (without locking).
Updated, thanks. I did not want one big function that resets all the transient values; that was the reason I separated that block into a different function. But I understand your point about calling this function directly without acquiring the lock.
bsalamat left a comment:
/lgtm
Thanks, @ravisantoshgudimetla!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bsalamat, ravisantoshgudimetla. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/retest
Automatic merge from submit-queue (batch tested with PRs 54997, 61869, 61816, 61909, 60525). If you want to cherry-pick this change to another branch, please follow the instructions here.
@k82cn @msau42 @ravisantoshgudimetla @bsalamat I am sorry I did not get a chance to comment on this before, but one big problem with what we merged in this PR is that it will only work for Azure, AWS, and GCE volume types. It will not work for any other volume type. It also sounds like this proposal set out to fix "balance PVCs on node", but in reality, because it requires a maximum limit of volumes, it will only EVER work for volume types that have such upper limits. In a nutshell, this proposal will perhaps not work for something like glusterfs, iscsi, etc. for the foreseeable future.
@gnufied I think this works for any volume type as long as the max-limit-on-volumes parameter is set.
I think the limit comes from the max number of volumes that can be attached to a node, irrespective of volume type (glusterfs, iscsi, etc.). IOW, even if glusterfs provides unlimited volumes, there is a limit on the number of volumes that can be attached to a machine, depending on the operating system and filesystem used.
The problem I was talking about is: since Kubernetes has no knowledge of applicable limits for volume types like glusterfs, balanced volume allocation will not work for them. Even though there will indeed be a theoretical limit for glusterfs etc. (before it saturates the network), there is no way that information is exposed inside Kubernetes, and no way for an admin to set those limits. I discussed this offline with @jsafrane and @childsb, and we think it may be possible to return a "dummy" upper limit for those volume types (a high number, such as 1000 let's say) so that the fractions this proposal requires can still be calculated. Do we really care about the real maximum volume count for the purpose of calculating the fractions? It appears to me that any high number should do the job and this design will still work. @ravisantoshgudimetla correct me if I am wrong.
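The dummy-limit idea above can be sketched in Go (the function name and the 1000 fallback are illustrative assumptions, not a merged design): the priority only needs a ratio of attached volumes to the node's limit, so plugins without a known limit can fall back to an arbitrary high bound and still produce comparable fractions across nodes.

```go
package main

import "fmt"

// volumeFraction returns the ratio of attached volumes to the node's
// volume limit. For plugins with no reported limit (limit <= 0), it falls
// back to a dummy upper bound of 1000, an arbitrary high number chosen
// only so that a usable fraction can still be computed.
func volumeFraction(attached, limit int) float64 {
	if limit <= 0 {
		limit = 1000 // dummy limit for plugins like glusterfs or iscsi
	}
	return float64(attached) / float64(limit)
}

func main() {
	// glusterfs-like volume type with no real limit exposed.
	fmt.Println(volumeFraction(10, 0)) // 0.01
	// EBS-like node with a known 39-volume attach limit.
	fmt.Println(volumeFraction(10, 39))
}
```

Since the balancing score only compares fractions between nodes, any consistent high bound keeps the comparison meaningful, which is the point being made above.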
kubernetes#60525 introduced the BalanceAttachedNodeVolumes feature gate to include volume count when prioritizing nodes. The reason for introducing this flag was its usefulness in the Red Hat OpenShift Online environment, which is no longer in use. So, removing the flag helps the maintainability of the scheduler code base, as mentioned at kubernetes#101489 (comment).
Scheduler balanced resource allocation priority to include volume count on nodes.
/cc @aveshagarwal @abhgupta
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format; will close the issue(s) when PR gets merged): Fixes #58232
Release note: