proposal for live and in-place vertical scaling #1719
YoungjaeLee wants to merge 3 commits into kubernetes:master from
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Sorry, please disregard my previous comment. I thought this was another VPA design.
> Also, the 'API server' restores the resource requirements of each container of the PodSpec to the original and writes the revised PodSpec to etcd to communicate with the Scheduler.
> This is because at this moment the PodSpec on etcd shouldn't be updated with the new resource requirements.

> For a pod with ResizeRequested, the 'Scheduler' checks if the node on which the pod currently runs has enough resources to resize the pod.
This is slightly more complex. With the addition of priorities, a resource request change for a high priority pod should preempt low priority pods if needed.
See https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/ for more details.
Thanks for pointing that out. Currently, pod priority isn't covered in this proposal because it was not available when we started this work, but we're planning to support it. Basically, we think that in vertical scaling the scheduler should also follow the current preemption policy, as you mentioned. I'm now looking at the source code to understand how the scheduler works with pod priority, and once we figure out how to support pod priority in vertical scaling, we'll update the proposal accordingly.

> type ResizeRequest struct {
>     RequestStatus ResizeStatus
>     NewResources  []ResourceRequirements // indexed by containers' index
This should not rely on container index. Use the container name.
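For illustration only, a sketch of the name-keyed alternative suggested here. The type and field names below are simplified stand-ins, not the proposal's actual API: keyed by container name, a resize request lookup cannot be misdirected by container reordering.

```go
package main

import "fmt"

// resourceRequirements is a simplified stand-in for the real
// core/v1 ResourceRequirements type.
type resourceRequirements struct {
	CPUMillis int64
	MemoryMiB int64
}

// resizeRequest keyed by container name rather than by index, so that
// reordering or injecting containers in the pod spec cannot misdirect
// a resize to the wrong container.
type resizeRequest struct {
	NewResources map[string]resourceRequirements
}

func main() {
	req := resizeRequest{NewResources: map[string]resourceRequirements{
		"app": {CPUMillis: 500, MemoryMiB: 256},
	}}
	// Lookup by name is order-independent.
	r, ok := req.NewResources["app"]
	fmt.Println(ok, r.CPUMillis)
}
```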

> If used, this would be included as part of the patch or appropriate update command providing the spec update for the resize.
> It would indicate the preference of the user at the time of resize.
> Specifically, Restart for `resizeAction` would indicate that the pod be restarted for the corresponding resizing of resource(s), LiveResize would indicate that the pod not be restarted and the resize be realized live, and LiveResizePreferred would indicate that the resize be realized preferably live, but if that fails for any reason, accomplished with a restart.
The preferred method should go in ResizeRequest.
The preferred method is specified by clients, like a user or a controller. But 'ResizeRequest' is not a field that is intended to be changed by clients. It holds the new values of ResourceRequirements and the ResizeStatus, which are managed by the apiserver. Could you tell me why you think the preferred method should go in ResizeRequest?
> ResizedAccepted and ResizedRejected mean that the requested resource resizing was accepted or rejected, respectively, by the Scheduler.
> NewResources is an array indexed by container index, and each of its entries holds the new resource requirements of a container that needs to be resized.

> Given a new PodSpec with new resource requirements from a client, first the 'API server' validates it.
Who removes/cleans the ResizeRequest in a pod, and when?
The same doubt here: will the kubelet remove it, or will the custom controller?
In the current implementation, after resizing is accepted by the scheduler, the apiserver clears NewResources in the ResizeRequest. But the ResizeStatus remains "ResizedAccepted".
> type Resizing struct {
>     metav1.TypeMeta
>     metav1.ObjectMeta
>     Request ResizeRequest
I guess we should follow the spec+status approach in all API objects.
For me, I would regard Resizing as a kind of action rather than a resource; it's a little bit tricky to define this struct.
And we may build a command like kubectl resize?
As @adohe mentioned, "Resizing" is a kind of action, like Binding, which has the same fields.
We think that resizing is just another pod spec update, so we prefer to utilize the current pod spec update methods like kubectl apply, patch, etc. instead of introducing a new kubectl command.
> We think that resizing is just another pod spec update, so prefer to utilize the current podspec update methods like kubectl apply, patch, etc. instead of introducing a new kubectl command.

I agree with this.
currently, the fact that a node is running a pod implies it has accepted the resource requests the pod made.
this is proposing introducing a state in which a node is running a pod with one set of resources, but may or may not have accepted the pod's currently requested resources. that seems to imply:
- resources should become part of the pod/container status
- when spec....resources do not match status...resources, then the node has not seen/accepted/adjusted to the newly requested resources
- the node should have some way in status to indicate it has seen, but has refused to accept, the newly requested resources, and why
@liggitt I would think that container status should include current resources; I still don't understand why the CRI ContainerStatus definition doesn't include container resources.
> }
> ```

> Resizing has the metadata of a pod to resize and a value of ResizeRequest that holds the status of a resizing request, which indicates whether the resizing is feasible or not, and the new resource requirements of the pod.
Why is there a ResizeRequest in PodSpec and also as a separate object?
I think I don't get it. Are you asking why we need ResizeRequest in Resizing when it is already there in PodSpec?
Yes, why is it in both places.
> ConditionRequested ConditionStatus = "Requested"
> ConditionAccepted ConditionStatus = "Accepted"
> ConditionRejected ConditionStatus = "Rejected"
> ConditionDone ConditionStatus = "Done"
How long is the pod status kept?
ConditionRequested and ConditionAccepted are intermediate states. So, if the scheduler and kubelet are operational, the pod status will change to ConditionRejected or ConditionDone. Once the pod status becomes ConditionRejected or ConditionDone, it won't change unless a new resizing request comes in.
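The lifecycle described in this exchange can be sketched as a small transition table. Illustrative only: the condition names follow the quoted constants, and the table itself is an assumption drawn from the comment above, not code from the proposal.

```go
package main

import "fmt"

// validNext encodes the lifecycle described above: Requested is resolved by
// the scheduler to Accepted or Rejected; Accepted is resolved by the kubelet
// to Done; Rejected and Done are terminal until a new resize request arrives.
var validNext = map[string][]string{
	"Requested": {"Accepted", "Rejected"},
	"Accepted":  {"Done"},
	"Rejected":  {},
	"Done":      {},
}

// canTransition reports whether moving from one condition to another is
// allowed under the sketch above.
func canTransition(from, to string) bool {
	for _, n := range validNext[from] {
		if n == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("Requested", "Accepted")) // scheduler accepts
	fmt.Println(canTransition("Accepted", "Done"))      // kubelet realizes
	fmt.Println(canTransition("Done", "Accepted"))      // terminal until a new request
}
```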
derekwaynecarr left a comment:
I want this feature, but I prefer we tackle this in 1.11+ once hugepages, device plugins, and exclusive cores are finished. I am also not sure I like that the original resource requirement values are lost before the kubelet acknowledges that a resize is complete. I wonder if it would work better to add a new resource requirement section that allows edits, plus a new subresource that only the kubelet is allowed to use to change the original resource requirement once realized.
> * LiveResizeable.

> This attribute will be available per resource (such as cpu, memory), and so is adequate to indicate whether the workload can handle, and prefers, a change in each resource's allocation without restarting.
> With potentially multiple containers and multiple resizeable resources for each in a Pod, the response to an update of the pod spec will be determined by a precedence order among the attribute values, with RestartOnly dominating LiveResizeable; i.e., if two resources have been resized in the update to the spec and one of them has a policy of RestartOnly, then the pod would be restarted to realize both updates.
For clarity, can you distinguish Container restart versus pod restart? From the yaml below, I think you are referring to Container restart since that is all that has resource requirements. Pod restart terminology to me brings extra baggage for things like init containers.
I am also wondering this: a Pod is a collection of containers. If we just adjust resources for container A, shall we restart the Pod or just restart container A?
Actually, our prototype implementation supports container restart (at pod-level resizing). But we thought that in this proposal it is better to focus on adding a live-and-in-place resizing feature at the StatefulSet level, because going with container restart as well seems to introduce too many things at one time. But we definitely see the value of container restart, so if most of you agree/want to add a container restart option, we're happy to do that.
I could see more value if we could support container-level restart.
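The precedence rule quoted above (RestartOnly dominates LiveResizeable across the resources changed in one update) boils down to a single scan over the changed resources. A minimal sketch, with the policy names taken from the quoted text and everything else assumed:

```go
package main

import "fmt"

// ResizePolicy mirrors the per-resource attribute from the proposal.
type ResizePolicy string

const (
	LiveResizeable ResizePolicy = "LiveResizeable"
	RestartOnly    ResizePolicy = "RestartOnly"
)

// effectiveAction applies the precedence rule: if any resized resource
// carries RestartOnly, the whole resize is realized by restart; only when
// every changed resource is LiveResizeable can the resize happen live.
// (Sketch only; the real kubelet logic would differ.)
func effectiveAction(changed map[string]ResizePolicy) string {
	for _, p := range changed {
		if p == RestartOnly {
			return "Restart"
		}
	}
	return "LiveResize"
}

func main() {
	fmt.Println(effectiveAction(map[string]ResizePolicy{
		"cpu": LiveResizeable,
	})) // live resize is possible

	fmt.Println(effectiveAction(map[string]ResizePolicy{
		"cpu":    LiveResizeable,
		"memory": RestartOnly,
	})) // memory's RestartOnly dominates: both updates realized via restart
}
```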
> memory: RestartOnly
> ```

> For the above example, if there is a change to the cpu request or limit, it can be vertically scaled only if the memory request and limit remain the same; otherwise, the RestartOnly policy for memory would override the policy for cpu, and the Pod (or container, if container-alone restart is allowed) would need to be restarted.
I am inclined to deny a resize request that changes the pod's QoS class. Noting here also concerns about desired Guaranteed-pod characteristics for NUMA, exclusive cores (if on a node with the static CPU manager), hugepages, and devices. It seems that for some of those options, RestartOnly is the only viable option.
Container storage is another area
For exclusive cores, why should RestartOnly be the only viable option?
What if I update the pod spec to change its QoS class from BestEffort to Guaranteed?
As mentioned below in the 'Related issues' section, QoS class change by live-and-in-place resizing is not supported, because changing the QoS class of a pod requires changing the parent directory of the pod's cgroup directory, which is not possible.
For NUMA, the CPU manager, and other features introduced after 1.7, we're still working to make sure that our vertical scaling feature works well with them. Could you provide some insight into problems that might come up?
> limits:
>   cpu: 1000m
>   memory: 1Gi
> resizePolicy:
I think for backwards compatibility we may want to assume a default resize policy is “no” for all resources unless explicitly specified.
IIUC, a default RestartOnly resize policy should still keep backwards compatibility.
This is about a resize policy for pods associated with a StatefulSet. So, as @adohe mentioned, I think the default should be restart.
Do we need compat here? Resources are not mutable today, right? I'd hope we can make resize-in-place-if-possible the default.
For now, resources are immutable, and we need to soften the validation if we decide to support resize.
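To make the default-policy debate concrete, here is a sketch of a container resources section with an explicit per-resource policy, in the shape the proposal's own yaml fragments use. The field names follow the quoted fragments; the surrounding values are illustrative, and which default applies when `resizePolicy` is omitted is exactly what is being discussed here.

```yaml
# Illustrative only: explicit per-resource resize policies.
# If resizePolicy were omitted, the default under discussion would apply
# (RestartOnly for backwards compatibility vs. in-place-if-possible).
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi
resizePolicy:
  cpu: LiveResizeable
  memory: RestartOnly
```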
> ## Desired Approach

> Any valid method to update the pod spec should be applicable for vertical scaling, e.g., using the kubectl commands set, patch, apply, edit.
> Logs associated with the pod will capture the failure/success of the resize command.
I am not sure I follow this. Can you elaborate?
I am also confused about this. Use logs to capture the resize command result? Any consideration of PodCondition or Events?
Basically, we see resizing a pod as just another pod spec update. So resizing is requested via the existing spec update methods like kubectl patch or apply, which are currently used to change StatefulSet specs.
What we mean by 'logs' is the general logs/messages generated by k8s components (like the apiserver, kubelet, scheduler, and so on) that we check with kubectl describe. And yes, the status of resizing can also be checked with a new PodCondition, called PodResized.
> Any valid method to update the pod spec should be applicable for vertical scaling, e.g., using the kubectl commands set, patch, apply, edit.
> Logs associated with the pod will capture the failure/success of the resize command.
> The controller will continue to attempt the update to the spec while there is a difference between the current size and the size in the updated spec.
> If an update is partially successful, the user can learn this from the logs and attempt to rectify the situation by submitting new updates (that restore the original size(s) or go for a size feasible on all the nodes).
Is this their literal Container application log?
No, it's logs/messages generated by k8s components.
> The pod condition PodResized represents the status of the resizing process.
> The PodResized condition is updated by the `Kubelet` according to the ResizeStatus, which is updated by the 'API server'.

> Basically, when the ResizeStatus changes, the `Kubelet` updates the PodResized condition accordingly.
What happens if it's rejected? Is the resource reset, or does the resource look consumed to the scheduler?
In the current design, the rejection here is made only by the scheduler. So resource allocation/consumption isn't changed at this point (because the resizing is rejected); the kubelet just updates the PodResized condition accordingly.

> * **A new additional hash, called `expectedHashNoResources`, added for `Kubelet` to detect a change to resource requirements**

> In order to watch resource requirement changes efficiently, a new additional hash is added to kubecontainer.ContainerStatus (and it is also stored as one of the container's labels).
I am confused here. It feels like we are trying to combine two pieces of information in a single field, and this hash is the outcome. We are basically saying we need a desired resource requirement (what we have today and what we are looking to change) and an actual enforced resource requirement which kubelet acknowledged.
Uses of persisted hashes have proven problematic between releases and in the face of version skew. Are we sure we want to add another one?
I understand that the original hash was introduced to detect pod spec changes efficiently, avoiding field-by-field comparisons. I want to keep this behavior, meaning no additional overhead in pod spec change detection even when the resource requirements in a pod spec are mutable. So, with this new hash, changes to the other fields in a pod spec can still be detected by a simple hash comparison.
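The idea of a resources-excluding hash can be sketched as follows. This is illustrative only: the struct, the zeroing, and the hashing scheme are assumptions for the sketch, not the kubelet's actual hashing (which the review notes has version-skew pitfalls of its own).

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// container is a simplified stand-in for a pod spec's container entry.
type container struct {
	Name      string
	Image     string
	CPUMillis int64
	MemoryMiB int64
}

// hashNoResources sketches the proposed expectedHashNoResources idea:
// hash the container spec with resource fields zeroed, so spec-change
// detection stays a single hash comparison even though resources are
// now mutable.
func hashNoResources(c container) string {
	c.CPUMillis, c.MemoryMiB = 0, 0 // exclude resources from the hash
	b, _ := json.Marshal(c)
	return fmt.Sprintf("%x", sha256.Sum256(b))
}

func main() {
	a := container{Name: "app", Image: "nginx:1.25", CPUMillis: 500, MemoryMiB: 256}
	b := a
	b.CPUMillis = 1000 // a resize-only change...
	fmt.Println(hashNoResources(a) == hashNoResources(b)) // ...does not change the hash

	b.Image = "nginx:1.26" // any other spec change...
	fmt.Println(hashNoResources(a) == hashNoResources(b)) // ...is still detected
}
```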

> # Related issues

> 1. QoS class change by resize is not supported.
Ok, we should call this out earlier in the doc :-)

> 2. Memory resizing to change a request value might not take effect for Burstable pods.

> For Burstable pods, the request value for the memory resource determines the value of a score for the OOM killer, but Docker doesn't support dynamically changing the score of an existing container.
Aside from Docker, we need to think of other runtimes and platforms, like Windows.
> So, a change to the memory request value doesn't take effect on the OOM killer's behavior for Burstable pods.
> But for Guaranteed and Best-Effort pods this is not an issue, because the score is fixed regardless of the memory request value (in this case, the memory request value is used only for admission control by the Scheduler).

> 3. Memory resizing to decrease its limit may fail on the Kubelet in some circumstances.
We have a control loop that you can enable at the QoS tier to keep trying to induce reclaim. I worry that these issues unless really well understood will cause confusion for users and operators.

> Resources needed by containers can change over time for a variety of reasons: moving from live-test mode to production usage, or changes in user load or dataset sizes, each of which again might come about for a variety of reasons.
> `Statefulset` supports the capability to change the Request and Limit values specified for a container through the supported pod spec update methods.
> However, they currently require that the pods be restarted to run with the new resource sizes.
IIRC, if you change the Requests and Limits values for a container of a StatefulSet, it will kill the old pod and create a new one with the new resource requirements.
Yes, the current StatefulSet controller recreates pods if their resource allocation is changed. So, in this proposal, we modify the controller to leverage the proposed live resizing feature.
> resizePolicy:
>   cpu: LiveResizeable
>   memory: RestartOnly
> ```
Also, any consideration of extended resources?
No, for now, other resources are not under consideration.
> resizePolicy:
>   cpu: LiveResizeable
>   memory: RestartOnly
> ```
I am a little bit confused about the resizeAction annotation. In which case should I respect this value, and how does resizeAction coordinate with resizePolicy?
The resizePolicy is a characteristic of the resource that describes whether it can be resized live or not. The resizeAction is the action a user/client wants taken when resizing the resource. So, for example, even though the cpu resource is specified LiveResizeable, if for some reason a user specifies the resizeAction for cpu as Restart, resizing of cpu is done by restart. Of course, for a resource whose resizePolicy is RestartOnly, its resizeAction cannot be LiveResize/LiveResizePreferred.
I see your point; thanks for the kind explanation.
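The policy/action constraint explained in this exchange amounts to a small validation check. A sketch under stated assumptions: the policy and action value names come from the thread, but the `Action`-prefixed constant names and the function are mine, not the proposal's API.

```go
package main

import "fmt"

type ResizePolicy string
type ResizeAction string

const (
	LiveResizeable ResizePolicy = "LiveResizeable"
	RestartOnly    ResizePolicy = "RestartOnly"

	ActionRestart             ResizeAction = "Restart"
	ActionLiveResize          ResizeAction = "LiveResize"
	ActionLiveResizePreferred ResizeAction = "LiveResizePreferred"
)

// validAction sketches the constraint described above: Restart is always
// permitted (even for a LiveResizeable resource, if the user asks for it),
// but a live action is only valid for a LiveResizeable resource.
func validAction(policy ResizePolicy, action ResizeAction) bool {
	if action == ActionRestart {
		return true
	}
	return policy == LiveResizeable
}

func main() {
	fmt.Println(validAction(LiveResizeable, ActionLiveResize)) // live resize allowed
	fmt.Println(validAction(LiveResizeable, ActionRestart))    // user may still force restart
	fmt.Println(validAction(RestartOnly, ActionLiveResize))    // rejected: policy forbids live
}
```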

> ## Desired Approach

> Any valid method to update the pod spec should be applicable for vertical scaling, e.g., using the kubectl commands set, patch, apply, edit.
which means we need to broaden pod update validation.
Yes, I modified pod spec update validation accordingly.

> Any valid method to update the pod spec should be applicable for vertical scaling, e.g., using the kubectl commands set, patch, apply, edit.
> Logs associated with the pod will capture the failure/success of the resize command.
> The controller will continue to attempt the update to the spec while there is a difference between the current size and the size in the updated spec.
Here, does 'current size' mean the cached pod spec?
> ResizeAccepted ResizeStatus = "Accepted"
> ResizeRejected ResizeStatus = "Rejected"
> ResizeNone ResizeStatus = "None"
> )
I suggest adding some comments here; I have no idea what ResizeNone means.
This embedded "transaction" is a very different concept from most everything else, isn't it? Do we have precedent for this? It needs to be thoroughly considered for the API.
To be honest, I could not find any precedent; this is just what I worry about. @deads2k could you please give your thoughts here?
I feel more like the semantic is "the resource request for this pod WILL change, or the pod will die". The transaction is not needed.
> }
> ```

> ResizeRequest has two variables, RequestStatus and NewResources.
RequestStatus contained in the spec? Seems like an anti-design.
@bgrant0607 @smarterclayton @deads2k would you please take a look at this proposal? I think we did see real needs for in-place Pod resource updates.
> A new API, Resizing, for the 'scheduler' is introduced:

> ```go
> // Resizing resizes the resources allocated to a pod
If Resizing just resizes pod resources, this could be a subresource of Pod.

> # API and Usage

> To express the policy for resizing in a pod spec, we introduce the resource attribute `resizePolicy` with the following choices of value:
Maybe I missed the discussion, but do we really need this to be so configurable? Or can we make the default "say nothing" be the best? E.g. If the user says nothing, always try to resize in place. If that can't be achieved, which can't always be known statically, then do a pod restart.
If the user NEEDS a restart, they can specify that, but make it the abnormal case.
?
Depending on the nature of workloads and resources, the feasible policy differs, so the policy needs to be configurable by users. For example, we can't dynamically resize the memory allocated to a JVM, so in that case the policy needs to be restartOnly. Also, some workloads don't tolerate restarts, so in such cases the policy would be configured as LiveResizeOnly.

> Resizing has the metadata of a pod to resize and a value of ResizeRequest that holds the status of a resizing request, which indicates whether the resizing is feasible or not, and the new resource requirements of the pod.

> Once the 'Scheduler' determines whether a resource resizing on a pod is feasible or not, it notifies the API server via this Resizing API.
Could we add a policy where the scheduler just waits for another retry to determine feasibility instead of rejecting immediately? The user could specify this policy in ResizeRequest.
Thanks for doing this. Overall LGTM; the only thing I'd suggest is to add VPA as one of the primary use cases. We would really benefit from this, a lot.
[APPROVALNOTIFIER] This PR is NOT APPROVED.
Replaced the action options LiveResize and LiveResizePreferred with new ones, InPlaceResize and InPlaceResizePreferred. Revised the process of pod resizing so that a pod spec is updated with the new resource configuration only after the Kubelet confirms that the actual change of resource allocation was successfully made at the container level.
Force-pushed from 89fb419 to a9aab4e.
@YoungjaeLee any more progress on this? I am revisiting this proposal, thinking about whether the scheduler needs to participate in the resize process; the kubelet will do this check before accepting the pod resource update.
@adohe, we have an implementation that is consistent with the proposal as currently described on GitHub and that works with the new priority feature. The scheduler does participate in the resizing to ensure that a consistent picture of cluster resources exists at the scheduler at all times. However, we haven't seen the clearance/go-ahead from @derekwaynecarr, though @YoungjaeLee has responded to his queries and updated the proposal accordingly. @derekwaynecarr, now that 1.11 is out, I hope this proposal can be taken up and our implementation examined for inclusion.
Is this proposal still alive?

@thockin, yes, this is an active proposal. We have addressed the changes requested earlier and incorporated them in our prototype available on GitHub, and are waiting on clearance/acceptance to know the next steps.
What's the state of this feature? Is there already some partial support for this in 1.12?

@fabiand, update of the resource requests and limits with restart is supported, but not live, in-place update.
Thanks @karthickrajamani. Are there still plans to do it in-place?

Yes, we currently have a working prototype and would like to see this show up in 1.13, or 1.14 at the latest.
That's cool. We are looking forward to this and aim at doing KubeVirt VM CPU and memory hot-plugging based on this feature; just FYI.
Hi there, @YoungjaeLee, @karthickrajamani. First of all, thanks for this proposal. I'm part of the Vertical Pod Autoscaler team. I really like your proposal, but we'd also like to make it a bit more generic, additionally taking into account ideas from https://docs.google.com/document/d/18K-bl1EVsmJ04xeRq9o_vfY2GDgek6B6wmLjXw-kos4/edit?ts=5b96bf40 (see also https://groups.google.com/forum/#!msg/kubernetes-sig-scheduling/UnIhGOKpohI/VtUfVWgFBwAJ). To that end, we've come up with a third proposal, and I've started a KEP (#2908) to make the whole process more formal and visible. The core idea behind the proposal in the KEP is to make PodSpec mutable with regard to Resources, denoting desired resources. Additionally, PodStatus is extended to provide information about actual resource allocation. I'd love for you to have a look there and let me know what you think of our idea, in particular whether it fits your use case. Oh, and if you'd like to co-author that KEP, it would be great; after all, the KEP builds on this very PR. Regards,
Please close this and either fold it into the related KEP or create a new KEP in the enhancements repo.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@bgrant0607: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This is the proposal for live and in-place vertical scaling.
The related feature is at kubernetes/enhancements#21.
Also, the original proposal that has been presented at the resource-management WG is at https://docs.google.com/document/d/1Q_Aq4khL2Kjbvmwzok8jukF2HQl31LyM65jK93yfTII/edit?ts=59834e18#heading=h.qb2t0ak4gvki.