Delay deletion of pod from the API server until volumes are deleted#40239
Conversation
It can be in |
|
This PR mostly LGTM. Can you add the e2e test you created for this purpose to this PR? |
|
@vishh This change is tested by any test that creates or deletes pods. I created the test to make testing my change quicker, as it runs in ~1 minute. I can add it if you would like, but it is largely redundant. |
derekwaynecarr
left a comment
There was a problem hiding this comment.
thanks for doing this, just the one question on pending until cgroup is cleaned up as well, we could just block on PodContainerManager.Exists(pod)
| return false | ||
| } | ||
|
|
||
| func (kl *Kubelet) safeToDeletePod(pod *v1.Pod) bool { |
There was a problem hiding this comment.
i thought @vishh wanted to pend on cgroup deletion as well?
There was a problem hiding this comment.
Yes, we should also do that. Is: "kl.containerManager.NewPodContainerManager().Exists(pod)" the correct statement?
| // Map from (mirror) pod UID to latest status version successfully sent to the API server. | ||
| // apiStatusVersions must only be accessed from the sync thread. | ||
| apiStatusVersions map[types.UID]uint64 | ||
| // A function which returns true if the pod can saftely be deleted |
|
i would prefer to put this function in |
| return false | ||
| } | ||
|
|
||
| func (kl *Kubelet) safeToDeletePod(pod *v1.Pod) bool { |
There was a problem hiding this comment.
add some godoc here explaining what and why this exists as it will not be obvious to future folks... i know for a fact the relationship around the whole delete pod flow in kubelet is black magic to many.
| return false | ||
| } | ||
| if kl.podVolumesExist(pod.UID) && !kl.kubeletConfiguration.KeepTerminatedPodVolumes { | ||
| // We shouldnt delete pods whose volumes have not been cleaned up if we are not keeping terminated pod volumes |
There was a problem hiding this comment.
is it worth logging here or is there sufficient logging elsewhere if/when the volume could not be cleaned up (suspect there is, but worth confirming)
There was a problem hiding this comment.
There is logging related to volume unmounting in pkg/kubelet/volumemanager/reconciler/reconciler.go. My only concern is that most debugging people wont make the connection that their pod is not being deleted because unmounting a volume failed. But I'm not sure how to notify users without logging every time they try to delete a pod (which is useless).
|
cc @kubernetes/sig-storage-misc |
|
We might want to trigger deletion of volumes by posting a signal to the
volume manager using channels for example. Not necessary for this PR, but
we should watch out for test timeouts once this PR goes in.
…On Fri, Jan 20, 2017 at 12:46 PM, k8s-ci-robot ***@***.***> wrote:
Jenkins verification *failed*
<https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/40239/pull-kubernetes-verify/13653/>
for commit 7ddb531
<7ddb531>.
Full PR test history <http://pr-test.k8s.io/40239>. cc @dashpole
<https://github.com/dashpole>
The magic incantation to run this job again is @k8s-bot verify test this.
Please help us cut down flakes by linking to an open flake issue
<https://github.com/kubernetes/kubernetes/issues?q=is:issue+label:kind/flake+is:open>
when you hit one in your PR.
Instructions for interacting with me using PR comments are available here
<https://github.com/kubernetes/community/blob/master/contributors/devel/pull-request-commands.md>.
If you have questions or suggestions related to my behavior, please file an
issue against the kubernetes/test-infra
<https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:>
repository. I understand the commands that are listed here
<https://github.com/kubernetes/test-infra/blob/master/prow/commands.md>.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#40239 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AGvIKLYyDMsHEaKbFiXDTlsEbYrFvrmvks5rUR0LgaJpZM4LpsS4>
.
|
When/how will kubelet trigger the pod deletion after the volumes are cleaned up? |
7ddb531 to
44ce93c
Compare
| // apiStatusVersions must only be accessed from the sync thread. | ||
| apiStatusVersions map[types.UID]uint64 | ||
| // A function which returns true if the pod can saftely be deleted | ||
| safeToDeletePod func(pod *v1.Pod) bool |
There was a problem hiding this comment.
I'd recommend passing an interface instead.
type PodDeletionSafety interface {
OkToDelete(pod *v1.Pod) (bool, error)
}for example. You can then pass the Kubelet object itself.
There was a problem hiding this comment.
Great idea. I will change that.
| } | ||
|
|
||
| func (kl *Kubelet) safeToDeletePod(pod *v1.Pod) bool { | ||
| if kubepod.IsMirrorPod(pod) { |
There was a problem hiding this comment.
I dont have full context on how mirror pods work. This is taken directly from the old status manager code.
| glog.V(3).Infof("Pod %q is terminated, but some containers are still running", format.Pod(pod)) | ||
| return false | ||
| } | ||
| if kl.podVolumesExist(pod.UID) && !kl.kubeletConfiguration.KeepTerminatedPodVolumes { |
There was a problem hiding this comment.
We need to add pod cgroups here. I'm ok with adding that in a separate PR.
|
Sorry, I failed to post my comments earlier. |
I spoke to @dashpole offline. We need to verify that the deletion will still be triggered eventually before we can merge this PR. |
This is an issue, and I am exploring a couple solutions, and making sure I understand the problem. There actually is a test just like the one I made: test/e2e_node/simple_mount.go I havent been able to get cleaning up cgroups. Based on my tests, cleaning up volumes adds ~2 seconds to pod deletion time. Is this acceptable? |
a40b822 to
abda617
Compare
a7679a8 to
96cff41
Compare
|
All node e2e tests passed when I ran it in my own project, ill have to take a look at why the GCE etcd3 e2e test failed. |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED The following people have approved this PR: vishh Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
|
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge] |
|
@k8s-bot bazel test this |
|
@dashpole: The following test(s) failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
Automatic merge from submit-queue |
|
See #40824 for my first round of profiling (excluding this change). It appears that volume cleanup occurs almost immediately after deletion. Ill do some profiling including this change to see how the graph changes. |
|
@dashpole, just want to check whether you are still working on this PR? I reviewed a PR recently which might depends on this one. Thanks! |
|
I am still working on this. I think I have found why it added so much delay, but I need to confirm that. |
|
Thanks to @Random-Liu for helping me debug this. This graph shows deletion INCLUDING this change with a QPS limit of 60. As I have discussed with @vishh in the past, this change currently requires an additional round trip:
The problem that I did not realize existed is that round trips to the API server take an enormous amount of time when trying to update/delete 105 pods due to QPS constraints. This is easy to see when comparing the first and second graphs. |
|
To verify my theory, I did the following:
The following are the same as the graphs above, but including this change. It is clear that this change eliminates entirely the gap between containers being killed (RED) and volumes being cleaned up (LIGHT GREY). |
|
Since messing with the syncloop is a bad idea, I plan to modify the volume manager to use the status manager instead of the pod manager to determine when it can clean up a pod. This should have the same effect as the syncloop change since it eliminates the roundtrip to the API server |
Automatic merge from submit-queue [Kubelet] Delay deletion of pod from the API server until volumes are deleted Previous PR that was reverted: #40239. To summarize the conclusion of the previous PR after reverting: - The status manager has the most up-to-date status, but the volume manager uses the status from the pod manager, which only is as up-to-date as the API server. - Because of this, the previous change required an additional round trip between the kubelet and API server. - When few pods are being added or deleted, this is only a minor issue. However, when under heavy load, the QPS limit to the API server causes this round trip to take ~60 seconds, which is an unacceptable increase in latency. Take a look at the graphs in #40239 to see the effect of QPS changes on timing. - To remedy this, the volume manager looks at the status from the status manager, which eliminates the round trip. cc: @vishh @derekwaynecarr @sjenning @jingxu97 @kubernetes/sig-storage-misc
|
@dashpole Could you explain more about your fix? Since messing with the syncloop is a bad idea, I plan to modify the volume manager to use the status manager instead of the pod manager to determine when it can clean up a pod. This should have the same effect as the syncloop change since it eliminates the roundtrip to the API server Which "syncloop" you mean in the above comment? What is the status manager? |
|
The syncloop is the core kubelet thread that handles all updates to pod state. I wanted to avoid modifying it because it can have a lot of unforeseen impacts. The status manager is a kubelet component that is responsible for processing pod status updates, and propagates those updates to the API server. The pod manager is another kubelet component that keeps internal state about pods. The pod manager is updated when the kubelet receives a pod from the API server. In general, it is better to refer to the status manager for the most up-to-date pod status. However, the volume manager was checking to see if a pod was terminated based on the pod manager. This meant that in order for the volume manager to clean up a pod, the status change indicating containers had died had to propegate: -> status manager -> api server -> pod manager -> volume manager. |
|
I am investigating a bug possibly produced by this PR. I see the following in the kubelet logs during runs that are >2 minutes long:
These lines are followed by a long period of inaction on the pod, and finally a syncloop that updates the status to failed (correct), volumes are unmounted, and then pod deletion. To summarize: |
|
I have been unable to replicate this on head, so it must be a problem introduced by this PR |
|
Mystery solved! |
Automatic merge from submit-queue Fix bug in status manager TerminatePod In TerminatePod, we previously pass pod.Status to updateStatusInternal. This is a bug, since it is the original status that we are given. Not only does it skip updates made to container statuses, but in some cases it reverted the pod's status to an earlier version, since it was being passed a stale status initially. This was the case in #40239 and #41095. As shown in #40239, the pod's status is set to running after it is set to failed, occasionally causing very long delays in pod deletion since we have to wait for this to be corrected. This PR fixes the bug, adds some helpful debugging statements, and adds a unit test for TerminatePod (which for some reason didnt exist before?). @kubernetes/sig-node-bugs @vish @Random-Liu
Automatic merge from submit-queue (batch tested with PRs 41466, 41456, 41550, 41238, 41416) Delay Deletion of a Pod until volumes are cleaned up #41436 fixed the bug that caused #41095 and #40239 to have to be reverted. Now that the bug is fixed, this shouldn't cause problems. @vishh @derekwaynecarr @sjenning @jingxu97 @kubernetes/sig-storage-misc






Depends on #37228, and will not pass tests until that PR is merged, and this is rebased.
Keeps all kubelet behavior the same, except the kubelet will not make the "Delete" call (kubeClient.Core().Pods(pod.Namespace).Delete(pod.Name, deleteOptions)) until the volumes associated with that pod are removed. I will perform some performance testing so that we better understand the latency impact of this change.
Is kubelet_pods.go the correct file to include the "when can I delete this pod" logic?
cc: @vishh @sjenning @derekwaynecarr