Delay deletion of pod from the API server until volumes are deleted by dashpole · Pull Request #40239 · kubernetes/kubernetes

dashpole · 2017-01-20T20:03:32Z

Depends on #37228, and will not pass tests until that PR is merged, and this is rebased.

Keeps all kubelet behavior the same, except the kubelet will not make the "Delete" call (kubeClient.Core().Pods(pod.Namespace).Delete(pod.Name, deleteOptions)) until the volumes associated with that pod are removed. I will perform some performance testing so that we better understand the latency impact of this change.

Is kubelet_pods.go the correct file to include the "when can I delete this pod" logic?

cc: @vishh @sjenning @derekwaynecarr

k8s-reviewable · 2017-01-20T20:03:42Z

This change is

vishh · 2017-01-20T20:05:19Z

Is kubelet_pods.go the correct file to include the "when can I delete this pod" logic?

It can be in kubernetes/pkg/kubelet go package for now. File name is up to you.

vishh · 2017-01-20T20:09:10Z

This PR mostly LGTM. Can you add the e2e test you created for this purpose to this PR?

dashpole · 2017-01-20T20:10:50Z

@vishh This change is tested by any test that creates or deletes pods. I created the test to make testing my change quicker, as it runs in ~1 minute. I can add it if you would like, but it is largely redundant.

derekwaynecarr

thanks for doing this, just the one question on pending until cgroup is cleaned up as well, we could just block on PodContainerManager.Exists(pod)

derekwaynecarr · 2017-01-20T20:26:19Z

 	return false
 }

+func (kl *Kubelet) safeToDeletePod(pod *v1.Pod) bool {


i thought @vishh wanted to pend on cgroup deletion as well?

Yes, we should also do that. Is: "kl.containerManager.NewPodContainerManager().Exists(pod)" the correct statement?

that looks right to me.

derekwaynecarr · 2017-01-20T20:26:42Z

 	// Map from (mirror) pod UID to latest status version successfully sent to the API server.
 	// apiStatusVersions must only be accessed from the sync thread.
 	apiStatusVersions map[types.UID]uint64
+	// A function which returns true if the pod can saftely be deleted


nit: safely

derekwaynecarr · 2017-01-20T20:28:41Z

i would prefer to put this function in kubelet_pods.go (cc @pmorie ) .

derekwaynecarr · 2017-01-20T20:29:50Z

 	return false
 }

+func (kl *Kubelet) safeToDeletePod(pod *v1.Pod) bool {


add some godoc here explaining what and why this exists as it will not be obvious to future folks... i know for a fact the relationship around the whole delete pod flow in kubelet is black magic to many.

derekwaynecarr · 2017-01-20T20:32:03Z

+		return false
+	}
+	if kl.podVolumesExist(pod.UID) && !kl.kubeletConfiguration.KeepTerminatedPodVolumes {
+		// We shouldnt delete pods whose volumes have not been cleaned up if we are not keeping terminated pod volumes


is it worth logging here or is there sufficient logging elsewhere if/when the volume could not be cleaned up (suspect there is, but worth confirming)

There is logging related to volume unmounting in pkg/kubelet/volumemanager/reconciler/reconciler.go. My only concern is that most debugging people wont make the connection that their pod is not being deleted because unmounting a volume failed. But I'm not sure how to notify users without logging every time they try to delete a pod (which is useless).

gnufied · 2017-01-20T20:36:59Z

cc @kubernetes/sig-storage-misc

vishh · 2017-01-20T21:09:23Z

We might want to trigger deletion of volumes by posting a signal to the volume manager using channels for example. Not necessary for this PR, but we should watch out for test timeouts once this PR goes in.

…

On Fri, Jan 20, 2017 at 12:46 PM, k8s-ci-robot ***@***.***> wrote: Jenkins verification *failed* <https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/40239/pull-kubernetes-verify/13653/> for commit 7ddb531 <7ddb531>. Full PR test history <http://pr-test.k8s.io/40239>. cc @dashpole <https://github.com/dashpole> The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue <https://github.com/kubernetes/kubernetes/issues?q=is:issue+label:kind/flake+is:open> when you hit one in your PR. Instructions for interacting with me using PR comments are available here <https://github.com/kubernetes/community/blob/master/contributors/devel/pull-request-commands.md>. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra <https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:> repository. I understand the commands that are listed here <https://github.com/kubernetes/test-infra/blob/master/prow/commands.md>. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#40239 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AGvIKLYyDMsHEaKbFiXDTlsEbYrFvrmvks5rUR0LgaJpZM4LpsS4> .

sjenning · 2017-01-20T21:20:19Z

@dashpole #37228 merged about 15 minutes after you posted this :) You can rebase now to see if the tests pass. The PR lgtm.

yujuhong · 2017-01-20T21:23:05Z

Keeps all kubelet behavior the same, except the kubelet will not make the "Delete" call (kubeClient.Core().Pods(pod.Namespace).Delete(pod.Name, deleteOptions)) until the volumes associated with that pod are removed.

When/how will kubelet trigger the pod deletion after the volumes are cleaned up?

vishh · 2017-01-20T20:06:59Z

 	// apiStatusVersions must only be accessed from the sync thread.
 	apiStatusVersions map[types.UID]uint64
+	// A function which returns true if the pod can saftely be deleted
+	safeToDeletePod func(pod *v1.Pod) bool


I'd recommend passing an interface instead.

type PodDeletionSafety interface { OkToDelete(pod *v1.Pod) (bool, error) }

for example. You can then pass the Kubelet object itself.

Great idea. I will change that.

vishh · 2017-01-20T20:07:40Z

 }

+func (kl *Kubelet) safeToDeletePod(pod *v1.Pod) bool {
+	if kubepod.IsMirrorPod(pod) {


I dont have full context on how mirror pods work. This is taken directly from the old status manager code.

vishh · 2017-01-20T20:08:29Z

+		glog.V(3).Infof("Pod %q is terminated, but some containers are still running", format.Pod(pod))
+		return false
+	}
+	if kl.podVolumesExist(pod.UID) && !kl.kubeletConfiguration.KeepTerminatedPodVolumes {


We need to add pod cgroups here. I'm ok with adding that in a separate PR.

vishh · 2017-01-20T21:33:23Z

Sorry, I failed to post my comments earlier.

yujuhong · 2017-01-20T22:44:30Z

When/how will kubelet trigger the pod deletion after the volumes are cleaned up?

I spoke to @dashpole offline. We need to verify that the deletion will still be triggered eventually before we can merge this PR.

dashpole · 2017-01-24T19:17:19Z

When/how will kubelet trigger the pod deletion after the volumes are cleaned up?

This is an issue, and I am exploring a couple solutions, and making sure I understand the problem.

There actually is a test just like the one I made: test/e2e_node/simple_mount.go

I havent been able to get cleaning up cgroups.

Based on my tests, cleaning up volumes adds ~2 seconds to pod deletion time. Is this acceptable?

dashpole · 2017-01-25T03:17:13Z

All node e2e tests passed when I ran it in my own project, ill have to take a look at why the GCE etcd3 e2e test failed.

vishh · 2017-01-26T18:20:25Z

/approve

k8s-github-robot · 2017-01-26T18:21:36Z

[APPROVALNOTIFIER] This PR is APPROVED

The following people have approved this PR: vishh

Needs approval from an approver in each of these OWNERS Files:

~~pkg/kubelet/OWNERS~~ [vishh]
~~pkg/kubelet/volumemanager/OWNERS~~ [vishh]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

k8s-github-robot · 2017-01-26T20:46:08Z

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

dashpole · 2017-01-26T21:06:47Z

@k8s-bot bazel test this

dashpole · 2017-01-26T21:34:19Z

@k8s-bot gci gke e2e test this
@k8s-bot bazel test this

k8s-ci-robot · 2017-01-26T21:45:25Z

@dashpole: The following test(s) failed:

Test name	Commit	Details	Rerun command
Jenkins GCI GKE smoke e2e	`b1b9686`	link	`@k8s-bot gci gke e2e test this`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-github-robot · 2017-01-26T21:46:02Z

Automatic merge from submit-queue

dashpole · 2017-02-01T23:07:31Z

See #40824 for my first round of profiling (excluding this change). It appears that volume cleanup occurs almost immediately after deletion. Ill do some profiling including this change to see how the graph changes.

dashpole · 2017-02-02T19:02:13Z

This change appears to add about 60 seconds of additional latency for pod deletion.
Here is the new profile:

This is the comparison of the 50th, 90th, and 99th percentile of latency. All three increase almost exactly 60 seconds from the previous value.

This is surprisingly high, and I plan to look into why that is.

jingxu97 · 2017-02-06T17:34:15Z

@dashpole, just want to check whether you are still working on this PR? I reviewed a PR recently which might depends on this one. Thanks!

dashpole · 2017-02-06T18:28:49Z

I am still working on this. I think I have found why it added so much delay, but I need to confirm that.

dashpole · 2017-02-06T19:36:30Z

Thanks to @Random-Liu for helping me debug this.
This graph shows deletion INCLUDING this change with a QPS limit of 5.

This graph shows deletion INCLUDING this change with a QPS limit of 60.

As I have discussed with @vishh in the past, this change currently requires an additional round trip:

Deletion timestamp set via API call to kubelet; pod manager updated to include deletion timestamp (DARK GREY)
Containers killed, causes status update to the API server (RED)
ROUND TRIP to API server
Containers terminated reflected back to kubelet via API call; pod manager updated to include terminated containers
Volume Manager tears down volumes, as it sees that containers are terminated via pod manager. (LIGHT GREY)
Pod is Deleted from the API server. (PURPLE)

The problem that I did not realize existed is that round trips to the API server take an enormous amount of time when trying to update/delete 105 pods due to QPS constraints. This is easy to see when comparing the first and second graphs.

dashpole · 2017-02-06T21:23:26Z

To verify my theory, I did the following:

In Step 2 above, I also updated the pod manager to reflect that the containers had died

The following are the same as the graphs above, but including this change.

QPS = 5

QPS = 60

It is clear that this change eliminates entirely the gap between containers being killed (RED) and volumes being cleaned up (LIGHT GREY).

dashpole · 2017-02-06T21:32:40Z

Since messing with the syncloop is a bad idea, I plan to modify the volume manager to use the status manager instead of the pod manager to determine when it can clean up a pod. This should have the same effect as the syncloop change since it eliminates the roundtrip to the API server

@vishh

Automatic merge from submit-queue [Kubelet] Delay deletion of pod from the API server until volumes are deleted Previous PR that was reverted: #40239. To summarize the conclusion of the previous PR after reverting: - The status manager has the most up-to-date status, but the volume manager uses the status from the pod manager, which only is as up-to-date as the API server. - Because of this, the previous change required an additional round trip between the kubelet and API server. - When few pods are being added or deleted, this is only a minor issue. However, when under heavy load, the QPS limit to the API server causes this round trip to take ~60 seconds, which is an unacceptable increase in latency. Take a look at the graphs in #40239 to see the effect of QPS changes on timing. - To remedy this, the volume manager looks at the status from the status manager, which eliminates the round trip. cc: @vishh @derekwaynecarr @sjenning @jingxu97 @kubernetes/sig-storage-misc

jingxu97 · 2017-02-09T22:55:20Z

@dashpole Could you explain more about your fix?

Since messing with the syncloop is a bad idea, I plan to modify the volume manager to use the status manager instead of the pod manager to determine when it can clean up a pod. This should have the same effect as the syncloop change since it eliminates the roundtrip to the API server

Which "syncloop" you mean in the above comment? What is the status manager?

dashpole · 2017-02-09T23:27:11Z

The syncloop is the core kubelet thread that handles all updates to pod state. I wanted to avoid modifying it because it can have a lot of unforeseen impacts. The status manager is a kubelet component that is responsible for processing pod status updates, and propagates those updates to the API server. The pod manager is another kubelet component that keeps internal state about pods. The pod manager is updated when the kubelet receives a pod from the API server. In general, it is better to refer to the status manager for the most up-to-date pod status. However, the volume manager was checking to see if a pod was terminated based on the pod manager. This meant that in order for the volume manager to clean up a pod, the status change indicating containers had died had to propegate: -> status manager -> api server -> pod manager -> volume manager.
My change is to make the volume manager check the status manager when checking to see if a pod is terminated: ->status manager -> volume manager. This eliminates the round trip from the previous behavior.

dashpole · 2017-02-13T21:07:51Z

I am investigating a bug possibly produced by this PR. I see the following in the kubelet logs during runs that are >2 minutes long:
Feb 13 20:22:02 test-gci-dev-57-9202-20-0 kubelet[7670]: I0213 20:22:02.242720 7670 status_manager.go:439] Status for pod "pod-with-prestop-exec-hook_e2e-tests-container-lifecycle-hook-3b0pj(07a0b6d4-f22a-11e6-9fae-42010a800002)" updated successfully: {status:{Phase:Failed Conditions:[{Type:Initialized Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x49cad80}} LastTransitionTime:{Time:{sec:63622614102 nsec:0 loc:0x49cad80}} Reason: Message:} {Type:Ready Status:False LastProbeTime:{Time:{sec:0 nsec:0 loc:0x49cad80}} LastTransitionTime:{Time:{sec:63622614122 nsec:0 loc:0x49cad80}} Reason:ContainersNotReady Message:containers with unready status: [pod-with-prestop-exec-hook]} {Type:PodScheduled Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x49cad80}} LastTransitionTime:{Time:{sec:63622614103 nsec:0 loc:0x49cad80}} Reason: Message:}] Message: Reason: HostIP:10.128.0.2 PodIP: StartTime:0xc420fd7f00 InitContainerStatuses:[] ContainerStatuses:[{Name:pod-with-prestop-exec-hook State:{Waiting:<nil> Running:<nil> Terminated:0xc420ed2700} LastTerminationState:{Waiting:<nil> Running:<nil> Terminated:<nil>} Ready:false RestartCount:0 Image:gcr.io/google_containers/busybox:1.24 ImageID:docker-pullable://gcr.io/google_containers/busybox@sha256:4bdd623e848417d96127e16037743f0cd8b528c026e9175e22a84f639eca58ff ContainerID:docker://cacdcea4166da97894fb0ba3fd49ee02aef4dd4fd90f3d050983ab21556df30a}] QOSClass:BestEffort} version:5 podName:pod-with-prestop-exec-hook podNamespace:e2e-tests-container-lifecycle-hook-3b0pj}

Feb 13 20:22:02 test-gci-dev-57-9202-20-0 kubelet[7670]: I0213 20:22:02.243002 7670 kubelet_pods.go:745] Pod "pod-with-prestop-exec-hook_e2e-tests-container-lifecycle-hook-3b0pj(07a0b6d4-f22a-11e6-9fae-42010a800002)" is terminated, but some volumes have not been cleaned up

Feb 13 20:22:02 test-gci-dev-57-9202-20-0 kubelet[7670]: I0213 20:22:02.250713 7670 status_manager.go:439] Status for pod "pod-with-prestop-exec-hook_e2e-tests-container-lifecycle-hook-3b0pj(07a0b6d4-f22a-11e6-9fae-42010a800002)" updated successfully: {status:{Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x49cad80}} LastTransitionTime:{Time:{sec:63622614102 nsec:0 loc:0x49cad80}} Reason: Message:} {Type:Ready Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x49cad80}} LastTransitionTime:{Time:{sec:63622614122 nsec:0 loc:0x49cad80}} Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x49cad80}} LastTransitionTime:{Time:{sec:63622614103 nsec:0 loc:0x49cad80}} Reason: Message:}] Message: Reason: HostIP:10.128.0.2 PodIP:10.180.0.246 StartTime:0xc420fd7f00 InitContainerStatuses:[] ContainerStatuses:[{Name:pod-with-prestop-exec-hook State:{Waiting:<nil> Running:0xc420bd38e0 Terminated:<nil>} LastTerminationState:{Waiting:<nil> Running:<nil> Terminated:<nil>} Ready:true RestartCount:0 Image:gcr.io/google_containers/busybox:1.24 ImageID:docker-pullable://gcr.io/google_containers/busybox@sha256:4bdd623e848417d96127e16037743f0cd8b528c026e9175e22a84f639eca58ff ContainerID:docker://cacdcea4166da97894fb0ba3fd49ee02aef4dd4fd90f3d050983ab21556df30a}] QOSClass:BestEffort} version:6 podName:pod-with-prestop-exec-hook podNamespace:e2e-tests-container-lifecycle-hook-3b0pj} Feb 13 20:22:02 test-gci-dev-57-9202-20-0 kubelet[7670]: I0213 20:22:02.250793 7670 kubelet_pods.go:740] Pod "pod-with-prestop-exec-hook_e2e-tests-container-lifecycle-hook-3b0pj(07a0b6d4-f22a-11e6-9fae-42010a800002)" is terminated, but some containers are still running

These lines are followed by a long period of inaction on the pod, and finally a syncloop that updates the status to failed (correct), volumes are unmounted, and then pod deletion.

To summarize:
Set Pod Status to Failed
Cant delete pod because volumes are still mounted
Set Pod Status to Running
..... long period of time ~1-2 minutes .......
syncloop sync pod
Set Pod Status to Failed
Unmount Pod's volumes
Successfully delete Pod

dashpole · 2017-02-13T21:34:17Z

I have been unable to replicate this on head, so it must be a problem introduced by this PR

dashpole · 2017-02-14T20:13:46Z

Mystery solved!

@vish

Automatic merge from submit-queue Fix bug in status manager TerminatePod In TerminatePod, we previously pass pod.Status to updateStatusInternal. This is a bug, since it is the original status that we are given. Not only does it skip updates made to container statuses, but in some cases it reverted the pod's status to an earlier version, since it was being passed a stale status initially. This was the case in #40239 and #41095. As shown in #40239, the pod's status is set to running after it is set to failed, occasionally causing very long delays in pod deletion since we have to wait for this to be corrected. This PR fixes the bug, adds some helpful debugging statements, and adds a unit test for TerminatePod (which for some reason didnt exist before?). @kubernetes/sig-node-bugs @vish @Random-Liu

@vishh

Automatic merge from submit-queue (batch tested with PRs 41466, 41456, 41550, 41238, 41416) Delay Deletion of a Pod until volumes are cleaned up #41436 fixed the bug that caused #41095 and #40239 to have to be reverted. Now that the bug is fixed, this shouldn't cause problems. @vishh @derekwaynecarr @sjenning @jingxu97 @kubernetes/sig-storage-misc

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 20, 2017

dashpole changed the title ~~dont delete pods until volumes are deleted~~ Delay deletion of pod from the API server until volumes are deleted Jan 20, 2017

k8s-github-robot assigned vishh Jan 20, 2017

k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Jan 20, 2017

derekwaynecarr reviewed Jan 20, 2017

View reviewed changes

dashpole force-pushed the disk_pod_lifecycle branch from 7ddb531 to 44ce93c Compare January 20, 2017 21:29

vishh reviewed Jan 20, 2017

View reviewed changes

k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 20, 2017

dashpole force-pushed the disk_pod_lifecycle branch from a40b822 to abda617 Compare January 24, 2017 19:57

k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 24, 2017

dashpole force-pushed the disk_pod_lifecycle branch 2 times, most recently from a7679a8 to 96cff41 Compare January 25, 2017 00:25

k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 26, 2017

k8s-github-robot merged commit 7857b9a into kubernetes:master Jan 26, 2017

dashpole deleted the disk_pod_lifecycle branch January 26, 2017 21:49

aleksandra-malinowska mentioned this pull request Jan 27, 2017

Revert "Delay deletion of pod from the API server until volumes are deleted" #40592

Merged

jingxu97 mentioned this pull request Feb 3, 2017

Fix issue #34242: Attach/detach should recover from a crash #39732

Merged

dashpole mentioned this pull request Feb 7, 2017

[Kubelet] Delay deletion of pod from the API server until volumes are deleted #41095

Merged

dashpole mentioned this pull request Feb 14, 2017

Fix bug in status manager TerminatePod #41436

Merged

dashpole mentioned this pull request Feb 15, 2017

Delay Deletion of a Pod until volumes are cleaned up #41456

Merged

Conversation

dashpole commented Jan 20, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-reviewable commented Jan 20, 2017

Uh oh!

vishh commented Jan 20, 2017

Uh oh!

vishh commented Jan 20, 2017

Uh oh!

dashpole commented Jan 20, 2017

Uh oh!

derekwaynecarr left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

derekwaynecarr commented Jan 20, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gnufied commented Jan 20, 2017

Uh oh!

vishh commented Jan 20, 2017 via email

Uh oh!

sjenning commented Jan 20, 2017

Uh oh!

yujuhong commented Jan 20, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vishh commented Jan 20, 2017

Uh oh!

yujuhong commented Jan 20, 2017

Uh oh!

dashpole commented Jan 24, 2017

Uh oh!

dashpole commented Jan 25, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vishh commented Jan 26, 2017

Uh oh!

k8s-github-robot commented Jan 26, 2017

Uh oh!

k8s-github-robot commented Jan 26, 2017

Uh oh!

dashpole commented Jan 26, 2017

Uh oh!

dashpole commented Jan 26, 2017

Uh oh!

k8s-ci-robot commented Jan 26, 2017

Uh oh!

k8s-github-robot commented Jan 26, 2017

Uh oh!

dashpole commented Feb 1, 2017

Uh oh!

dashpole commented Jan 20, 2017 •

edited

Loading

dashpole commented Jan 25, 2017 •

edited

Loading

dashpole commented Feb 9, 2017 •

edited

Loading