Skip to content

Delay deletion of pod from the API server until volumes are deleted#40239

Merged
k8s-github-robot merged 2 commits into
kubernetes:masterfrom
dashpole:disk_pod_lifecycle
Jan 26, 2017
Merged

Delay deletion of pod from the API server until volumes are deleted#40239
k8s-github-robot merged 2 commits into
kubernetes:masterfrom
dashpole:disk_pod_lifecycle

Conversation

@dashpole

@dashpole dashpole commented Jan 20, 2017

Copy link
Copy Markdown
Contributor

Depends on #37228, and will not pass tests until that PR is merged, and this is rebased.

Keeps all kubelet behavior the same, except the kubelet will not make the "Delete" call (kubeClient.Core().Pods(pod.Namespace).Delete(pod.Name, deleteOptions)) until the volumes associated with that pod are removed. I will perform some performance testing so that we better understand the latency impact of this change.

Is kubelet_pods.go the correct file to include the "when can I delete this pod" logic?

cc: @vishh @sjenning @derekwaynecarr

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 20, 2017
@k8s-reviewable

Copy link
Copy Markdown

This change is Reviewable

@vishh

vishh commented Jan 20, 2017

Copy link
Copy Markdown
Contributor

Is kubelet_pods.go the correct file to include the "when can I delete this pod" logic?

It can be in kubernetes/pkg/kubelet go package for now. File name is up to you.

@dashpole dashpole changed the title dont delete pods until volumes are deleted Delay deletion of pod from the API server until volumes are deleted Jan 20, 2017
@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Jan 20, 2017
@vishh

vishh commented Jan 20, 2017

Copy link
Copy Markdown
Contributor

This PR mostly LGTM. Can you add the e2e test you created for this purpose to this PR?

@dashpole

Copy link
Copy Markdown
Contributor Author

@vishh This change is tested by any test that creates or deletes pods. I created the test to make testing my change quicker, as it runs in ~1 minute. I can add it if you would like, but it is largely redundant.

@derekwaynecarr derekwaynecarr left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for doing this, just the one question on pending until cgroup is cleaned up as well, we could just block on PodContainerManager.Exists(pod)

Comment thread pkg/kubelet/kubelet_pods.go Outdated
return false
}

func (kl *Kubelet) safeToDeletePod(pod *v1.Pod) bool {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought @vishh wanted to pend on cgroup deletion as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should also do that. Is: "kl.containerManager.NewPodContainerManager().Exists(pod)" the correct statement?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that looks right to me.

Comment thread pkg/kubelet/status/status_manager.go Outdated
// Map from (mirror) pod UID to latest status version successfully sent to the API server.
// apiStatusVersions must only be accessed from the sync thread.
apiStatusVersions map[types.UID]uint64
// A function which returns true if the pod can saftely be deleted

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: safely

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@derekwaynecarr

Copy link
Copy Markdown
Member

i would prefer to put this function in kubelet_pods.go (cc @pmorie ) .

Comment thread pkg/kubelet/kubelet_pods.go Outdated
return false
}

func (kl *Kubelet) safeToDeletePod(pod *v1.Pod) bool {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some godoc here explaining what and why this exists as it will not be obvious to future folks... i know for a fact the relationship around the whole delete pod flow in kubelet is black magic to many.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return false
}
if kl.podVolumesExist(pod.UID) && !kl.kubeletConfiguration.KeepTerminatedPodVolumes {
// We shouldnt delete pods whose volumes have not been cleaned up if we are not keeping terminated pod volumes

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it worth logging here or is there sufficient logging elsewhere if/when the volume could not be cleaned up (suspect there is, but worth confirming)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is logging related to volume unmounting in pkg/kubelet/volumemanager/reconciler/reconciler.go. My only concern is that most debugging people wont make the connection that their pod is not being deleted because unmounting a volume failed. But I'm not sure how to notify users without logging every time they try to delete a pod (which is useless).

@gnufied

gnufied commented Jan 20, 2017

Copy link
Copy Markdown
Member

cc @kubernetes/sig-storage-misc

@vishh

vishh commented Jan 20, 2017 via email

Copy link
Copy Markdown
Contributor

@sjenning

Copy link
Copy Markdown
Contributor

@dashpole #37228 merged about 15 minutes after you posted this :) You can rebase now to see if the tests pass. The PR lgtm.

@yujuhong

Copy link
Copy Markdown
Contributor

Keeps all kubelet behavior the same, except the kubelet will not make the "Delete" call (kubeClient.Core().Pods(pod.Namespace).Delete(pod.Name, deleteOptions)) until the volumes associated with that pod are removed.

When/how will kubelet trigger the pod deletion after the volumes are cleaned up?

Comment thread pkg/kubelet/status/status_manager.go Outdated
// apiStatusVersions must only be accessed from the sync thread.
apiStatusVersions map[types.UID]uint64
// A function which returns true if the pod can saftely be deleted
safeToDeletePod func(pod *v1.Pod) bool

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend passing an interface instead.

type PodDeletionSafety interface {
  OkToDelete(pod *v1.Pod) (bool, error)
}

for example. You can then pass the Kubelet object itself.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea. I will change that.

Comment thread pkg/kubelet/kubelet_pods.go Outdated
}

func (kl *Kubelet) safeToDeletePod(pod *v1.Pod) bool {
if kubepod.IsMirrorPod(pod) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why so?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont have full context on how mirror pods work. This is taken directly from the old status manager code.

glog.V(3).Infof("Pod %q is terminated, but some containers are still running", format.Pod(pod))
return false
}
if kl.podVolumesExist(pod.UID) && !kl.kubeletConfiguration.KeepTerminatedPodVolumes {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add pod cgroups here. I'm ok with adding that in a separate PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

@vishh

vishh commented Jan 20, 2017

Copy link
Copy Markdown
Contributor

Sorry, I failed to post my comments earlier.

@yujuhong

Copy link
Copy Markdown
Contributor

When/how will kubelet trigger the pod deletion after the volumes are cleaned up?

I spoke to @dashpole offline. We need to verify that the deletion will still be triggered eventually before we can merge this PR.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 20, 2017
@dashpole

Copy link
Copy Markdown
Contributor Author

When/how will kubelet trigger the pod deletion after the volumes are cleaned up?

This is an issue, and I am exploring a couple solutions, and making sure I understand the problem.

There actually is a test just like the one I made: test/e2e_node/simple_mount.go

I havent been able to get cleaning up cgroups.

Based on my tests, cleaning up volumes adds ~2 seconds to pod deletion time. Is this acceptable?

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 24, 2017
@dashpole dashpole force-pushed the disk_pod_lifecycle branch 2 times, most recently from a7679a8 to 96cff41 Compare January 25, 2017 00:25
@dashpole

dashpole commented Jan 25, 2017

Copy link
Copy Markdown
Contributor Author

All node e2e tests passed when I ran it in my own project, ill have to take a look at why the GCE etcd3 e2e test failed.

@vishh

vishh commented Jan 26, 2017

Copy link
Copy Markdown
Contributor

/approve

@k8s-github-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

The following people have approved this PR: vishh

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 26, 2017
@k8s-github-robot

Copy link
Copy Markdown

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@dashpole

Copy link
Copy Markdown
Contributor Author

@k8s-bot bazel test this

@dashpole

Copy link
Copy Markdown
Contributor Author

@k8s-bot gci gke e2e test this
@k8s-bot bazel test this

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

@dashpole: The following test(s) failed:

Test name Commit Details Rerun command
Jenkins GCI GKE smoke e2e b1b9686 link @k8s-bot gci gke e2e test this

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot

Copy link
Copy Markdown

Automatic merge from submit-queue

@dashpole

dashpole commented Feb 1, 2017

Copy link
Copy Markdown
Contributor Author

See #40824 for my first round of profiling (excluding this change). It appears that volume cleanup occurs almost immediately after deletion. Ill do some profiling including this change to see how the graph changes.

@dashpole

dashpole commented Feb 2, 2017

Copy link
Copy Markdown
Contributor Author

This change appears to add about 60 seconds of additional latency for pod deletion.
Here is the new profile:
screenshot from 2017-02-02 10 57 53

This is the comparison of the 50th, 90th, and 99th percentile of latency. All three increase almost exactly 60 seconds from the previous value.
screenshot from 2017-02-02 11 00 04

This is surprisingly high, and I plan to look into why that is.

@jingxu97

jingxu97 commented Feb 6, 2017

Copy link
Copy Markdown
Contributor

@dashpole, just want to check whether you are still working on this PR? I reviewed a PR recently which might depends on this one. Thanks!

@dashpole

dashpole commented Feb 6, 2017

Copy link
Copy Markdown
Contributor Author

I am still working on this. I think I have found why it added so much delay, but I need to confirm that.

@dashpole

dashpole commented Feb 6, 2017

Copy link
Copy Markdown
Contributor Author

Thanks to @Random-Liu for helping me debug this.
This graph shows deletion INCLUDING this change with a QPS limit of 5.
screenshot from 2017-02-06 11 00 58

This graph shows deletion INCLUDING this change with a QPS limit of 60.
screenshot from 2017-02-06 11 01 53

As I have discussed with @vishh in the past, this change currently requires an additional round trip:

  1. Deletion timestamp set via API call to kubelet; pod manager updated to include deletion timestamp (DARK GREY)
  2. Containers killed, causes status update to the API server (RED)
  3. ROUND TRIP to API server
  4. Containers terminated reflected back to kubelet via API call; pod manager updated to include terminated containers
  5. Volume Manager tears down volumes, as it sees that containers are terminated via pod manager. (LIGHT GREY)
  6. Pod is Deleted from the API server. (PURPLE)

The problem that I did not realize existed is that round trips to the API server take an enormous amount of time when trying to update/delete 105 pods due to QPS constraints. This is easy to see when comparing the first and second graphs.

@dashpole

dashpole commented Feb 6, 2017

Copy link
Copy Markdown
Contributor Author

To verify my theory, I did the following:

  • In Step 2 above, I also updated the pod manager to reflect that the containers had died

The following are the same as the graphs above, but including this change.

QPS = 5
screenshot from 2017-02-06 13 14 44

QPS = 60
screenshot from 2017-02-06 13 14 27

It is clear that this change eliminates entirely the gap between containers being killed (RED) and volumes being cleaned up (LIGHT GREY).

@dashpole

dashpole commented Feb 6, 2017

Copy link
Copy Markdown
Contributor Author

Since messing with the syncloop is a bad idea, I plan to modify the volume manager to use the status manager instead of the pod manager to determine when it can clean up a pod. This should have the same effect as the syncloop change since it eliminates the roundtrip to the API server

k8s-github-robot pushed a commit that referenced this pull request Feb 8, 2017
Automatic merge from submit-queue

[Kubelet] Delay deletion of pod from the API server until volumes are deleted

Previous PR that was reverted: #40239.

To summarize the conclusion of the previous PR after reverting:
- The status manager has the most up-to-date status, but the volume manager uses the status from the pod manager, which only is as up-to-date as the API server.
- Because of this, the previous change required an additional round trip between the kubelet and API server.
- When few pods are being added or deleted, this is only a minor issue.  However, when under heavy load, the QPS limit to the API server causes this round trip to take ~60 seconds, which is an unacceptable increase in latency. Take a look at the graphs in #40239 to see the effect of QPS changes on timing.
- To remedy this, the volume manager looks at the status from the status manager, which eliminates the round trip.

cc: @vishh @derekwaynecarr @sjenning @jingxu97 @kubernetes/sig-storage-misc
@jingxu97

jingxu97 commented Feb 9, 2017

Copy link
Copy Markdown
Contributor

@dashpole Could you explain more about your fix?

Since messing with the syncloop is a bad idea, I plan to modify the volume manager to use the status manager instead of the pod manager to determine when it can clean up a pod. This should have the same effect as the syncloop change since it eliminates the roundtrip to the API server

Which "syncloop" you mean in the above comment? What is the status manager?

@dashpole

dashpole commented Feb 9, 2017

Copy link
Copy Markdown
Contributor Author

The syncloop is the core kubelet thread that handles all updates to pod state. I wanted to avoid modifying it because it can have a lot of unforeseen impacts. The status manager is a kubelet component that is responsible for processing pod status updates, and propagates those updates to the API server. The pod manager is another kubelet component that keeps internal state about pods. The pod manager is updated when the kubelet receives a pod from the API server. In general, it is better to refer to the status manager for the most up-to-date pod status. However, the volume manager was checking to see if a pod was terminated based on the pod manager. This meant that in order for the volume manager to clean up a pod, the status change indicating containers had died had to propegate: -> status manager -> api server -> pod manager -> volume manager.
My change is to make the volume manager check the status manager when checking to see if a pod is terminated: ->status manager -> volume manager. This eliminates the round trip from the previous behavior.

@dashpole

Copy link
Copy Markdown
Contributor Author

I am investigating a bug possibly produced by this PR. I see the following in the kubelet logs during runs that are >2 minutes long:
Feb 13 20:22:02 test-gci-dev-57-9202-20-0 kubelet[7670]: I0213 20:22:02.242720 7670 status_manager.go:439] Status for pod "pod-with-prestop-exec-hook_e2e-tests-container-lifecycle-hook-3b0pj(07a0b6d4-f22a-11e6-9fae-42010a800002)" updated successfully: {status:{Phase:Failed Conditions:[{Type:Initialized Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x49cad80}} LastTransitionTime:{Time:{sec:63622614102 nsec:0 loc:0x49cad80}} Reason: Message:} {Type:Ready Status:False LastProbeTime:{Time:{sec:0 nsec:0 loc:0x49cad80}} LastTransitionTime:{Time:{sec:63622614122 nsec:0 loc:0x49cad80}} Reason:ContainersNotReady Message:containers with unready status: [pod-with-prestop-exec-hook]} {Type:PodScheduled Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x49cad80}} LastTransitionTime:{Time:{sec:63622614103 nsec:0 loc:0x49cad80}} Reason: Message:}] Message: Reason: HostIP:10.128.0.2 PodIP: StartTime:0xc420fd7f00 InitContainerStatuses:[] ContainerStatuses:[{Name:pod-with-prestop-exec-hook State:{Waiting:<nil> Running:<nil> Terminated:0xc420ed2700} LastTerminationState:{Waiting:<nil> Running:<nil> Terminated:<nil>} Ready:false RestartCount:0 Image:gcr.io/google_containers/busybox:1.24 ImageID:docker-pullable://gcr.io/google_containers/busybox@sha256:4bdd623e848417d96127e16037743f0cd8b528c026e9175e22a84f639eca58ff ContainerID:docker://cacdcea4166da97894fb0ba3fd49ee02aef4dd4fd90f3d050983ab21556df30a}] QOSClass:BestEffort} version:5 podName:pod-with-prestop-exec-hook podNamespace:e2e-tests-container-lifecycle-hook-3b0pj}

Feb 13 20:22:02 test-gci-dev-57-9202-20-0 kubelet[7670]: I0213 20:22:02.243002 7670 kubelet_pods.go:745] Pod "pod-with-prestop-exec-hook_e2e-tests-container-lifecycle-hook-3b0pj(07a0b6d4-f22a-11e6-9fae-42010a800002)" is terminated, but some volumes have not been cleaned up

Feb 13 20:22:02 test-gci-dev-57-9202-20-0 kubelet[7670]: I0213 20:22:02.250713 7670 status_manager.go:439] Status for pod "pod-with-prestop-exec-hook_e2e-tests-container-lifecycle-hook-3b0pj(07a0b6d4-f22a-11e6-9fae-42010a800002)" updated successfully: {status:{Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x49cad80}} LastTransitionTime:{Time:{sec:63622614102 nsec:0 loc:0x49cad80}} Reason: Message:} {Type:Ready Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x49cad80}} LastTransitionTime:{Time:{sec:63622614122 nsec:0 loc:0x49cad80}} Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x49cad80}} LastTransitionTime:{Time:{sec:63622614103 nsec:0 loc:0x49cad80}} Reason: Message:}] Message: Reason: HostIP:10.128.0.2 PodIP:10.180.0.246 StartTime:0xc420fd7f00 InitContainerStatuses:[] ContainerStatuses:[{Name:pod-with-prestop-exec-hook State:{Waiting:<nil> Running:0xc420bd38e0 Terminated:<nil>} LastTerminationState:{Waiting:<nil> Running:<nil> Terminated:<nil>} Ready:true RestartCount:0 Image:gcr.io/google_containers/busybox:1.24 ImageID:docker-pullable://gcr.io/google_containers/busybox@sha256:4bdd623e848417d96127e16037743f0cd8b528c026e9175e22a84f639eca58ff ContainerID:docker://cacdcea4166da97894fb0ba3fd49ee02aef4dd4fd90f3d050983ab21556df30a}] QOSClass:BestEffort} version:6 podName:pod-with-prestop-exec-hook podNamespace:e2e-tests-container-lifecycle-hook-3b0pj} Feb 13 20:22:02 test-gci-dev-57-9202-20-0 kubelet[7670]: I0213 20:22:02.250793 7670 kubelet_pods.go:740] Pod "pod-with-prestop-exec-hook_e2e-tests-container-lifecycle-hook-3b0pj(07a0b6d4-f22a-11e6-9fae-42010a800002)" is terminated, but some containers are still running

These lines are followed by a long period of inaction on the pod, and finally a syncloop that updates the status to failed (correct), volumes are unmounted, and then pod deletion.

To summarize:
Set Pod Status to Failed
Cant delete pod because volumes are still mounted
Set Pod Status to Running
..... long period of time ~1-2 minutes .......
syncloop sync pod
Set Pod Status to Failed
Unmount Pod's volumes
Successfully delete Pod

@dashpole

Copy link
Copy Markdown
Contributor Author

I have been unable to replicate this on head, so it must be a problem introduced by this PR

@dashpole

Copy link
Copy Markdown
Contributor Author

Mystery solved!

k8s-github-robot pushed a commit that referenced this pull request Feb 15, 2017
Automatic merge from submit-queue

Fix bug in status manager TerminatePod

In TerminatePod, we previously pass pod.Status to updateStatusInternal.  This is a bug, since it is the original status that we are given.  Not only does it skip updates made to container statuses, but in some cases it reverted the pod's status to an earlier version, since it was being passed a stale status initially.

This was the case in #40239 and #41095.  As shown in #40239, the pod's status is set to running after it is set to failed, occasionally causing very long delays in pod deletion since we have to wait for this to be corrected.

This PR fixes the bug, adds some helpful debugging statements, and adds a unit test for TerminatePod (which for some reason didnt exist before?).

@kubernetes/sig-node-bugs @vish @Random-Liu
k8s-github-robot pushed a commit that referenced this pull request Feb 16, 2017
Automatic merge from submit-queue (batch tested with PRs 41466, 41456, 41550, 41238, 41416)

Delay Deletion of a Pod until volumes are cleaned up

#41436 fixed the bug that caused #41095 and #40239 to have to be reverted.  Now that the bug is fixed, this shouldn't cause problems.

 @vishh @derekwaynecarr @sjenning @jingxu97 @kubernetes/sig-storage-misc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants