[Kubelet] Delay deletion of pod from the API server until volumes are deleted#41095
Conversation
|
I also just ran all node-e2e tests on this PR on all images that we test on, and it passed all tests without any flakes. |
|
And just for reference, this is what the profile looked like before this change. The difference is that previously volume teardown (light grey) occurred after deletion finished (purple). Now volume teardown happens immediately after the containers are terminated. note: the grey line should be completely after the purple line in this graph. This is due to one of the pods having more than 1 volume. #40934 |
|
@dashpole: The following test(s) failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
@k8s-bot gce etcd3 e2e test this |
|
Can you squash the last two commits as it will make reverts easy via git? |
|
/lgtm |
|
squash the commits and I will approve it |
|
wanted to make it easier to review, as the last two commits are the only changes from the previous PR |
c130ecd to
285706b
Compare
|
/LGTM |
285706b to
67cb270
Compare
|
[APPROVALNOTIFIER] This PR is APPROVED The following people have approved this PR: dashpole, vishh Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
|
@vishh rebasing removed the lgtm... |
|
Automatic merge from submit-queue |
|
@dashpole I'm not sure if this is the culprit, but since this PR merged, 5 out of the 6 runs of https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet/ have failed waiting for a pod to be deleted. WDYT? |
|
Or is that what #41157 is trying to address? |
|
Yes, that is what #41157 is trying to address. |
|
@ncdc with regards to the node-kubelet tests, I am surprised that it failed so often on those runs compared with the node-e2e testgrid runs, which have only flaked occasionally... |
Automatic merge from submit-queue Fix bug in status manager TerminatePod In TerminatePod, we previously pass pod.Status to updateStatusInternal. This is a bug, since it is the original status that we are given. Not only does it skip updates made to container statuses, but in some cases it reverted the pod's status to an earlier version, since it was being passed a stale status initially. This was the case in #40239 and #41095. As shown in #40239, the pod's status is set to running after it is set to failed, occasionally causing very long delays in pod deletion since we have to wait for this to be corrected. This PR fixes the bug, adds some helpful debugging statements, and adds a unit test for TerminatePod (which for some reason didnt exist before?). @kubernetes/sig-node-bugs @vish @Random-Liu
Automatic merge from submit-queue (batch tested with PRs 41466, 41456, 41550, 41238, 41416) Delay Deletion of a Pod until volumes are cleaned up #41436 fixed the bug that caused #41095 and #40239 to have to be reverted. Now that the bug is fixed, this shouldn't cause problems. @vishh @derekwaynecarr @sjenning @jingxu97 @kubernetes/sig-storage-misc


Previous PR that was reverted: #40239.
To summarize the conclusion of the previous PR after reverting:
cc: @vishh @derekwaynecarr @sjenning @jingxu97 @kubernetes/sig-storage-misc