Fix bug in status manager TerminatePod by dashpole · Pull Request #41436 · kubernetes/kubernetes

dashpole · 2017-02-14T20:13:12Z

In TerminatePod, we previously passed pod.Status to updateStatusInternal. This is a bug, since pod.Status is the original status we were given, not the copy we modified. Not only did it skip the updates made to container statuses, but in some cases it reverted the pod's status to an earlier version, since the status passed in was already stale.
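The failure mode described above can be modeled in a few lines of Go. This is only a sketch under stated assumptions: the types, the `manager` struct, and the `terminatePodBuggy` / `terminatePodFixed` names are simplified stand-ins invented for illustration, not the kubelet's actual code, and it assumes the manager prefers its cached status when one exists.

```go
package main

import "fmt"

// All types below are simplified stand-ins, not the real kubelet types.
type ContainerStatus struct {
	Name       string
	Terminated bool
}

type PodStatus struct {
	Phase      string
	Containers []ContainerStatus
}

type Pod struct {
	UID    string
	Status PodStatus // may be stale relative to the manager's cache
}

// manager is a hypothetical, lock-free miniature of a status manager.
type manager struct {
	cached map[string]PodStatus
}

// copyStatus deep-copies a status so edits do not leak into the source.
func copyStatus(s PodStatus) PodStatus {
	out := PodStatus{Phase: s.Phase}
	out.Containers = append([]ContainerStatus(nil), s.Containers...)
	return out
}

// terminated returns a copy of the freshest known status with every
// container marked terminated.
func (m *manager) terminated(pod *Pod) PodStatus {
	old := pod.Status
	if c, ok := m.cached[pod.UID]; ok {
		old = c // assumption: the cached status wins over the pod's own
	}
	status := copyStatus(old)
	for i := range status.Containers {
		status.Containers[i].Terminated = true
	}
	return status
}

// terminatePodBuggy computes the modified copy but stores the original.
func (m *manager) terminatePodBuggy(pod *Pod) {
	_ = m.terminated(pod)
	m.cached[pod.UID] = copyStatus(pod.Status)
}

// terminatePodFixed stores the copy that was actually modified.
func (m *manager) terminatePodFixed(pod *Pod) {
	m.cached[pod.UID] = m.terminated(pod)
}

// scenario builds a manager that has already recorded Failed and a pod
// object that still carries an older Running status.
func scenario() (*manager, *Pod) {
	pod := &Pod{UID: "uid-1", Status: PodStatus{
		Phase:      "Running",
		Containers: []ContainerStatus{{Name: "c"}},
	}}
	m := &manager{cached: map[string]PodStatus{
		"uid-1": {Phase: "Failed", Containers: []ContainerStatus{{Name: "c"}}},
	}}
	return m, pod
}

func main() {
	m, pod := scenario()
	m.terminatePodBuggy(pod)
	fmt.Printf("buggy: %+v\n", m.cached[pod.UID])

	m, pod = scenario()
	m.terminatePodFixed(pod)
	fmt.Printf("fixed: %+v\n", m.cached[pod.UID])
}
```

In this model the buggy path leaves the cache at Running with an unterminated container, while the fixed path keeps Failed and marks the container terminated.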

This was the case in #40239 and #41095. As shown in #40239, the pod's status is set to running after it is set to failed, occasionally causing very long delays in pod deletion since we have to wait for this to be corrected.

This PR fixes the bug, adds some helpful debugging statements, and adds a unit test for TerminatePod (which for some reason didn't exist before?).

@kubernetes/sig-node-bugs @vish @Random-Liu

Random-Liu · 2017-02-14T20:17:35Z

/cc @yujuhong

Random-Liu · 2017-02-14T20:18:47Z

 	go wait.Forever(func() {
 		select {
 		case syncRequest := <-m.podStatusChannel:
+			glog.V(10).Infof("Status Manager: syncing pod: %v, with status: (%v, %v) from podStatusChannel",


V(10) seems a little bit too much, V(5) should be fine, I think.
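For context on the verbosity comment: a V(n)-style guard only emits a message when the configured verbosity is at least n, so a V(10) line stays invisible unless the process runs at a very high verbosity. The sketch below imitates that pattern using only the standard library; `Verbose`, `V`, and the hard-coded `verbosity` variable are illustrative stand-ins and not the glog implementation.

```go
package main

import (
	"fmt"
	"os"
)

// verbosity stands in for the value of a -v flag; it is hard-coded here.
var verbosity = 5

// Verbose imitates the shape of a V()-style guard: logging calls on a
// false value are dropped.
type Verbose bool

// V reports whether messages at the given level should be logged.
func V(level int) Verbose { return Verbose(level <= verbosity) }

// Infof prints the message only when the guard is true and reports
// whether anything was written.
func (v Verbose) Infof(format string, args ...interface{}) bool {
	if !v {
		return false
	}
	fmt.Fprintf(os.Stderr, format+"\n", args...)
	return true
}

func main() {
	V(5).Infof("logged at verbosity 5: syncing pod %q", "uid-1")
	V(10).Infof("dropped at verbosity 5: syncing pod %q", "uid-1")
}
```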

Random-Liu · 2017-02-14T20:19:23Z


 	select {
 	case m.podStatusChannel <- podStatusSyncRequest{pod.UID, newStatus}:
+		glog.V(10).Infof("Status Manager: adding pod: %v, with status: (%v, %v) to podStatusChannel",


We usually use %q for string.
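A small illustration of why %q tends to be preferred for string values in log lines: it quotes the value, so empty strings and trailing whitespace remain visible. The `UID` type and `describe` helper here are hypothetical stand-ins for a string-based identifier, not the real types.UID.

```go
package main

import "fmt"

// UID is a stand-in for a string-based identifier type.
type UID string

// describe formats the same value with %v and with %q for comparison.
func describe(uid UID) (plain, quoted string) {
	plain = fmt.Sprintf("pod UID: %v", uid)
	quoted = fmt.Sprintf("pod UID: %q", uid)
	return plain, quoted
}

func main() {
	for _, uid := range []UID{"abc-123", "", "abc-123 "} {
		plain, quoted := describe(uid)
		fmt.Println(plain)
		fmt.Println(quoted)
	}
}
```

With an empty UID, the %v form ends right after the colon and space, while the %q form prints `pod UID: ""`, which is much easier to spot in a log.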

Random-Liu · 2017-02-14T20:19:28Z

 	}()

 	for _, update := range updatedStatuses {
+		glog.V(10).Infof("Status Manager: syncPod in syncbatch. pod UID: %v", update.podUID)


We usually use %q for string.

Random-Liu · 2017-02-14T21:08:18Z

+	status := expectPodStatus(t, syncer, testPod)
+	for i := range status.ContainerStatuses {
+		if status.ContainerStatuses[i].State.Terminated == nil {
+			t.Errorf("expected containers to be terminated")


Use assert.

Random-Liu · 2017-02-14T21:08:27Z

+	}
+	for i := range status.InitContainerStatuses {
+		if status.InitContainerStatuses[i].State.Terminated == nil {
+			t.Errorf("expected init containers to be terminated")


Random-Liu · 2017-02-14T21:08:37Z

+			t.Errorf("expected init containers to be terminated")
+		}
+	}
+	if status.Phase != v1.PodFailed {


Random-Liu · 2017-02-14T21:09:23Z


+func TestTerminatePod(t *testing.T) {
+	syncer := newTestManager(&fake.Clientset{})
+	testPod := getTestPod()


Could you set a fake status here with Running phase, and running container? And add some comment about why we add this test.
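One plausible reason for asking for a running fixture: a loop over container statuses never executes on an empty status, so the "expected containers to be terminated" checks would pass without exercising anything. The sketch below shows that effect with simplified stand-in types; `runningFixture` and `unterminated` are hypothetical helpers, not part of the actual test file.

```go
package main

import "fmt"

// Simplified stand-ins for the pod status types; not the real API.
type ContainerStatus struct {
	Name       string
	Terminated bool
}

type PodStatus struct {
	Phase      string
	Containers []ContainerStatus
}

// runningFixture returns a status in the Running phase with one
// container that has not terminated.
func runningFixture() PodStatus {
	return PodStatus{
		Phase:      "Running",
		Containers: []ContainerStatus{{Name: "app"}},
	}
}

// unterminated lists the containers that are not yet marked terminated.
func unterminated(s PodStatus) []string {
	var names []string
	for _, c := range s.Containers {
		if !c.Terminated {
			names = append(names, c.Name)
		}
	}
	return names
}

func main() {
	// An empty status passes the check vacuously.
	fmt.Println("empty status, unterminated:", len(unterminated(PodStatus{})))

	// The running fixture fails the check until something terminates it.
	s := runningFixture()
	fmt.Println("running fixture, unterminated:", len(unterminated(s)))

	for i := range s.Containers {
		s.Containers[i].Terminated = true
	}
	fmt.Println("after termination, unterminated:", len(unterminated(s)))
}
```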

Random-Liu · 2017-02-14T22:17:24Z

/lgtm

vishh · 2017-02-14T22:40:15Z

/approved

vishh · 2017-02-14T22:40:31Z

Good catch @dashpole

dashpole · 2017-02-14T23:48:01Z

Can someone add the approved label? Not sure why @vishh's comment didn't trigger it...

yujuhong · 2017-02-14T23:50:01Z

/lgtm
/approve

Thanks for the fix!

k8s-github-robot · 2017-02-14T23:51:33Z

[APPROVALNOTIFIER] This PR is APPROVED

The following people have approved this PR: dashpole, yujuhong

Needs approval from an approver in each of these OWNERS Files:

~~pkg/kubelet/OWNERS~~ [yujuhong]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

k8s-github-robot · 2017-02-15T05:03:31Z

Automatic merge from submit-queue

smarterclayton · 2017-02-15T20:04:54Z

Does this need backporting to 1.5? Violating the phase order is a very large problem; it wasn't clear whether this was introduced in 1.6 or earlier.

dashpole · 2017-02-15T20:48:15Z

Git history says this was introduced by #22155, which was merged before the 1.2 release. However, since a terminated pod is almost always deleted immediately, I have not been able to replicate the issues that #40239 encountered. Since #40239 delays deletion of pods in most cases, this bug manifested as a long delay in deletion with very high frequency.
@smarterclayton I would recommend cherrypicking this to 1.5. Thanks for bringing that up.

vishh · 2017-02-15T20:49:58Z

v1.4 branch is also getting new patch releases. We should cherrypick into that as well.

dashpole · 2017-02-16T16:48:29Z

can someone add the cherrypick candidate label?

k8s-cherrypick-bot · 2017-02-16T16:55:48Z

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

@vishh

Automatic merge from submit-queue (batch tested with PRs 41466, 41456, 41550, 41238, 41416)

Delay Deletion of a Pod until volumes are cleaned up

#41436 fixed the bug that caused #41095 and #40239 to have to be reverted. Now that the bug is fixed, this shouldn't cause problems.

@vishh @derekwaynecarr @sjenning @jingxu97 @kubernetes/sig-storage-misc

k8s-cherrypick-bot · 2017-04-18T21:00:20Z

Commit found in the "release-1.5" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 14, 2017

k8s-github-robot assigned yujuhong Feb 14, 2017

k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Feb 14, 2017

Random-Liu added release-note-none Denotes a PR that doesn't merit a release note. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed release-note-label-needed labels Feb 14, 2017

Random-Liu self-assigned this Feb 14, 2017

Random-Liu reviewed Feb 14, 2017

use the status we modify, not original

c612e09

dashpole force-pushed the status_bug branch from eee036b to c612e09 Compare February 14, 2017 21:38

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2017

k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2017

k8s-github-robot merged commit a57967f into kubernetes:master Feb 15, 2017

dashpole deleted the status_bug branch February 15, 2017 05:03

dashpole mentioned this pull request Feb 15, 2017

Delay Deletion of a Pod until volumes are cleaned up #41456

Merged

yujuhong added the cherrypick-candidate label Feb 16, 2017

k8s-cherrypick-bot removed the cherrypick-candidate label Feb 16, 2017

yujuhong added this to the v1.5 milestone Feb 16, 2017

yujuhong added the cherrypick-candidate label Feb 16, 2017

derekwaynecarr mentioned this pull request Feb 16, 2017

UPSTREAM: 41436: Fix bug in status manager TerminatePod openshift/origin#12994

Closed

This was referenced Mar 13, 2017

UPSTREAM: 41436: Fix bug in status manager TerminatePod openshift/origin#13377

Merged

UPSTREAM: 41436: Fix bug in status manager TerminatePod openshift/origin#13378

Merged

k8s-cherrypick-bot removed the cherrypick-candidate label Apr 18, 2017

mYmNeo mentioned this pull request Nov 13, 2017

PodWorker drops some important update event, cause pod can't be deleted #52641

Closed
