Fix bug in status manager TerminatePod#41436
Conversation
|
/cc @yujuhong |
| go wait.Forever(func() { | ||
| select { | ||
| case syncRequest := <-m.podStatusChannel: | ||
| glog.V(10).Infof("Status Manager: syncing pod: %v, with status: (%v, %v) from podStatusChannel", |
There was a problem hiding this comment.
V(10) seems a little bit too much, V(5) should be fine, I think.
|
|
||
| select { | ||
| case m.podStatusChannel <- podStatusSyncRequest{pod.UID, newStatus}: | ||
| glog.V(10).Infof("Status Manager: adding pod: %v, with status: (%v, %v) to podStatusChannel", |
There was a problem hiding this comment.
We usually use %q for string.
| }() | ||
|
|
||
| for _, update := range updatedStatuses { | ||
| glog.V(10).Infof("Status Manager: syncPod in syncbatch. pod UID: %v", update.podUID) |
There was a problem hiding this comment.
We usually use %q for string.
| status := expectPodStatus(t, syncer, testPod) | ||
| for i := range status.ContainerStatuses { | ||
| if status.ContainerStatuses[i].State.Terminated == nil { | ||
| t.Errorf("expected containers to be terminated") |
| } | ||
| for i := range status.InitContainerStatuses { | ||
| if status.InitContainerStatuses[i].State.Terminated == nil { | ||
| t.Errorf("expected init containers to be terminated") |
| t.Errorf("expected init containers to be terminated") | ||
| } | ||
| } | ||
| if status.Phase != v1.PodFailed { |
|
|
||
| func TestTerminatePod(t *testing.T) { | ||
| syncer := newTestManager(&fake.Clientset{}) | ||
| testPod := getTestPod() |
There was a problem hiding this comment.
Could you set a fake status here with Running phase, and running container? And add some comment about why we add this test.
|
/lgtm |
|
/approved |
|
Good catch @dashpole |
|
can someone add the approved label? Not sure why @vishh 's comment didnt trigger it... |
|
/lgtm Thanks for the fix! |
|
[APPROVALNOTIFIER] This PR is APPROVED The following people have approved this PR: dashpole, yujuhong Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
|
Automatic merge from submit-queue |
|
Does this need backporting to 1.5? Violating the phase order is a very large problem, wasn't clear if this was introduced in 1.6 or earlier. |
|
Git history says this was introduced by #22155, which was merged before the 1.2 release. However, since a terminated pod is almost always deleted immediately, I have not been able to replicate the issues that #40239 encountered. Since #40239 delays deletion of pods in most cases, this bug manifested as a long delay in deletion with very high frequency. |
|
v1.4 branch is also getting new patch releases. We should cherrypick into
that as well.
…On Wed, Feb 15, 2017 at 12:48 PM, David Ashpole ***@***.***> wrote:
Git history says this was introduced by #22155
<#22155>, which was merged
before the 1.2 release. However, since a terminated pod is almost always
deleted immediately, I have not been able to replicate the issues that
#40239 <#40239> encountered.
Since #40239 <#40239> delays
deletion of pods in most cases, this bug manifested as a long delay in
deletion with very high frequency.
@smarterclayton <https://github.com/smarterclayton> I would recommend
cherrypicking this to 1.5. Thanks for bringing that up.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#41436 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AGvIKBnwSzr1zJy9375HNwyeM083jctDks5rc2SugaJpZM4MA8Kt>
.
|
|
can someone add the cherrypick candidate label? |
|
Removing label |
Automatic merge from submit-queue (batch tested with PRs 41466, 41456, 41550, 41238, 41416) Delay Deletion of a Pod until volumes are cleaned up #41436 fixed the bug that caused #41095 and #40239 to have to be reverted. Now that the bug is fixed, this shouldn't cause problems. @vishh @derekwaynecarr @sjenning @jingxu97 @kubernetes/sig-storage-misc
|
Commit found in the "release-1.5" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked. |
In TerminatePod, we previously pass pod.Status to updateStatusInternal. This is a bug, since it is the original status that we are given. Not only does it skip updates made to container statuses, but in some cases it reverted the pod's status to an earlier version, since it was being passed a stale status initially.
This was the case in #40239 and #41095. As shown in #40239, the pod's status is set to running after it is set to failed, occasionally causing very long delays in pod deletion since we have to wait for this to be corrected.
This PR fixes the bug, adds some helpful debugging statements, and adds a unit test for TerminatePod (which for some reason didnt exist before?).
@kubernetes/sig-node-bugs @vish @Random-Liu