Skip to content

Kubelet reports Ready=true for pod and containers conditions during kubelet level evictions #108594

@bobbypage

Description

@bobbypage

What happened?

In 1.22+, when kubelet evicts a pod, it updates the pod status to api-server marking the pod phase as failed / succeeded. However, the kubelet continues to report Ready=true pod condition until the all of the containers are fully terminated by the container runtime.

As a result, since the pod is still considered ready, external controllers (e.g. the endpoint controller), utilize the pod ready condition and will continue to send traffic to these pods marked in terminal phase. These pods are often not ready to serve traffic since they are in progress of being shutdown.

This is regression from prior to 1.22, since prior to 1.22, kubelet did not send any pod/container conditions to the api-server for evicted pods.

xref: #105358 (comment) for difference between kubelet status for evicted pods before and after 1.22

One of the areas where this issue manifests is with Kubelet graceful shutdown. In 1.22+, if a pod takes a long time to shutdown, during the entire shutdown process the kubelet will report Ready=true condition, and traffic will be sent to the pod.

Additionally, If the node is shutdown before the pod fully terminates, the pod will be stuck in this state where Phase=Failed, but Ready=true. As a result, the pod will be seen incorrectly as ready by services and will be sent traffic (which the pod will likely be unable to handle due to either the pod being in progress of terminating or the node already being shutdown).

I suspect this is issue not only for graceful shutdown, but for any type of kubelet initiated eviction.

What did you expect to happen?

Kubelet should maintain the same behavior as prior to 1.22. Kubelet should report Ready=false condition at the same time as marking Phase=Failed for evicted pods. Having pods in terminal phases (i.e. Failed / Succeeded), but Ready=true should not be a possible status update from kubelet to api-server.

How can we reproduce it (as minimally and precisely as possible)?

Create a 1.22+ cluster with the following kubelet configuration

shutdownGracePeriod: 120s
shutdownGracePeriodCriticalPods: 10s

Create the following pod (pod blocks SIGTERM and has 60s termination grace period)

curl -s https://gist.githubusercontent.com/bobbypage/5ef251d4d2224983472bde01c7620f6f/raw/term_pod.yaml | kubectl apply -f -

Trigger a graceful shutdown (either by actually shutting down machine or by sending message to dbus)

# on the node where term_pod is scheduled
sudo dbus-send --system /org/freedesktop/login1 org.freedesktop.login1.Manager.PrepareForShutdown boolean:true

Observe that the pod reports phase=Failed, but Ready=true for the entire 60s duration of shutdown

$ kubectl get pods test-term -o yaml | gh gist create
- Creating gist...
✓ Created gist
https://gist.github.com/3f75d5677e23d5a90cb5bff3bb4e2caa
# status reports Ready=true, but phase=Failed
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-03-08T19:23:21Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-03-08T19:23:23Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-03-08T19:23:23Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-03-08T19:23:21Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://2ceaa48f684dd1550bf67204a36caa6f5585e996881d5d0e232a1badd23c7cf5
    image: k8s.gcr.io/debian-base:v2.0.0
    imageID: k8s.gcr.io/debian-base@sha256:ebda8587ec0f49eb88ee3a608ef018484908cbc5aa32556a0d78356088c185d4
    lastState: {}
    name: main
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-03-08T19:23:23Z"
...
message: Pod was terminated in response to imminent node shutdown.
  phase: Failed
  reason: Terminated

Only after 60s (pod fully terminated), is condition Ready=false reported

# pod dump ~60s later
$ kubectl get pods test-term -o yaml | gh gist create
- Creating gist...
✓ Created gist
https://gist.github.com/bb16d400754ce885c263de4b8482f101

See the kubelet logs (full logs: https://gist.github.com/728df3b9c53ff3116bb4c2badbc1b34f)

# node shutdown kills the pod

Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.410550     202 nodeshutdown_manager_linux.go:316] "Shutdown manager killing pod with gracePeriod" pod="default/test-term" gracePeriod=60
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.410710     202 pod_workers.go:638] "Pod is being removed by the kubelet, begin teardown" pod="default/test-term" podUID=94f28df9-3978-4eb8-b88f-7f3b5a0aab56
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.410738     202 pod_workers.go:888] "Processing pod event" pod="default/test-term" podUID=94f28df9-3978-4eb8-b88f-7f3b5a0aab56 updateType=1
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.410919     202 pod_workers.go:996] "Pod worker has observed request to terminate" pod="default/test-term" podUID=94f28df9-3978-4eb8-b88f-7f3b5a0aab56
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.410984     202 kubelet.go:1719] "syncTerminatingPod enter" pod="default/test-term" podUID=94f28df9-3978-4eb8-b88f-7f3b5a0aab56
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.411011     202 kubelet_pods.go:1429] "Generating pod status" pod="default/test-term"
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.411108     202 kubelet_pods.go:1441] "Got phase for pod" pod="default/test-term" oldPhase=Running phase=Running
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.411224     202 kubelet.go:1749] "Pod terminating with grace period" pod="default/test-term" podUID=94f28df9-3978-4eb8-b88f-7f3b5a0aab56 gracePeriod=60
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.411337     202 event.go:294] "Event occurred" object="default/test-term" fieldPath="spec.containers{main}" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container main"
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.411302     202 kuberuntime_container.go:718] "Killing container with a grace period override" pod="default/test-term" podUID=94f28df9-3978-4eb8-b88f-7f3b5a0aab56 containerName="main" containerID="containerd://2ceaa48f684dd1550bf67204a36caa6f5585e996881d5d0e232a1badd23c7cf5" gracePeriod=60
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.411391     202 kuberuntime_container.go:722] "Killing container with a grace period" pod="default/test-term" podUID=94f28df9-3978-4eb8-b88f-7f3b5a0aab56 containerName="main" containerID="containerd://2ceaa48f684dd1550bf67204a36caa6f5585e996881d5d0e232a1badd23c7cf5" gracePeriod=60

# pod updated with Phase=Failed, Ready=true

Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.451647     202 status_manager.go:633] "Status for pod updated successfully" pod="default/test-term" statusVersion=3 status={Phase:Failed Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:23:21 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:23:23 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:23:23 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:23:21 +0000 UTC Reason: Message:}] Message:Pod was terminated in response to imminent node shutdown. Reason:Terminated NominatedNodeName: HostIP:192.168.10.2 PodIP:10.244.1.2 PodIPs:[{IP:10.244.1.2}] StartTime:2022-03-08 19:23:21 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:main State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2022-03-08 19:23:23 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:k8s.gcr.io/debian-base:v2.0.0 ImageID:k8s.gcr.io/debian-base@sha256:ebda8587ec0f49eb88ee3a608ef018484908cbc5aa32556a0d78356088c185d4 ContainerID:containerd://2ceaa48f684dd1550bf67204a36caa6f5585e996881d5d0e232a1badd23c7cf5 Started:0xc0004fe900}] QOSClass:BestEffort EphemeralContainerStatuses:[]}

# 60 seconds later, pod is updated with Phase=Failed, Ready=false

Mar 08 19:25:48 kind-worker kubelet[202]: I0308 19:25:48.274488     202 status_manager.go:633] "Status for pod updated successfully" pod="default/test-term" statusVersion=4 status={Phase:Failed Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:23:21 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:25:48 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [main]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:25:48 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [main]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:23:21 +0000 UTC Reason: Message:}] Message:Pod was terminated in response to imminent node shutdown. Reason:Terminated NominatedNodeName: HostIP:192.168.10.2 PodIP:10.244.1.2 PodIPs:[{IP:10.244.1.2}] StartTime:2022-03-08 19:23:21 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:main State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:137,Signal:0,Reason:Error,Message:,StartedAt:2022-03-08 19:23:23 +0000 UTC,FinishedAt:2022-03-08 19:25:47 +0000 UTC,ContainerID:containerd://2ceaa48f684dd1550bf67204a36caa6f5585e996881d5d0e232a1badd23c7cf5,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:k8s.gcr.io/debian-base:v2.0.0 ImageID:k8s.gcr.io/debian-base@sha256:ebda8587ec0f49eb88ee3a608ef018484908cbc5aa32556a0d78356088c185d4 ContainerID:containerd://2ceaa48f684dd1550bf67204a36caa6f5585e996881d5d0e232a1badd23c7cf5 Started:0xc001820c19}] QOSClass:BestEffort EphemeralContainerStatuses:[]}

Anything else we need to know?

No response

Kubernetes version

Details
$ kubectl version
# paste output here

Cloud provider

Details n/a

OS version

Details
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Details

Container runtime (CRI) and version (if applicable)

Details

Related plugins (CNI, CSI, ...) and versions (if applicable)

Details

Metadata

Metadata

Assignees

Labels

kind/bugCategorizes issue or PR as related to a bug.priority/important-soonMust be staffed and worked on either currently, or very soon, ideally in time for the next release.sig/nodeCategorizes an issue or PR as relevant to SIG Node.triage/acceptedIndicates an issue or PR is ready to be actively worked on.

Type

No type
No fields configured for issues without a type.

Projects

Status
Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions