What happened?
In 1.22+, when kubelet evicts a pod, it updates the pod status to api-server marking the pod phase as failed / succeeded. However, the kubelet continues to report Ready=true pod condition until the all of the containers are fully terminated by the container runtime.
As a result, since the pod is still considered ready, external controllers (e.g. the endpoint controller), utilize the pod ready condition and will continue to send traffic to these pods marked in terminal phase. These pods are often not ready to serve traffic since they are in progress of being shutdown.
This is regression from prior to 1.22, since prior to 1.22, kubelet did not send any pod/container conditions to the api-server for evicted pods.
xref: #105358 (comment) for difference between kubelet status for evicted pods before and after 1.22
One of the areas where this issue manifests is with Kubelet graceful shutdown. In 1.22+, if a pod takes a long time to shutdown, during the entire shutdown process the kubelet will report Ready=true condition, and traffic will be sent to the pod.
Additionally, If the node is shutdown before the pod fully terminates, the pod will be stuck in this state where Phase=Failed, but Ready=true. As a result, the pod will be seen incorrectly as ready by services and will be sent traffic (which the pod will likely be unable to handle due to either the pod being in progress of terminating or the node already being shutdown).
I suspect this is issue not only for graceful shutdown, but for any type of kubelet initiated eviction.
What did you expect to happen?
Kubelet should maintain the same behavior as prior to 1.22. Kubelet should report Ready=false condition at the same time as marking Phase=Failed for evicted pods. Having pods in terminal phases (i.e. Failed / Succeeded), but Ready=true should not be a possible status update from kubelet to api-server.
How can we reproduce it (as minimally and precisely as possible)?
Create a 1.22+ cluster with the following kubelet configuration
shutdownGracePeriod: 120s
shutdownGracePeriodCriticalPods: 10s
Create the following pod (pod blocks SIGTERM and has 60s termination grace period)
curl -s https://gist.githubusercontent.com/bobbypage/5ef251d4d2224983472bde01c7620f6f/raw/term_pod.yaml | kubectl apply -f -
Trigger a graceful shutdown (either by actually shutting down machine or by sending message to dbus)
# on the node where term_pod is scheduled
sudo dbus-send --system /org/freedesktop/login1 org.freedesktop.login1.Manager.PrepareForShutdown boolean:true
Observe that the pod reports phase=Failed, but Ready=true for the entire 60s duration of shutdown
$ kubectl get pods test-term -o yaml | gh gist create
- Creating gist...
✓ Created gist
https://gist.github.com/3f75d5677e23d5a90cb5bff3bb4e2caa
# status reports Ready=true, but phase=Failed
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2022-03-08T19:23:21Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2022-03-08T19:23:23Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2022-03-08T19:23:23Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2022-03-08T19:23:21Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://2ceaa48f684dd1550bf67204a36caa6f5585e996881d5d0e232a1badd23c7cf5
image: k8s.gcr.io/debian-base:v2.0.0
imageID: k8s.gcr.io/debian-base@sha256:ebda8587ec0f49eb88ee3a608ef018484908cbc5aa32556a0d78356088c185d4
lastState: {}
name: main
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2022-03-08T19:23:23Z"
...
message: Pod was terminated in response to imminent node shutdown.
phase: Failed
reason: Terminated
Only after 60s (pod fully terminated), is condition Ready=false reported
# pod dump ~60s later
$ kubectl get pods test-term -o yaml | gh gist create
- Creating gist...
✓ Created gist
https://gist.github.com/bb16d400754ce885c263de4b8482f101
See the kubelet logs (full logs: https://gist.github.com/728df3b9c53ff3116bb4c2badbc1b34f)
# node shutdown kills the pod
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.410550 202 nodeshutdown_manager_linux.go:316] "Shutdown manager killing pod with gracePeriod" pod="default/test-term" gracePeriod=60
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.410710 202 pod_workers.go:638] "Pod is being removed by the kubelet, begin teardown" pod="default/test-term" podUID=94f28df9-3978-4eb8-b88f-7f3b5a0aab56
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.410738 202 pod_workers.go:888] "Processing pod event" pod="default/test-term" podUID=94f28df9-3978-4eb8-b88f-7f3b5a0aab56 updateType=1
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.410919 202 pod_workers.go:996] "Pod worker has observed request to terminate" pod="default/test-term" podUID=94f28df9-3978-4eb8-b88f-7f3b5a0aab56
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.410984 202 kubelet.go:1719] "syncTerminatingPod enter" pod="default/test-term" podUID=94f28df9-3978-4eb8-b88f-7f3b5a0aab56
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.411011 202 kubelet_pods.go:1429] "Generating pod status" pod="default/test-term"
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.411108 202 kubelet_pods.go:1441] "Got phase for pod" pod="default/test-term" oldPhase=Running phase=Running
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.411224 202 kubelet.go:1749] "Pod terminating with grace period" pod="default/test-term" podUID=94f28df9-3978-4eb8-b88f-7f3b5a0aab56 gracePeriod=60
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.411337 202 event.go:294] "Event occurred" object="default/test-term" fieldPath="spec.containers{main}" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container main"
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.411302 202 kuberuntime_container.go:718] "Killing container with a grace period override" pod="default/test-term" podUID=94f28df9-3978-4eb8-b88f-7f3b5a0aab56 containerName="main" containerID="containerd://2ceaa48f684dd1550bf67204a36caa6f5585e996881d5d0e232a1badd23c7cf5" gracePeriod=60
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.411391 202 kuberuntime_container.go:722] "Killing container with a grace period" pod="default/test-term" podUID=94f28df9-3978-4eb8-b88f-7f3b5a0aab56 containerName="main" containerID="containerd://2ceaa48f684dd1550bf67204a36caa6f5585e996881d5d0e232a1badd23c7cf5" gracePeriod=60
# pod updated with Phase=Failed, Ready=true
Mar 08 19:24:47 kind-worker kubelet[202]: I0308 19:24:47.451647 202 status_manager.go:633] "Status for pod updated successfully" pod="default/test-term" statusVersion=3 status={Phase:Failed Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:23:21 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:23:23 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:23:23 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:23:21 +0000 UTC Reason: Message:}] Message:Pod was terminated in response to imminent node shutdown. Reason:Terminated NominatedNodeName: HostIP:192.168.10.2 PodIP:10.244.1.2 PodIPs:[{IP:10.244.1.2}] StartTime:2022-03-08 19:23:21 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:main State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2022-03-08 19:23:23 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:k8s.gcr.io/debian-base:v2.0.0 ImageID:k8s.gcr.io/debian-base@sha256:ebda8587ec0f49eb88ee3a608ef018484908cbc5aa32556a0d78356088c185d4 ContainerID:containerd://2ceaa48f684dd1550bf67204a36caa6f5585e996881d5d0e232a1badd23c7cf5 Started:0xc0004fe900}] QOSClass:BestEffort EphemeralContainerStatuses:[]}
# 60 seconds later, pod is updated with Phase=Failed, Ready=false
Mar 08 19:25:48 kind-worker kubelet[202]: I0308 19:25:48.274488 202 status_manager.go:633] "Status for pod updated successfully" pod="default/test-term" statusVersion=4 status={Phase:Failed Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:23:21 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:25:48 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [main]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:25:48 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [main]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2022-03-08 19:23:21 +0000 UTC Reason: Message:}] Message:Pod was terminated in response to imminent node shutdown. Reason:Terminated NominatedNodeName: HostIP:192.168.10.2 PodIP:10.244.1.2 PodIPs:[{IP:10.244.1.2}] StartTime:2022-03-08 19:23:21 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:main State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:137,Signal:0,Reason:Error,Message:,StartedAt:2022-03-08 19:23:23 +0000 UTC,FinishedAt:2022-03-08 19:25:47 +0000 UTC,ContainerID:containerd://2ceaa48f684dd1550bf67204a36caa6f5585e996881d5d0e232a1badd23c7cf5,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:k8s.gcr.io/debian-base:v2.0.0 ImageID:k8s.gcr.io/debian-base@sha256:ebda8587ec0f49eb88ee3a608ef018484908cbc5aa32556a0d78356088c185d4 ContainerID:containerd://2ceaa48f684dd1550bf67204a36caa6f5585e996881d5d0e232a1badd23c7cf5 Started:0xc001820c19}] QOSClass:BestEffort EphemeralContainerStatuses:[]}
Anything else we need to know?
No response
Kubernetes version
Details
$ kubectl version
# paste output here
Cloud provider
Details
n/a
OS version
Details
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
Install tools
Details
Container runtime (CRI) and version (if applicable)
Details
Related plugins (CNI, CSI, ...) and versions (if applicable)
Details
What happened?
In 1.22+, when kubelet evicts a pod, it updates the pod status to api-server marking the pod phase as failed / succeeded. However, the kubelet continues to report Ready=true pod condition until the all of the containers are fully terminated by the container runtime.
As a result, since the pod is still considered ready, external controllers (e.g. the endpoint controller), utilize the pod ready condition and will continue to send traffic to these pods marked in terminal phase. These pods are often not ready to serve traffic since they are in progress of being shutdown.
This is regression from prior to 1.22, since prior to 1.22, kubelet did not send any pod/container conditions to the api-server for evicted pods.
xref: #105358 (comment) for difference between kubelet status for evicted pods before and after 1.22
One of the areas where this issue manifests is with Kubelet graceful shutdown. In 1.22+, if a pod takes a long time to shutdown, during the entire shutdown process the kubelet will report Ready=true condition, and traffic will be sent to the pod.
Additionally, If the node is shutdown before the pod fully terminates, the pod will be stuck in this state where Phase=Failed, but Ready=true. As a result, the pod will be seen incorrectly as ready by services and will be sent traffic (which the pod will likely be unable to handle due to either the pod being in progress of terminating or the node already being shutdown).
I suspect this is issue not only for graceful shutdown, but for any type of kubelet initiated eviction.
What did you expect to happen?
Kubelet should maintain the same behavior as prior to 1.22. Kubelet should report Ready=false condition at the same time as marking Phase=Failed for evicted pods. Having pods in terminal phases (i.e. Failed / Succeeded), but Ready=true should not be a possible status update from kubelet to api-server.
How can we reproduce it (as minimally and precisely as possible)?
Create a 1.22+ cluster with the following kubelet configuration
Create the following pod (pod blocks SIGTERM and has 60s termination grace period)
Trigger a graceful shutdown (either by actually shutting down machine or by sending message to dbus)
Observe that the pod reports phase=Failed, but Ready=true for the entire 60s duration of shutdown
Only after 60s (pod fully terminated), is condition Ready=false reported
See the kubelet logs (full logs: https://gist.github.com/728df3b9c53ff3116bb4c2badbc1b34f)
Anything else we need to know?
No response
Kubernetes version
Details
Cloud provider
Details
n/aOS version
Details
Install tools
Details
Container runtime (CRI) and version (if applicable)
Details
Related plugins (CNI, CSI, ...) and versions (if applicable)
Details