Prevent Kubelet from incorrectly interpreting "not yet started" pods as "ready to terminate pods" by unifying responsibility for pod lifecycle into pod worker#102344
Merged
Conversation
9679e37 to
54021e0
Compare
Contributor
Author
|
/priority critical-urgent Race conditions in the kubelet mean that quickly deleted pods can leave dangling resources for significant amounts of time (although this is enough of a change that we shouldn't rush to merge it until we're confident it's both correct, safe, and improves testing overall) |
Member
|
/triage accepted |
Member
|
Doc is https://docs.google.com/document/d/1DvAmqp9CV8i4_zYvNdDWZW-V0FMk1NbZ8BAOvLRD3Ds/edit# (link in first comment is the RH internal draft) |
54021e0 to
e4d4b3b
Compare
This was referenced Jul 19, 2021
This was referenced Jul 21, 2021
This was referenced Jul 27, 2021
Closed
bobbypage
added a commit
to bobbypage/kubernetes
that referenced
this pull request
Nov 4, 2021
* Bump the pod status and node status update timeouts to avoid flakes * Add a small delay after dbus restart to ensure dbus has enough time to restart to startup prior to sending shutdown signal * Change check of pod being terminated by graceful shutdown. Previously, the pod phase was checked to see if it was `Failed` and the pod reason string matched. This logic needs to change after 1.22 graceful node shutdown change introduced in PR kubernetes#102344 which changed behavior to no longer put the pods into a failed phase. Instead, the test now checks that containers are not ready, and the pod status message and reason are set appropriately. Signed-off-by: David Porter <david@porter.me>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
A number of race conditions exist when pods are terminated early in their lifecycle because components in the kubelet need to know "no running containers" or "containers can't be started from now on" but were relying on outdated state (they don't know whether the pod is setting up, tearing down, or can never be started again).
Only the pod worker knows whether containers are being started for a given pod, which is required to know when a pod is "terminated" (no running containers, none coming). Move that responsibility and podKiller function into the pod workers, and have everything that was killing the pod go into the UpdatePod loop. Have pod workers remain running until the pod is ready to be deleted in etcd (PodResourcesAreReclaimed would return true). Split syncPod into three phases - setup, terminate containers, and cleanup pod - and have transitions between those methods be visible to other components. After this change, to kill a pod you tell the pod worker to UpdatePod({UpdateType: SyncPodKill, Pod: pod}).
Several places in the kubelet were incorrect about whether they were handling terminating (should stop running, might have
containers) or terminated (no running containers) pods. The pod worker exposes methods that allow other loops to know when to set up or tear down resources based on the state of the pod - these methods remove the possibility of race conditions by ensuring a single component is responsible for knowing each pod's allowed state and other components
simply delegate to checking whether they are in the window by UID.
In addition, removing containers now no longer blocks final pod deletion in the API server and are handled as background cleanup.
See https://docs.google.com/document/d/1DvAmqp9CV8i4_zYvNdDWZW-V0FMk1NbZ8BAOvLRD3Ds/edit# for details
TODO:
What type of PR is this?
/kind bug
/kind flake
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: