Prevent pods from defaulting to zero second grace periods#102025
Prevent pods from defaulting to zero second grace periods#102025smarterclayton wants to merge 5 commits into
Conversation
|
/assign @liggitt |
|
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project. |
|
/priority important-soon |
|
Wow, we have e2e tests that are racing on the same pod name: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/102025/pull-kubernetes-e2e-kind-ipv6/1393342213345251328 These tests were force deleting pods, then starting a pod of the same name and execing to it - which wouldn't necessarily guarantee that an exec session was even going to the new pod (at least, historically it wouldn't). I will fix the e2e tests that assume that deletion like this is safe (it isn't) |
2bd6593 to
372e30e
Compare
|
/hold for reviews and prevent accidental merging :) |
|
This implies we have a kubelet bug where some events are being lost / the worker is saturated, or a call is failing in containerd. Needs investigation. This is yet another reason why force delete in e2e hides SLO failures in underlying components - I would expect subsecond container kills from time the delete is received, even if volume cleanup takes longer. Something very wrong / inadequate is happening in code + environment. @rphillips can you help me find an owner to run down what would cause the kubelet to get this bogged down (might be container runtime, hard to say). |
c2c0045 to
712ea78
Compare
|
@smarterclayton it needs some tuning
I think we should merge this as soon as possible |
|
Talking with jordan he suggested three things:
all of which are reasonable as we are changing the behavior of the system and it might impact someone creating a pod and then deleting and expect immediate deletion (due to the implicit zero). I'm going to draft a KEP which clarifies the behavior and brings in pod safety rules (still the original guiding element). |
The zero grace period is a special value and is intended for exceptional use (to break the safety guarantees when a node is partitioned). When spec.terminationGracePeriodSeconds is zero, pods are instantly deleted without waiting for the kubelet to ensure the process is not running. The intent of this default was not to allow users to bypass that process (since the normal behavior of a pod is not to violate kubelet shutdown) and the documentation was written as such. Users who wish to bypass the kubelet's participation must set gracePeriod to zero on the delete options call. The value one is substituted when users specify 0, resulting in the kubelet participating in the deletion process and only a minor delay.
The kubelet owns termination of scheduled pods. In integration tests, run pod deletion on pods that have requested termination as necessary.
Previously these tests were relying on force deletion, which allowed new pods with the same name to be created before the pods on the nodes completed. This meant that a test that did: 1. create pod foo/bar 2. exec pod foo/bar "run some test" 3. delete pod foo/bar (force) 4. create pod foo/bar 5. exec pod foo/bar "some other test" could end up execing into the terminating container if the delete notification was delayed on watch.
Now that pods are not force deleted, some tests that wait for pods to be deleted take a realistic amount of time, which is about a minute from deletion request to kubelet finally being able to delete the pod from etcd due to latency in the status pipeline in dense e2e tests. Increase the timeout of this test to two minutes.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: smarterclayton The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Hrm, there's a very interesting bug in https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/102025/pull-kubernetes-e2e-kind-ipv6/1498726712261742592: The status update for agnhost is delivered to the kubelet (we detect that container agnhost started and thus transition the phase to running) but then the kubelet logs show that the status update was "unchanged" at 18:55:34.494020 and then at 18:55:56.088086 we sent a patch that shows the container in waiting. I think this is head of line blocking that #107897 fixes (we put status updates in a queue and read them in order, but we really only need the most recent status). If so, unrelated to this PR. |
|
@smarterclayton: PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
|
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
|
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
|
@k8s-triage-robot: Closed this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/reopen This is still relevant. I ran into this unexpected behavior of |
|
@pohly: Reopened this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@smarterclayton: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
|
@k8s-triage-robot: Closed this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/reopen |
|
@pohly: Reopened this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
|
@k8s-triage-robot: Closed this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
A zero second grace period on pods is a special value and is intended for exceptional use (to break the safety guarantees when a node is partitioned). When spec.terminationGracePeriodSeconds (which overrides the default of 30s) is zero, pods are instantly deleted without waiting for the kubelet to ensure the process is not running. The intent of this default was not to allow users to bypass that process (since the normal behavior of a pod is not to bypass kubelet shutdown) and the documentation was written as such. Users who wish to bypass the kubelet's participation must set gracePeriod to zero on the delete options call.
With this change the value 1 second is substituted when users specify 0, resulting in the kubelet participating in the deletion process and only a minor delay in practice.
The controlling design for pod safety is https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/pod-safety.md and I believe we simply missed this during the early review and no one ever blew themselves up with this footgun. We did not intend to offer the footgun as a feature of defaulting as described here in the doc:
That matches the value we placed in the field godoc:
The intent was the default is the minimum, that the kubelet would be involved in deletion, and that gracePeriod=0 on a delete operation was the only way a user would be able to perform force delete. Unfortunately we didn't actually handle 0 specially at the time we implemented this and it has remained ambiguous since then.
As this changes the observed behavior of the system (prioritizing safety and the documented behavior over the unsafe behavior), this is also an API change. Kube is consistent by default, and bypassing the kubelet is both a safety (accidental consistency loss) and operational hazard (deleting a large number of pods without the kubelet means that resources are still allotted to those workers and could take an arbitrary amount of time to clean up).
Discovered while doing a review of clusters in the wild - some workloads were setting this thinking that it simply meant "as fast as possible".
/kind bug
/kind api-change
/sig node