kubelet: fix duplicated status updates at pod cleanup#21438
Conversation
|
/cc @smarterclayton |
|
Labelling this PR as size/M |
|
FYI, when deleting 100 pods on a node, the status update channel was completely flooded. |
There was a problem hiding this comment.
If the cached status is not found, oldStatus will be nil here, which I think will lead to a nil pointer dereference in copyStatus?
There was a problem hiding this comment.
ah I meant to set that to the pod.Status, but lost it during rebasing. Will fix it.
d531fce to
58087f8
Compare
|
GCE e2e test build/test passed for commit d531fce733d2416870192e03ac685b872097a7d5. |
|
LGTM once the nil pointer is fixed. |
Thanks for the review! It's fixed. |
|
GCE e2e test build/test passed for commit 58087f835b69b6427415ea5da6a185c1e36d1923. |
|
PR needs rebase |
cleanupTerminatedPods is responsible for checking whether a pod has been terminated and force a status update to trigger the pod deletion. However, this function is called in the periodic clenup routine, which runs every 2 seconds. In other words, it forces a status update for each non-running (and not yet deleted in the apiserver) pod. When batch deleting tens of pods, the rate of new updates surpasses what the status manager can handle, causing numerous redundant requests (and the status channel to be full). This change forces a status update only when detecting the DeletionTimestamp is set for a terminated pod. Note that for other non-terminated pods, the pod workers should be responsible for setting the correct status after killling all the containers.
58087f8 to
386453a
Compare
|
PR changed after LGTM, removing LGTM. |
|
Trivial rebase (only one line of comments was touched). Re-applying lgtm. |
|
GCE e2e build/test failed for commit 386453a. |
|
GCE e2e test build/test passed for commit 386453a. |
|
@k8s-bot test this Tests are more than 48 hours old. Re-running tests. |
|
GCE e2e test build/test passed for commit 386453a. |
|
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge] |
|
GCE e2e test build/test passed for commit 386453a. |
|
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge] |
|
GCE e2e test build/test passed for commit 386453a. |
|
Automatic merge from submit-queue |
Auto commit by PR queue bot
|
It seems that this PR somehow broke kubemark - the last two runs of kubemark are failing with "time out on deleting replication controllers", which seems clearly related to this PR ... |
|
@yujuhong - just to give you an evidence that, these are the logs from the failure: Note that the phase "Terminating RC ..." before you change took ~2s, now it's timing out on 2 minutes. BTW - this is happening only on kubemark, something strange is now happening in SimpleKubelet? @gmarek can potentially give you more details on what kubemark is using |
|
And to confirm - reverting this PR fixed the problem. |
I suspect the (non-)graceful termination in kubemark doesn't work expected because this PR only removes redundant pod deletion requests (which is a huge burden when batch deleting 100 pods) @Random-Liu has agreed to take a look and see what difference in kubemark and kubelet that'd cause pods not being terminated. |
|
@yujuhong - we are running exactly the same test (the same number of nodes, the same number of pods, etc.) in both: kubemark and real cluster. |
It has nothing to do with batching as you said :) I think the SimpleKubelet may be too simple and not doing things right -- in which case the duplicated, redundant deletion requests in the periodic cleanup routine may have just saved the day. |
|
Yeah - that's definitely possible. |
|
Alright, I believe I found the cause. Graceful termination does not work correctly in kubemark because of the fake docker client. After "stopping" a container, kubelet was unable to detect that the container has been stopped and thus did not consider the pod was fully terminated. We should fix the fake docker client. E.g., repeatedly killing the same container: |
|
@yujuhong Thanks! |
|
cc @fejta |
cleanupTerminatedPods is responsible for checking whether a pod has been
terminated and force a status update to trigger the pod deletion. However, this
function is called in the periodic clenup routine, which runs every 2 seconds.
In other words, it forces a status update for each non-running (and not yet
deleted in the apiserver) pod. When batch deleting tens of pods, the rate of
new updates surpasses what the status manager can handle, causing numerous
redundant requests (and the status channel to be full).
This change forces a status update only when detecting the DeletionTimestamp is
set for a terminated pod. Note that for other non-terminated pods, the pod
workers should be responsible for setting the correct status after killling all
the containers.