Make killing shims more resilient #4204
Conversation
|
Attaching some more relevant information that we found while debugging this issue:
|
Containerd logs are full of […]
|
I uploaded the stack trace dump from a containerd in this state here: https://gist.github.com/ashrayjain/f1bac2cc5bec2af5445268a3e1bc7fef
Zyqsempai left a comment:
Please sign your commit, otherwise CI will fail.
|
Hey @Zyqsempai, I think I signed my commit already. Is there anything else needed to satisfy CI?
|
You will need to rebase on master to pass CI once we merge #4206. Sorry; an upstream change in golangci-lint broke our dev-tool-install script.
Force-pushed c5e0ae6 to 32d081f
|
@estesp no problem, done!
Force-pushed 32d081f to 46ce63c
Force-pushed c4f0810 to c3d0845
|
Build succeeded.
|
Codecov Report
@@ Coverage Diff @@
## master #4204 +/- ##
=======================================
Coverage 38.34% 38.34%
=======================================
Files 90 90
Lines 12728 12728
=======================================
Hits 4881 4881
Misses 7181 7181
Partials 666 666
Continue to review full report at Codecov.
|
|
Is this problem also relevant to the v2 shim?
|
I spent some time trying to look into that late last week; my understanding is that the v2 code doesn't have this same flow of waiting for a complete exit. My problem with the PR is that it effectively changes the semantics of "waitForExit" to "keepTryingToKillForAwhile" :) I'm not sure if @crosbymichael has had a chance to look at this, as he is more intimately familiar with this code.
|
@estesp to provide some more info here: we ran the killsnoop bpf tool (https://github.com/iovisor/bcc/blob/master/tools/killsnoop.py) to inspect and track all the kill signals being sent.

Here is the output from an occurrence of this issue. As you can see, containerd sent a single SIGKILL.

Would this PR be acceptable if we made it clear that the method was retrying the kill as opposed to just waiting for exit?
runtime/v1/shim/client/client.go
Outdated
If the kill doesn't work at this moment, could the caller retry killing the shim? It is reasonable to retry if it fails the first time.
For the mentioned case, if kill -9 fails the first time, can you kill it manually? And could the pid have been reused here (same pid, but for a different container)?
Would you mind checking the wchan result from ps axo pid,cmd,wchan if the shim is still hanging there?
@fuweid yea, we have seen retrying the kill succeed when the first kill fails.
The manual kill succeeds.
31745 containerd-shim -namespace futex_wait_queue_me
is the output for one such shim.
Thanks for the information. And is the shim the same shim? I mean, the pid can be reused.
Yes, I confirmed the shim is the same based on the container id it was associated with.
|
@estesp @BenTheElder @crosbymichael @fuweid is there anything else I can do to help push this forward?
|
I spent a few minutes looking at the call path (services/tasks/local.go …). It does turn out that this code (…) … Seems maybe the right path is to make …
|
I am not sure why …

Back to the issue: for shim v1, containerd doesn't retry … If the … So I think we should check the event handler in the CRI plugin and what error we meet.

Maybe related to #4198, because I checked the log from https://gist.github.com/ashrayjain/f1bac2cc5bec2af5445268a3e1bc7fef and found …

@ashrayjain would you mind providing more information about the CRI plugin event log? like …

cc @estesp
|
ping @ashrayjain
|
Hey @fuweid, apologies for the delay. I'm waiting for another repro of this issue on our cluster to provide you with some more info on this.
|
@fuweid @estesp So we just had another repro of this issue. Here is what we found:

Based on the … On trying to look at the containerd logs, we saw …

Based on how the code generates … and …

For this pod, if we run … There was no process on this host with pid 9768 at this time.

Additionally, for the sandbox in question, we also have … which is at the exact time we observe the first …

Trying to find the "current" pid 9750 on the host, we find … So this is actually a thread being used by an unrelated container.

So in summary, what appears to have happened here is that containerd tried to kill the shim with pid 9768/9750 for sandbox …

@fuweid it seems when you first suggested that the pid might be getting reused, you were on to something ;) Do you folks have ideas on how to handle this situation better?
|
Was that part of the output of 'crictl inspectp'? I wonder if the start time can be obtained to disambiguate whether the same process id was reused.
|
It was part of the …
|
@ashrayjain OK, it is a pid-reuse issue... 😂 I think @tedyu's idea is reasonable to prevent a dead loop … But is there any log related to …
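A sketch of the start-time idea mentioned above: on Linux, field 22 of /proc/[pid]/stat is the process start time in clock ticks since boot, so recording it when the shim starts and comparing before killing would detect a reused pid. statStartTime and the sample stat line below are hypothetical illustrations, not containerd code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// statStartTime extracts the starttime field (field 22, in clock
// ticks since boot) from a /proc/[pid]/stat line. Because the comm
// field (field 2) may itself contain spaces and parentheses, parsing
// must resume after the last ')' rather than naively splitting.
func statStartTime(stat string) (uint64, error) {
	i := strings.LastIndexByte(stat, ')')
	if i < 0 {
		return 0, fmt.Errorf("malformed stat line")
	}
	fields := strings.Fields(stat[i+1:])
	// fields[0] is field 3 (state), so field 22 is fields[19].
	if len(fields) < 20 {
		return 0, fmt.Errorf("short stat line")
	}
	return strconv.ParseUint(fields[19], 10, 64)
}

func main() {
	// A trimmed-down stat line in the real format: pid, comm, state,
	// then numeric fields; 2097 stands in field 22 (starttime).
	line := "9750 (containerd-shim) S 1 9750 9750 0 -1 1077944320 " +
		"1510 0 0 0 2 1 0 0 20 0 12 0 2097 1000000 500 18446744073709551615"
	st, err := statStartTime(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(st) // 2097
}
```

If the starttime recorded at shim launch no longer matches, the pid was reused and the kill loop can stop immediately instead of spinning.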
Force-pushed fba79c8 to e2c200a
|
Build succeeded.
|
Force-pushed e2c200a to 23e069e
|
Build succeeded.
|
|
@fuweid would appreciate a look here when you get a chance. Thanks for continuing to push this forward, I think we are almost there with the fix :)
Currently, we send a single SIGKILL to the shim process once and then we spin in a loop where we use kill(pid, 0) to detect when the pid has disappeared completely. Unfortunately, this has a race condition, since pids can be reused, causing us to spin in an infinite loop when that happens. This adds a timeout to this loop which logs a warning and exits the infinite loop.

Signed-off-by: Ashray Jain <ashrayj@palantir.com>
Force-pushed 23e069e to 3e95727
|
Additionally, I've fixed the initial delay issue, so now there won't be an initial 10ms wait.
|
Build succeeded.
|
|
LGTM. This is fine for v1, but the proper fix is to move to v2 for the runtime shim :)
|
Thanks for your first PR, @ashrayjain!
|
@ashrayjain Thanks for your patience. |
|
@containerd/containerd-release is it good to cherry-pick into release/1.3? |
|
@fuweid is there another 1.3.x release scheduled in the near future? |
|
@ashrayjain #4307 has been merged. Not sure about the 1.3.5 release schedule :)
This fixes containerd/cri#1427