-
Notifications
You must be signed in to change notification settings - Fork 3.8k
fix init process exited event lost #10603
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @ningmingxiao. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
I add some sleep can reproduce it crictl create pod |
1cd8061 to
6672b8e
Compare
4009103 to
110d5b5
Compare
|
/ok-to-test |
d393cd4 to
f2eb6e9
Compare
done! thank you. |
2d91851 to
ba07947
Compare
|
@ningmingxiao can you explain why the exits are being lost/how this fixes it? The original code skips exit events for init processes for as long as there is a pending exec for that container, and stores those skipped events in This is because exits for the init process shouldn't be sent while there are still unresolved pending execs. Those skipped events should then be processed after the pending execs are resolved. Simple emitting those exit events where this change does (before the pending execs have been resolved) doesn't look correct (but I need to take a better look). |
|
I also don't really understand your repro instructions: you're adding a Trying to reproduce this, after killing the init process, the container correctly shows as stopped – $ sudo ctr run -d docker.io/library/alpine:latest test sh -c "while true; do sleep 1; done"
$ sudo ctr t exec -d --exec-id 1 test sleep 120
[this hangs because of the introduced sleep][on another terminal]
$ ps aux | grep sleep
root 46277 0.0 0.0 1720 896 ? Ss 12:44 0:00 sh -c while true; do sleep 1; done
root 46381 0.0 0.0 14060 4224 pts/10 S+ 12:44 0:00 sudo ctr t exec -d --exec-id 1 borkeee sleep 120
root 46383 0.0 0.0 14060 1572 pts/9 Ss 12:44 0:00 sudo ctr t exec -d --exec-id 1 borkeee sleep 120
root 46384 0.0 0.2 1987836 23040 pts/9 Sl+ 12:44 0:00 ctr t exec -d --exec-id 1 borkeee sleep 120
root 46404 0.0 0.0 1704 768 ? Ss 12:44 0:00 sleep 120
$ kill -9 46277
$ sudo ctr t ls
TASK PID STATUS
test 46277 STOPPED |
containerd log can't see 1840091 exited log. |
|
@ningmingxiao so this happens if you kill the main process while the exec is starting/ |
I add some log find init exit event will add in skipped and handleProcessExit never run. exited is nil |
|
Where are the logs for the exec started in #10603 (comment)? |
|
another way reproduce don't add sleep crictl ps show container still running. |
96bf2c6 to
7cec07e
Compare
|
kill init children's process when receive init process is exited. and send event later. @corhere @laurazard |
65bf79f to
8868f15
Compare
8077774 to
00ac973
Compare
75369cf to
39d2441
Compare
|
|
||
| func (s *service) processExits() { | ||
| for e := range s.ec { | ||
| s.killInitChildren(e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe this change alone fixes the issue. Even if we kill the execs when the init exits, what's there to prevent new execs from being consistently started (and thereby incrementing s.pendingExecs[c], never allowing us to process the init's exit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after runc kill --all -9 , new exec will return error.
[root@LIN-FB738BFD367 pods]# crictl exec 6318 date
FATA[0000] execing command in container: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "d116cb9da07e0d27d9ddbbc8bd469bbc994adb67e12cbba86cce549148650e31": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
after runc kill -all -9 runc container status is stopped, runc exec will fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about change handleProcessExit to sendProcessExit? we just need init exited is last sent. we should call p.SetExited(e.Status) as soon as possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd have to dig into what the consequences of calling p.SetExited (potentially) long before sending the exit event. Does that mean that if I check the shim status it'll say it's dead, but it hasn't emitted the init exit yet (potentially)? Does that break any kind of explicit (or implicit) contract?
39d2441 to
106e75f
Compare
Signed-off-by: ningmingxiao <ning.mingxiao@zte.com.cn>
106e75f to
0349d8b
Compare
@laurazard after this commit 892dc54
we find init exited events sometimes will be dropped.
fix:#10589
@imo-ininder