runc.v1 and runc.v2 don't clean up containers after shim dies unexpectedly. #3199

@Random-Liu

Description

Shim V1

$ crictl runp sandbox.json 
e4d06a0eaca589d400477caaf7603c1f92b052978445dee646d0e215f86208ef
$ crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT
e4d06a0eaca58       2 seconds ago       Ready               nginx-sandbox1      default1            1
$ ps aux | grep containerd-shim
root     232874  0.0  0.0 108744 10172 pts/0    Sl   18:42   0:00 containerd-shim -namespace k8s.io -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/e4d06a0eaca589d400477caaf7603c1f92b052978445dee646d0e215f86208ef -address /run/containerd/containerd.sock -containerd-binary /usr/local/google/home/lantaol/workspace/src/github.com/containerd/containerd/bin/containerd -debug
root     232962  0.0  0.0  12788   948 pts/1    S+   18:42   0:00 grep containerd-shim
$ ps aux | grep pause
root     232891  0.0  0.0   1024     4 ?        Ss   18:42   0:00 /pause
root     232977  0.0  0.0  12788   956 pts/1    S+   18:42   0:00 grep pause
$ sudo kill -9 232874
$ ps aux | grep pause
root     232990  0.0  0.0  12788   988 pts/1    S+   18:42   0:00 grep pause
$ crictl pods
POD ID              CREATED              STATE               NAME                NAMESPACE           ATTEMPT
e4d06a0eaca58       About a minute ago   NotReady            nginx-sandbox1      default1            1

The containerd log shows the cleanup after the shim is killed:

INFO[2019-04-09T18:42:38.825242408-07:00] shim reaped                                   id=e4d06a0eaca589d400477caaf7603c1f92b052978445dee646d0e215f86208ef
WARN[2019-04-09T18:42:38.825348556-07:00] cleaning up after killed shim                 id=e4d06a0eaca589d400477caaf7603c1f92b052978445dee646d0e215f86208ef namespace=k8s.io
DEBU[2019-04-09T18:42:38.993306003-07:00] event published                               ns=k8s.io topic=/tasks/exit type=containerd.events.TaskExit
DEBU[2019-04-09T18:42:38.993488984-07:00] Received containerd event timestamp - 2019-04-10 01:42:38.993277981 +0000 UTC, namespace - "k8s.io", topic - "/tasks/exit" 
INFO[2019-04-09T18:42:38.993646968-07:00] TaskExit event &TaskExit{ContainerID:e4d06a0eaca589d400477caaf7603c1f92b052978445dee646d0e215f86208ef,ID:e4d06a0eaca589d400477caaf7603c1f92b052978445dee646d0e215f86208ef,Pid:232891,ExitStatus:137,ExitedAt:2019-04-10 01:42:38.993199959 +0000 UTC,XXX_unrecognized:[],} 
DEBU[2019-04-09T18:42:38.993855090-07:00] event published                               ns=k8s.io topic=/tasks/delete type=containerd.events.TaskDelete

Shim V2

$ crictl runp sandbox.json 
27f9777addd276ba86c6adef159c8d4fb786484a0630ac674381145f0fd66c9b
$ crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT
27f9777addd27       3 seconds ago       Ready               nginx-sandbox1      default1            1
$ ps aux | grep containerd-shim
root     230730  0.1  0.0 110240 10032 pts/0    Sl   18:39   0:00 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id 27f9777addd276ba86c6adef159c8d4fb786484a0630ac674381145f0fd66c9b -address /run/containerd/containerd.sock -publish-binary /usr/local/google/home/lantaol/workspace/src/github.com/containerd/containerd/bin/containerd
root     230819  0.0  0.0  12788   984 pts/1    S+   18:39   0:00 grep containerd-shim
$ ps aux | grep pause
root     230749  0.0  0.0   1024     4 ?        Ss   18:39   0:00 /pause
root     230828  0.0  0.0  12788   944 pts/1    S+   18:39   0:00 grep pause
$ sudo kill -9 230730
$ ps aux | grep pause
root     230749  0.0  0.0   1024     4 ?        Ss   18:39   0:00 /pause
root     230837  0.0  0.0  12788  1000 pts/1    S+   18:39   0:00 grep pause
$ ps aux | grep containerd-shim
root     230841  0.0  0.0  12788   996 pts/1    S+   18:39   0:00 grep containerd-shim
$ crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT
27f9777addd27       35 seconds ago      Ready               nginx-sandbox1      default1            1

In shim v1, we clean up after the containerd-shim process exits: https://github.com/containerd/containerd/blob/master/runtime/v1/shim/client/client.go#L98.

In shim v2, shim start is a short-running process, so we can't rely on waiting for it to exit. However, we can probably clean up based on the ttrpc connection, by putting the cleanup logic in client.OnClose.

When the connection is lost, we can:

  1. SIGKILL the shim process to make sure it is dead. We can get the shim pid from Connect;
  2. Call shim delete to do the cleanup (same as V1).
