Skip to content

Cleanup dead v2 shim.#3206

Merged
dmcgowan merged 1 commit intocontainerd:masterfrom
Random-Liu:cleanup-after-deadshim-v2
May 10, 2019
Merged

Cleanup dead v2 shim.#3206
dmcgowan merged 1 commit intocontainerd:masterfrom
Random-Liu:cleanup-after-deadshim-v2

Conversation

@Random-Liu
Copy link
Copy Markdown
Member

@Random-Liu Random-Liu commented Apr 11, 2019

Fixes #3199
Depends on containerd/ttrpc#35.

When containerd-shim is killed when containerd is running:

INFO[2019-04-11T10:39:37.413880563-07:00] shim disconnected                             id=753143c31aa85fe7f1424f1e13cf1891510d377bbc86b8daa520c72b084051b6
WARN[2019-04-11T10:39:37.414006366-07:00] cleaning up after shim disconnected           id=753143c31aa85fe7f1424f1e13cf1891510d377bbc86b8daa520c72b084051b6 namespace=k8s.io
INFO[2019-04-11T10:39:37.414034377-07:00] cleaning up dead shim                        
WARN[2019-04-11T10:39:37.581877377-07:00] cleanup warnings time="2019-04-11T10:39:37-07:00" level=info msg="starting signal loop" namespace=k8s.io pid=75143 
DEBU[2019-04-11T10:39:37.582387101-07:00] event published                               ns=k8s.io topic=/tasks/exit type=containerd.events.TaskExit
DEBU[2019-04-11T10:39:37.582505707-07:00] event published                               ns=k8s.io topic=/tasks/delete type=containerd.events.TaskDelete

When containerd-shim is killed when containerd is down, after containerd restarts:

DEBU[2019-04-11T10:40:35.507193010-07:00] loading tasks in namespace                    namespace=k8s.io
WARN[2019-04-11T10:40:35.507404450-07:00] cleaning up after shim disconnected           id=63d6638d17014c040f4a979bfd854fb541deb116e785c7e1332c1c7bd47b783e namespace=k8s.io
INFO[2019-04-11T10:40:35.507420039-07:00] cleaning up dead shim                        
DEBU[2019-04-11T10:40:35.611805828-07:00] garbage collected                             d=4.998416ms
WARN[2019-04-11T10:40:35.676519575-07:00] cleanup warnings time="2019-04-11T10:40:35-07:00" level=info msg="starting signal loop" namespace=k8s.io pid=75740 
DEBU[2019-04-11T10:40:35.677121491-07:00] event published                               ns=k8s.io topic=/tasks/exit type=containerd.events.TaskExit
DEBU[2019-04-11T10:40:35.677203935-07:00] event published                               ns=k8s.io topic=/tasks/delete type=containerd.events.TaskDelete

I'll add integration test in the CRI plugin for the above 2 cases. Not sure whether we want these kind of tests in the containerd integration test, given that we may need to pgrep containerd-shim and pkill containerd-shim.

Signed-off-by: Lantao Liu lantaol@google.com

@Random-Liu Random-Liu added this to the 1.3 milestone Apr 11, 2019
@Random-Liu Random-Liu force-pushed the cleanup-after-deadshim-v2 branch from b3019e3 to 5e80dad Compare April 11, 2019 18:06
@codecov-io
Copy link
Copy Markdown

codecov-io commented Apr 11, 2019

Codecov Report

Merging #3206 into master will decrease coverage by 4.76%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3206      +/-   ##
==========================================
- Coverage   44.63%   39.87%   -4.77%     
==========================================
  Files         113       76      -37     
  Lines       12164    10106    -2058     
==========================================
- Hits         5430     4030    -1400     
+ Misses       5899     5510     -389     
+ Partials      835      566     -269
Flag Coverage Δ
#linux ?
#windows 39.87% <ø> (ø) ⬆️
Impacted Files Coverage Δ
snapshots/native/native.go 1.79% <0%> (-41.26%) ⬇️
archive/tar.go 16.99% <0%> (-26.8%) ⬇️
metadata/snapshot.go 21.53% <0%> (-24.28%) ⬇️
cio/io.go 1.52% <0%> (-21.38%) ⬇️
content/local/writer.go 57.69% <0%> (-0.97%) ⬇️
gc/scheduler/scheduler.go 66.34% <0%> (-0.97%) ⬇️
oci/spec_opts.go 30.33% <0%> (-0.25%) ⬇️
mount/temp_unix.go
sys/reaper_linux.go
services/server/server_linux.go
... and 34 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a17c809...5c9811d. Read the comment docs.

@Random-Liu Random-Liu force-pushed the cleanup-after-deadshim-v2 branch from 5e80dad to c92a4dc Compare April 11, 2019 18:52
@Random-Liu
Copy link
Copy Markdown
Member Author

Random-Liu commented Apr 11, 2019

Note that we send out TaskExit and TaskDelete event in containerd instead of the shim delete command, because:

  1. The event is generic enough for containerd to send out;
  2. Sending event from the shim added a dependency of the event ttrpc service to the task plugin.

I can move the events into shim delete if people think 2) above is fine.

@jterry75
Copy link
Copy Markdown
Contributor

@Random-Liu - In general this is a great change. We have noticed this as well if the shim panic's for any reason CRI/Containerd struggle to clean up this state. I appreciate the change here! One thing I am concerned about but don't quite know how to fix is when we have a multi-container shim we need to send exits for all containers in the shim. But TaskManager doesn't know its a multi-container shim right? How do we ensure that we get exits for all containers?

@Random-Liu Random-Liu force-pushed the cleanup-after-deadshim-v2 branch from c92a4dc to b010dc4 Compare April 26, 2019 22:09
@Random-Liu
Copy link
Copy Markdown
Member Author

One thing I am concerned about but don't quite know how to fix is when we have a multi-container shim we need to send exits for all containers in the shim. But TaskManager doesn't know its a multi-container shim right? How do we ensure that we get exits for all containers?

I think the cleanup is done container by container, so it should no matter it is a multi-container shim or per-container shim.

Note that we don't try to kill shim here like #3226, because we know there might be multi-container shim.

One downside is that if the shim ttrpc server stops working, but the shim process is still running, containerd will cleanup all the containers, but leave the shim process running. We have no solution for that yet.

@Random-Liu
Copy link
Copy Markdown
Member Author

@crosbymichael Can you take a look at this? Without this, once shim process exits, container will be left running forever.

Signed-off-by: Lantao Liu <lantaol@google.com>
@Random-Liu Random-Liu force-pushed the cleanup-after-deadshim-v2 branch from b010dc4 to 5c9811d Compare April 30, 2019 20:45
@dmcgowan dmcgowan requested a review from crosbymichael May 6, 2019 22:41
@crosbymichael
Copy link
Copy Markdown
Member

LGTM

@crosbymichael
Copy link
Copy Markdown
Member

ping @containerd/containerd-maintainers

Copy link
Copy Markdown
Member

@dmcgowan dmcgowan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

runc.v1 and runc.v2 doesn't cleanup containers after shim dies unexpectedly.

5 participants