Description
I am using containerd as the Kubernetes container runtime. It works fine with my configuration, but after a node reboot it takes hours for containerd to finish starting up.
It looks like it hangs for hours on loading the plugin io.containerd.runtime.v2.task.
Steps to reproduce the issue:
- Prepare a node with containerd as the kubelet runtime.
- Make sure there are workloads scheduled and running on the node.
- Reboot the node via reboot now.
- containerd gets stuck during startup.
Describe the results you received:
From the log you can see that it takes about 100 seconds for containerd to clean up the first dead shim.
So I guess the more workloads I have, the longer startup takes after a reboot.
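If that guess is right, total startup time should be roughly the number of dead shims times the dial timeout: judging by the stack trace further down, leftover tasks seem to be loaded one at a time, and each load blocks on a shim dial that only gives up after a fixed timeout. A minimal Go sketch of that pattern (my own illustration, not containerd's actual code; the socket paths and the retry loop are assumptions):

```go
// Sketch (assumed behavior, not containerd's code) of why startup time
// scales with the number of dead shims: tasks are loaded sequentially,
// and each dial to a gone shim socket is retried until a fixed timeout
// expires before cleanup moves on to the next one.
package main

import (
	"fmt"
	"net"
	"time"
)

// Matches the 0x174876e800 ns (100 s) value visible in the stack dump below.
const dialTimeout = 100 * time.Second

// dialShim keeps retrying the shim socket until the deadline, mimicking a
// dialer that tolerates a shim that has not come up yet.
func dialShim(socket string, timeout time.Duration) (net.Conn, error) {
	deadline := time.Now().Add(timeout)
	for {
		conn, err := net.DialTimeout("unix", socket, timeout)
		if err == nil {
			return conn, nil
		}
		if time.Now().After(deadline) {
			return nil, fmt.Errorf("timed out dialing %s: %w", socket, err)
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	// Hypothetical leftover shim sockets from before the reboot.
	shims := []string{"/run/shim-a.sock", "/run/shim-b.sock"}
	start := time.Now()
	for _, s := range shims { // sequential: total cost ≈ len(shims) * dialTimeout
		conn, err := dialShim(s, dialTimeout)
		if err != nil {
			fmt.Println("cleaning up dead shim:", err)
			continue
		}
		conn.Close()
	}
	fmt.Println("task loading blocked for", time.Since(start))
}
```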
containerd logs
$ systemctl status containerd
INFO[2021-06-11T11:18:11.556160877+08:00] starting containerd revision=36cc874494a56a253cd181a1a685b44b58a2e34a version=v1.5.2
INFO[2021-06-11T11:18:11.642627723+08:00] loading plugin "io.containerd.content.v1.content"... type=io.containerd.content.v1
INFO[2021-06-11T11:18:11.642838029+08:00] loading plugin "io.containerd.snapshotter.v1.devmapper"... type=io.containerd.snapshotter.v1
WARN[2021-06-11T11:18:11.643056000+08:00] failed to load plugin io.containerd.snapshotter.v1.devmapper error="devmapper not configured"
INFO[2021-06-11T11:18:11.643122664+08:00] loading plugin "io.containerd.snapshotter.v1.native"... type=io.containerd.snapshotter.v1
INFO[2021-06-11T11:18:11.643284601+08:00] loading plugin "io.containerd.snapshotter.v1.overlayfs"... type=io.containerd.snapshotter.v1
INFO[2021-06-11T11:18:11.643960691+08:00] loading plugin "io.containerd.snapshotter.v1.zfs"... type=io.containerd.snapshotter.v1
INFO[2021-06-11T11:18:11.644773781+08:00] skip loading plugin "io.containerd.snapshotter.v1.zfs"... error="path /opt/k8s/data/containerd/root/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
INFO[2021-06-11T11:18:11.644879005+08:00] loading plugin "io.containerd.metadata.v1.bolt"... type=io.containerd.metadata.v1
WARN[2021-06-11T11:18:11.645008002+08:00] could not use snapshotter devmapper in metadata plugin error="devmapper not configured"
INFO[2021-06-11T11:18:11.645065449+08:00] metadata content store policy set policy=shared
INFO[2021-06-11T11:18:11.645617484+08:00] loading plugin "io.containerd.differ.v1.walking"... type=io.containerd.differ.v1
INFO[2021-06-11T11:18:11.645704612+08:00] loading plugin "io.containerd.gc.v1.scheduler"... type=io.containerd.gc.v1
INFO[2021-06-11T11:18:11.645907436+08:00] loading plugin "io.containerd.service.v1.introspection-service"... type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646036737+08:00] loading plugin "io.containerd.service.v1.containers-service"... type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646114177+08:00] loading plugin "io.containerd.service.v1.content-service"... type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646178011+08:00] loading plugin "io.containerd.service.v1.diff-service"... type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646245397+08:00] loading plugin "io.containerd.service.v1.images-service"... type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646323781+08:00] loading plugin "io.containerd.service.v1.leases-service"... type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646402249+08:00] loading plugin "io.containerd.service.v1.namespaces-service"... type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646499042+08:00] loading plugin "io.containerd.service.v1.snapshots-service"... type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646578543+08:00] loading plugin "io.containerd.runtime.v1.linux"... type=io.containerd.runtime.v1
INFO[2021-06-11T11:18:11.646897712+08:00] loading plugin "io.containerd.runtime.v2.task"... type=io.containerd.runtime.v2
WARN[2021-06-11T11:19:51.649897854+08:00] cleaning up after shim disconnected id=743c9ad7d39d31585dde634b8dc7bfabf92406ab3b39585ea829a4b77e17117c namespace=k8s.io
INFO[2021-06-11T11:19:51.649993774+08:00] cleaning up dead shim
WARN[2021-06-11T11:19:51.754989622+08:00] cleanup warnings time="2021-06-11T11:19:51+08:00" level=info msg="starting signal loop" namespace=k8s.io pid=4020
Describe the results you expected:
containerd should start up within a few seconds after a reboot.
What version of containerd are you using:
$ containerd --version
containerd github.com/containerd/containerd v1.5.2 36cc874494a56a253cd181a1a685b44b58a2e34a
Any other relevant information (runC version, CRI configuration, OS/Kernel version, etc.):
runc --version
$ runc --version
runc version 1.0.0-rc95
commit: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
spec: 1.0.2-dev
go: go1.16.4
libseccomp: 2.3.1
containerd config.toml
version = 2
root = "/opt/k8s/data/containerd/root"
state = "/opt/k8s/data/containerd/state"
plugin_dir = ""
disabled_plugins = ["io.containerd.snapshotter.v1.aufs", "io.containerd.snapshotter.v1.btrfs"]
required_plugins = ["io.containerd.grpc.v1.cri"]
oom_score = -999
[grpc]
  address = "/run/containerd/containerd.sock"
  uid = 0
  gid = 0
[ttrpc]
  address = ""
  uid = 0
  gid = 0
[debug]
  address = ""
  uid = 0
  gid = 0
  level = ""
[metrics]
  address = ""
  grpc_histogram = false
[cgroup]
  path = "/kubereserved.slice"
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    disable_tcp_service = true
    enable_selinux = false
    sandbox_image = "k8s.gcr.io/pause:3.2"
    enable_tls_streaming = false
    max_container_log_line_size = 16384
    disable_cgroup = false
    disable_apparmor = false
    restrict_oom_score_adj = false
    max_concurrent_downloads = 3
    disable_proc_mount = false
    unset_seccomp_profile = ""
    tolerate_missing_hugetlb_controller = false
    disable_hugetlb_controller = false
    ignore_image_defined_volumes = false
    netns_mounts_under_state_dir = false
    [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "overlayfs"
      default_runtime_name = "runc"
      disable_snapshot_annotations = false
      discard_unpacked_layers = false
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
          privileged_without_host_devices = false
          base_runtime_spec = ""
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            BinaryName = "runc"
            NoPivotRoot = false
            NoNewKeyring = false
            IoUid = 0
            IoGid = 0
            SystemdCgroup = true
            Root = "/opt/k8s/data/containerd/runc"
    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/k8s/cni"
      conf_dir = "/opt/k8s/cni/conf"
      max_conf_num = 1
    [plugins."io.containerd.grpc.v1.cri".registry]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://registry-1.docker.io"]
    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = ""
  [plugins."io.containerd.metadata.v1.bolt"]
    content_sharing_policy = "shared"
  [plugins."io.containerd.monitor.v1.cgroups"]
    no_prometheus = false
uname -r
$ uname -r
3.10.0-1062.el7.x86_64
Here is the stack trace I captured using SIGUSR1; the main goroutine is definitely stuck in loadExistingTasks.
I will dig deeper to see why.
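For reference, containerd writes a full goroutine stack dump to its log when it receives SIGUSR1 (its signal handler calls dumpStacks, as goroutine 28 in the trace shows). A tiny Go helper that does the equivalent of kill -USR1 <pid>, with the PID passed in explicitly since how you look it up is setup-specific:

```go
// Send SIGUSR1 to a running containerd so it writes a goroutine stack dump
// to its log output (handled by dumpStacks in cmd/containerd/command, as
// seen in goroutine 28 below).
package main

import (
	"fmt"
	"os"
	"strconv"
	"syscall"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: dumpstacks <containerd-pid>")
		os.Exit(1)
	}
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "bad pid:", err)
		os.Exit(1)
	}
	// Equivalent of: kill -USR1 <pid>
	if err := syscall.Kill(pid, syscall.SIGUSR1); err != nil {
		fmt.Fprintln(os.Stderr, "signal failed:", err)
		os.Exit(1)
	}
}
```

Under systemd the dump then shows up in the unit's journal (journalctl -u containerd).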
containerd stack dump
=== BEGIN goroutine stack dump ===
goroutine 28 [running]:
github.com/containerd/containerd/cmd/containerd/command.dumpStacks(0xc0004f8301)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/cmd/containerd/command/main.go:336 +0x9a
github.com/containerd/containerd/cmd/containerd/command.handleSignals.func1(0xc000090540, 0xc0000904e0, 0x55889a1227f0, 0xc000122000, 0xc00010e180)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/cmd/containerd/command/main_unix.go:57 +0x2ba
created by github.com/containerd/containerd/cmd/containerd/command.handleSignals
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/cmd/containerd/command/main_unix.go:40 +0x8b
goroutine 1 [select]:
github.com/containerd/containerd/pkg/dialer.timeoutDialer(0xc000472667, 0x52, 0x174876e800, 0xc000472601, 0x59, 0x0, 0x0)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/pkg/dialer/dialer.go:66 +0x145
github.com/containerd/containerd/runtime/v2/shim.AnonDialer(0xc000472660, 0x59, 0x174876e800, 0x59, 0x0, 0x0, 0x0)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/shim/util_unix.go:79 +0x88
github.com/containerd/containerd/runtime/v2/shim.AnonReconnectDialer(...)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/shim/util_unix.go:84
github.com/containerd/containerd/runtime/v2/shim.Connect(...)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/shim/util.go:149
github.com/containerd/containerd/runtime/v2.loadShim(0x55889a122860, 0xc00009dc80, 0xc0004a6450, 0xc00049c028, 0xc000590540, 0xc0004c71d0, 0x0, 0x0, 0x0)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/shim.go:70 +0x173
github.com/containerd/containerd/runtime/v2.(*TaskManager).loadTasks(0xc000090ba0, 0x55889a122860, 0xc00009dc80, 0x6, 0x55889a122860)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/manager.go:291 +0x9a8
github.com/containerd/containerd/runtime/v2.(*TaskManager).loadExistingTasks(0xc000090ba0, 0x55889a1227f0, 0xc000122000, 0x0, 0x0)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/manager.go:234 +0x30a
github.com/containerd/containerd/runtime/v2.New(0x55889a1227f0, 0xc000122000, 0xc000499090, 0x42, 0xc0004990e0, 0x43, 0xc0000445a0, 0x1f, 0xc000040bd0, 0x25, ...)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/manager.go:97 +0x21a
github.com/containerd/containerd/runtime/v2.init.0.func1(0xc0003bca80, 0xc00004c440, 0x24, 0xc000509dc0, 0x1d)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/manager.go:76 +0x245
github.com/containerd/containerd/plugin.(*Registration).Init(0xc000219560, 0xc0003bca80, 0x558899e15540)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/plugin/plugin.go:110 +0x3a
github.com/containerd/containerd/services/server.New(0x55889a1227f0, 0xc000122000, 0xc0004924e0, 0x1, 0x1, 0xc000040bd0)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/services/server/server.go:168 +0xd25
github.com/containerd/containerd/cmd/containerd/command.App.func1(0xc0003d6840, 0xc0003d6840, 0xc0004c8a60)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/cmd/containerd/command/main.go:179 +0x77b
github.com/urfave/cli.HandleAction(0x558899e7a880, 0x55889a0c91e0, 0xc0003d6840, 0xc0003d6840, 0x0)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/vendor/github.com/urfave/cli/app.go:523 +0x107
github.com/urfave/cli.(*App).Run(0xc0003a2540, 0xc00012c000, 0x2, 0x2, 0x0, 0x0)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/vendor/github.com/urfave/cli/app.go:285 +0x655
main.main()
github.com/containerd/containerd/cmd/containerd/main.go:33 +0x51
goroutine 19 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x55889aaeec80)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/vendor/k8s.io/klog/v2/klog.go:1169 +0x8d
created by k8s.io/klog/v2.init.0
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/vendor/k8s.io/klog/v2/klog.go:417 +0xdf
goroutine 30 [syscall]:
os/signal.signal_recv(0x55889a105618)
/opt/hostedtoolcache/go/1.16.4/x64/src/runtime/sigqueue.go:168 +0xa5
os/signal.loop()
/opt/hostedtoolcache/go/1.16.4/x64/src/os/signal/signal_unix.go:23 +0x25
created by os/signal.Notify.func1.1
/opt/hostedtoolcache/go/1.16.4/x64/src/os/signal/signal.go:151 +0x46
goroutine 31 [select, 2 minutes]:
github.com/docker/go-events.(*Broadcaster).run(0xc00009a0a0)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/vendor/github.com/docker/go-events/broadcast.go:117 +0x1be
created by github.com/docker/go-events.NewBroadcaster
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/vendor/github.com/docker/go-events/broadcast.go:39 +0x1b5
goroutine 32 [select]:
github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc000090840, 0x55889a1227f0, 0xc000122000)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:268 +0x18f
created by github.com/containerd/containerd/gc/scheduler.init.0.func1
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 +0x445
goroutine 50 [chan receive]:
github.com/containerd/containerd/pkg/dialer.timeoutDialer.func1(0xc000472720, 0xc0004726c0, 0xc000472667, 0x52, 0x174876e800)
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/pkg/dialer/dialer.go:58 +0x77
created by github.com/containerd/containerd/pkg/dialer.timeoutDialer
/home/runner/work/containerd/containerd/src/github.com/containerd/containerd/pkg/dialer/dialer.go:49 +0xc5
=== END goroutine stack dump ===
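One more data point from the dump: the third argument passed to timeoutDialer in goroutines 1 and 50 is 0x174876e800. Read as a Go time.Duration (nanoseconds), that is exactly 100 seconds, the same gap seen in the log between loading io.containerd.runtime.v2.task and the first dead-shim cleanup. A quick check:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Third argument to timeoutDialer in goroutines 1 and 50 above,
	// interpreted as a duration in nanoseconds.
	const dialTimeout = 0x174876e800
	fmt.Println(time.Duration(dialTimeout)) // prints "1m40s", i.e. 100 seconds
}
```

So each dead shim appears to cost the full 100-second dial timeout before cleanup moves on, which would explain hours of startup time on a node with many workloads.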