
Containerd takes hours to startup after node reboot #5597

@Einsfier

Description

I am using containerd as the K8s container runtime. It works fine with my configuration, but after a node reboot it takes hours for containerd to finish starting up.
It appears to hang for hours on loading the plugin io.containerd.runtime.v2.task.

Steps to reproduce the issue:

  1. Prepare a node with containerd as the kubelet runtime.
  2. Make sure workloads have been scheduled and are running on the node (for example, the commands right after this list).
  3. Reboot the node via reboot now.
  4. Observe that containerd gets stuck starting up.
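
For step 2, any pods will do. A hedged example, assuming a working kubectl against the cluster (the deployment name is made up for illustration):

$ kubectl create deployment sleeper --image=nginx --replicas=20
$ kubectl get pods -o wide    # confirm the pods landed on the node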

Describe the results you received:

From the log you can see that it takes about 100 seconds for containerd to clean up the first dead shim.
So I guess the more workloads I have, the longer startup takes after a reboot.
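
A rough worked example of that guess: at about 100 seconds per dead shim, cleaned up one at a time, a node that was running, say, 80 pods before the reboot would need on the order of 80 × 100 s = 8000 s, i.e. over two hours. (The pod count is illustrative; the ~100 s per shim comes from the log below.)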

containerd logs
$ systemctl status containerd

INFO[2021-06-11T11:18:11.556160877+08:00] starting containerd                           revision=36cc874494a56a253cd181a1a685b44b58a2e34a version=v1.5.2
INFO[2021-06-11T11:18:11.642627723+08:00] loading plugin "io.containerd.content.v1.content"...  type=io.containerd.content.v1
INFO[2021-06-11T11:18:11.642838029+08:00] loading plugin "io.containerd.snapshotter.v1.devmapper"...  type=io.containerd.snapshotter.v1
WARN[2021-06-11T11:18:11.643056000+08:00] failed to load plugin io.containerd.snapshotter.v1.devmapper  error="devmapper not configured"
INFO[2021-06-11T11:18:11.643122664+08:00] loading plugin "io.containerd.snapshotter.v1.native"...  type=io.containerd.snapshotter.v1
INFO[2021-06-11T11:18:11.643284601+08:00] loading plugin "io.containerd.snapshotter.v1.overlayfs"...  type=io.containerd.snapshotter.v1
INFO[2021-06-11T11:18:11.643960691+08:00] loading plugin "io.containerd.snapshotter.v1.zfs"...  type=io.containerd.snapshotter.v1
INFO[2021-06-11T11:18:11.644773781+08:00] skip loading plugin "io.containerd.snapshotter.v1.zfs"...  error="path /opt/k8s/data/containerd/root/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
INFO[2021-06-11T11:18:11.644879005+08:00] loading plugin "io.containerd.metadata.v1.bolt"...  type=io.containerd.metadata.v1
WARN[2021-06-11T11:18:11.645008002+08:00] could not use snapshotter devmapper in metadata plugin  error="devmapper not configured"
INFO[2021-06-11T11:18:11.645065449+08:00] metadata content store policy set             policy=shared
INFO[2021-06-11T11:18:11.645617484+08:00] loading plugin "io.containerd.differ.v1.walking"...  type=io.containerd.differ.v1
INFO[2021-06-11T11:18:11.645704612+08:00] loading plugin "io.containerd.gc.v1.scheduler"...  type=io.containerd.gc.v1
INFO[2021-06-11T11:18:11.645907436+08:00] loading plugin "io.containerd.service.v1.introspection-service"...  type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646036737+08:00] loading plugin "io.containerd.service.v1.containers-service"...  type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646114177+08:00] loading plugin "io.containerd.service.v1.content-service"...  type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646178011+08:00] loading plugin "io.containerd.service.v1.diff-service"...  type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646245397+08:00] loading plugin "io.containerd.service.v1.images-service"...  type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646323781+08:00] loading plugin "io.containerd.service.v1.leases-service"...  type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646402249+08:00] loading plugin "io.containerd.service.v1.namespaces-service"...  type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646499042+08:00] loading plugin "io.containerd.service.v1.snapshots-service"...  type=io.containerd.service.v1
INFO[2021-06-11T11:18:11.646578543+08:00] loading plugin "io.containerd.runtime.v1.linux"...  type=io.containerd.runtime.v1
INFO[2021-06-11T11:18:11.646897712+08:00] loading plugin "io.containerd.runtime.v2.task"...  type=io.containerd.runtime.v2
WARN[2021-06-11T11:19:51.649897854+08:00] cleaning up after shim disconnected           id=743c9ad7d39d31585dde634b8dc7bfabf92406ab3b39585ea829a4b77e17117c namespace=k8s.io
INFO[2021-06-11T11:19:51.649993774+08:00] cleaning up dead shim
WARN[2021-06-11T11:19:51.754989622+08:00] cleanup warnings time="2021-06-11T11:19:51+08:00" level=info msg="starting signal loop" namespace=k8s.io pid=4020

Describe the results you expected:

Containerd should start up within a few seconds after a reboot.

What version of containerd are you using:

$ containerd --version
containerd github.com/containerd/containerd v1.5.2 36cc874494a56a253cd181a1a685b44b58a2e34a

Any other relevant information (runC version, CRI configuration, OS/Kernel version, etc.):

runc --version
$ runc --version
runc version 1.0.0-rc95
commit: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
spec: 1.0.2-dev
go: go1.16.4
libseccomp: 2.3.1
containerd config.toml
version = 2
root = "/opt/k8s/data/containerd/root"
state = "/opt/k8s/data/containerd/state"
plugin_dir = ""
disabled_plugins = ["io.containerd.snapshotter.v1.aufs", "io.containerd.snapshotter.v1.btrfs"]
required_plugins = ["io.containerd.grpc.v1.cri"]
oom_score = -999

[grpc]
  address = "/run/containerd/containerd.sock"
  uid = 0
  gid = 0

[ttrpc]
  address = ""
  uid = 0
  gid = 0

[debug]
  address = ""
  uid = 0
  gid = 0
  level = ""

[metrics]
  address = ""
  grpc_histogram = false

[cgroup]
  path = "/kubereserved.slice"

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    disable_tcp_service = true
    enable_selinux = false
    sandbox_image = "k8s.gcr.io/pause:3.2"
    enable_tls_streaming = false
    max_container_log_line_size = 16384
    disable_cgroup = false
    disable_apparmor = false
    restrict_oom_score_adj = false
    max_concurrent_downloads = 3
    disable_proc_mount = false
    unset_seccomp_profile = ""
    tolerate_missing_hugetlb_controller = false
    disable_hugetlb_controller = false
    ignore_image_defined_volumes = false
    netns_mounts_under_state_dir = false
    [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "overlayfs"
      default_runtime_name = "runc"
      disable_snapshot_annotations = false
      discard_unpacked_layers = false
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
          privileged_without_host_devices = false
          base_runtime_spec = ""
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            BinaryName = "runc"
            NoPivotRoot = false
            NoNewKeyring = false
            IoUid = 0
            IoGid = 0
            SystemdCgroup = true
            Root = "/opt/k8s/data/containerd/runc"
    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/k8s/cni"
      conf_dir = "/opt/k8s/cni/conf"
      max_conf_num = 1
    [plugins."io.containerd.grpc.v1.cri".registry]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://registry-1.docker.io"]
    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = ""
  [plugins."io.containerd.metadata.v1.bolt"]
    content_sharing_policy = "shared"
  [plugins."io.containerd.monitor.v1.cgroups"]
    no_prometheus = false
uname -r
$ uname -r
3.10.0-1062.el7.x86_64

Here is a stack trace I captured by sending SIGUSR1 to containerd; the main goroutine is definitely stuck in loadExistingTasks.
I will dig deeper to see why.

containerd stack dump
=== BEGIN goroutine stack dump ===
goroutine 28 [running]:
github.com/containerd/containerd/cmd/containerd/command.dumpStacks(0xc0004f8301)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/cmd/containerd/command/main.go:336 +0x9a
github.com/containerd/containerd/cmd/containerd/command.handleSignals.func1(0xc000090540, 0xc0000904e0, 0x55889a1227f0, 0xc000122000, 0xc00010e180)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/cmd/containerd/command/main_unix.go:57 +0x2ba
created by github.com/containerd/containerd/cmd/containerd/command.handleSignals
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/cmd/containerd/command/main_unix.go:40 +0x8b

goroutine 1 [select]:
github.com/containerd/containerd/pkg/dialer.timeoutDialer(0xc000472667, 0x52, 0x174876e800, 0xc000472601, 0x59, 0x0, 0x0)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/pkg/dialer/dialer.go:66 +0x145
github.com/containerd/containerd/runtime/v2/shim.AnonDialer(0xc000472660, 0x59, 0x174876e800, 0x59, 0x0, 0x0, 0x0)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/shim/util_unix.go:79 +0x88
github.com/containerd/containerd/runtime/v2/shim.AnonReconnectDialer(...)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/shim/util_unix.go:84
github.com/containerd/containerd/runtime/v2/shim.Connect(...)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/shim/util.go:149
github.com/containerd/containerd/runtime/v2.loadShim(0x55889a122860, 0xc00009dc80, 0xc0004a6450, 0xc00049c028, 0xc000590540, 0xc0004c71d0, 0x0, 0x0, 0x0)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/shim.go:70 +0x173
github.com/containerd/containerd/runtime/v2.(*TaskManager).loadTasks(0xc000090ba0, 0x55889a122860, 0xc00009dc80, 0x6, 0x55889a122860)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/manager.go:291 +0x9a8
github.com/containerd/containerd/runtime/v2.(*TaskManager).loadExistingTasks(0xc000090ba0, 0x55889a1227f0, 0xc000122000, 0x0, 0x0)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/manager.go:234 +0x30a
github.com/containerd/containerd/runtime/v2.New(0x55889a1227f0, 0xc000122000, 0xc000499090, 0x42, 0xc0004990e0, 0x43, 0xc0000445a0, 0x1f, 0xc000040bd0, 0x25, ...)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/manager.go:97 +0x21a
github.com/containerd/containerd/runtime/v2.init.0.func1(0xc0003bca80, 0xc00004c440, 0x24, 0xc000509dc0, 0x1d)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/runtime/v2/manager.go:76 +0x245
github.com/containerd/containerd/plugin.(*Registration).Init(0xc000219560, 0xc0003bca80, 0x558899e15540)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/plugin/plugin.go:110 +0x3a
github.com/containerd/containerd/services/server.New(0x55889a1227f0, 0xc000122000, 0xc0004924e0, 0x1, 0x1, 0xc000040bd0)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/services/server/server.go:168 +0xd25
github.com/containerd/containerd/cmd/containerd/command.App.func1(0xc0003d6840, 0xc0003d6840, 0xc0004c8a60)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/cmd/containerd/command/main.go:179 +0x77b
github.com/urfave/cli.HandleAction(0x558899e7a880, 0x55889a0c91e0, 0xc0003d6840, 0xc0003d6840, 0x0)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/vendor/github.com/urfave/cli/app.go:523 +0x107
github.com/urfave/cli.(*App).Run(0xc0003a2540, 0xc00012c000, 0x2, 0x2, 0x0, 0x0)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/vendor/github.com/urfave/cli/app.go:285 +0x655
main.main()
        github.com/containerd/containerd/cmd/containerd/main.go:33 +0x51

goroutine 19 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x55889aaeec80)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/vendor/k8s.io/klog/v2/klog.go:1169 +0x8d
created by k8s.io/klog/v2.init.0
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/vendor/k8s.io/klog/v2/klog.go:417 +0xdf

goroutine 30 [syscall]:
os/signal.signal_recv(0x55889a105618)
        /opt/hostedtoolcache/go/1.16.4/x64/src/runtime/sigqueue.go:168 +0xa5
os/signal.loop()
        /opt/hostedtoolcache/go/1.16.4/x64/src/os/signal/signal_unix.go:23 +0x25
created by os/signal.Notify.func1.1
        /opt/hostedtoolcache/go/1.16.4/x64/src/os/signal/signal.go:151 +0x46

goroutine 31 [select, 2 minutes]:
github.com/docker/go-events.(*Broadcaster).run(0xc00009a0a0)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/vendor/github.com/docker/go-events/broadcast.go:117 +0x1be
created by github.com/docker/go-events.NewBroadcaster
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/vendor/github.com/docker/go-events/broadcast.go:39 +0x1b5

goroutine 32 [select]:
github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc000090840, 0x55889a1227f0, 0xc000122000)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:268 +0x18f
created by github.com/containerd/containerd/gc/scheduler.init.0.func1
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 +0x445

goroutine 50 [chan receive]:
github.com/containerd/containerd/pkg/dialer.timeoutDialer.func1(0xc000472720, 0xc0004726c0, 0xc000472667, 0x52, 0x174876e800)
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/pkg/dialer/dialer.go:58 +0x77
created by github.com/containerd/containerd/pkg/dialer.timeoutDialer
        /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/pkg/dialer/dialer.go:49 +0xc5

=== END goroutine stack dump ===
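
One detail worth pulling out of the dump: the timeout argument to dialer.timeoutDialer in goroutine 1, 0x174876e800, is in nanoseconds, and 0x174876e800 ns = 100,000,000,000 ns = 100 s. That matches the gap in the log above between loading "io.containerd.runtime.v2.task" at 11:18:11.646 and the first "cleaning up after shim disconnected" at 11:19:51.649. Below is a minimal Go sketch of the serial load-and-wait pattern I read out of the trace (loadExistingTasks → loadTasks → loadShim → Connect → timeoutDialer); it is my illustration of the behavior, not containerd's actual code, and the socket paths are made up:

package main

import (
	"fmt"
	"net"
	"time"
)

const dialTimeout = 100 * time.Second // 0x174876e800 ns, as seen in the trace

// dialWithRetry mimics the retry-until-timeout behavior suggested by
// dialer.timeoutDialer: keep retrying while the shim socket is missing.
func dialWithRetry(socket string) error {
	deadline := time.Now().Add(dialTimeout)
	for time.Now().Before(deadline) {
		if conn, err := net.Dial("unix", socket); err == nil {
			conn.Close()
			return nil
		}
		time.Sleep(10 * time.Millisecond)
	}
	return fmt.Errorf("gave up on %s after %s", socket, dialTimeout)
}

func main() {
	// Hypothetical stale shim sockets left over from before the reboot.
	stale := []string{"/run/containerd/s/task-a", "/run/containerd/s/task-b"}
	for _, s := range stale { // serial: one task at a time, as in loadTasks
		if err := dialWithRetry(s); err != nil {
			fmt.Println(err) // only now would "cleaning up dead shim" run
		}
	}
	// Total delay before startup completes ≈ len(stale) × 100 s.
}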
