Description
We have a plugin that is registered with containerd externally (as opposed to being pre-registered). This plugin is deployed as a k8s DaemonSet.
We've detected a deadlock in containerd v1.7.13 (which uses containerd/nri 0.4.0), and it appears to still be unfixed.
Two locks are involved: the adaptation.go lock and the nri.go lock.
The deadlock can happen because two independent goroutines acquire these locks in opposite order:
- During plugin registration, the adaptation.go lock is acquired and then syncFn is invoked. In containerd, syncFn is (*local).syncPlugin in nri.go, which immediately tries to acquire the nri.go lock.
- An independent StartContainer call can occur in which the nri.go lock is acquired; the call then goes through adaptation.go (StartContainer, then StateChange), which tries to acquire the adaptation.go lock. Other events do exactly the same, so this is not limited to StartContainer.
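The inverted ordering can be reproduced in isolation. The following is a minimal sketch, not containerd or NRI code: the two sync.Mutex values are hypothetical stand-ins for the adaptation.go and nri.go locks, and a polling TryLock with a timeout (Go 1.18+) replaces the blocking Lock so that the program reports the deadlock instead of hanging forever.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tryLockFor polls m.TryLock (Go 1.18+) until it succeeds or d elapses.
func tryLockFor(m *sync.Mutex, d time.Duration) bool {
	deadline := time.Now().Add(d)
	for time.Now().Before(deadline) {
		if m.TryLock() {
			return true
		}
		time.Sleep(time.Millisecond)
	}
	return false
}

// simulate forces the interleaving described above and returns how many of
// the two goroutines failed to obtain their second lock.
func simulate() int {
	var adaptationMu, nriMu sync.Mutex // hypothetical stand-ins for the two locks
	var bothHoldFirst sync.WaitGroup
	bothHoldFirst.Add(2)
	results := make(chan bool, 2)
	release := make(chan struct{})

	// path takes its first lock, waits until the other goroutine holds its own
	// first lock, then tries the second one. The first lock stays held until
	// release is closed, so neither goroutine can succeed late just because
	// the other one gave up.
	path := func(first, second *sync.Mutex) {
		first.Lock()
		defer first.Unlock()
		bothHoldFirst.Done()
		bothHoldFirst.Wait()
		ok := tryLockFor(second, 100*time.Millisecond)
		if ok {
			second.Unlock()
		}
		results <- ok
		<-release
	}

	go path(&adaptationMu, &nriMu) // plugin registration, then syncFn
	go path(&nriMu, &adaptationMu) // StartContainer, then StateChange

	stuck := 0
	for i := 0; i < 2; i++ {
		if !<-results {
			stuck++
		}
	}
	close(release)
	return stuck
}

func main() {
	fmt.Println("goroutines stuck:", simulate()) // prints "goroutines stuck: 2"
}
```

With real blocking Lock calls instead of the timed TryLock, both goroutines would wait indefinitely, which matches the 1129-minute waits in the traces.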
The stack traces that confirm this are below.
The plugin registration stack trace:
goroutine 2650 [sync.Mutex.Lock, 1129 minutes]:
sync.runtime_SemacquireMutex(0xc001b82600?, 0x26?, 0xc0014c9af8?)
/usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc0000a1600)
/usr/lib64/go/x86_64-cros-linux-gnu/src/sync/mutex.go:171 +0x15d
sync.(*Mutex).Lock(...)
/usr/lib64/go/x86_64-cros-linux-gnu/src/sync/mutex.go:90
github.com/containerd/containerd/pkg/nri.(*local).syncPlugin(0xc0000a1600, {0x58c98a692b40, 0x58c98b49b600}, 0xc001c96930)
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/pkg/nri/nri.go:440 +0x74
github.com/containerd/nri/pkg/adaptation.(*Adaptation).acceptPluginConnections.func1()
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/containerd/nri/pkg/adaptation/adaptation.go:424 +0x1c4
created by github.com/containerd/nri/pkg/adaptation.(*Adaptation).acceptPluginConnections in goroutine 358
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/containerd/nri/pkg/adaptation/adaptation.go:403 +0xcd
The StartContainer stack trace:
goroutine 2636 [sync.Mutex.Lock, 1129 minutes]:
sync.runtime_SemacquireMutex(0x7ba912937f18?, 0x80?, 0xc0012e4c00?)
/usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc0002f8a00)
/usr/lib64/go/x86_64-cros-linux-gnu/src/sync/mutex.go:171 +0x15d
sync.(*Mutex).Lock(...)
/usr/lib64/go/x86_64-cros-linux-gnu/src/sync/mutex.go:90
github.com/containerd/nri/pkg/adaptation.(*Adaptation).StateChange(0x58c98a69de58?, {0x58c98a692b78, 0xc001a8a4b0}, 0xc001a21bc0)
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/containerd/nri/pkg/adaptation/adaptation.go:285 +0x85
github.com/containerd/nri/pkg/adaptation.(*Adaptation).StartContainer(...)
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/containerd/nri/pkg/adaptation/adaptation.go:216
github.com/containerd/containerd/pkg/nri.(*local).StartContainer(0xc0000a1600, {0x58c98a692b78, 0xc001a8a4b0}, {0x58c98a69ce30?, 0xc001c408d0?}, {0x58c98a69de58, 0xc00269b2f0})
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/pkg/nri/nri.go:290 +0x19f
github.com/containerd/containerd/pkg/cri/nri.(*API).StartContainer(0xc0000a1760, {0x58c98a692b78, 0xc001a8a4b0}, 0x6?, 0x0?)
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/pkg/cri/nri/nri_api_linux.go:156 +0xdc
github.com/containerd/containerd/pkg/cri/server.(*criService).StartContainer(0xc0001e3b00, {0x58c98a692b78?, 0xc001a8a4b0}, 0xc00209c0a8)
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/pkg/cri/server/container_start.go:158 +0x150b
github.com/containerd/containerd/pkg/cri/instrument.(*instrumentedService).StartContainer(0xc0004f7330, {0x58c98a692b78?, 0xc001a8a270}, 0xc00209c0a8)
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/pkg/cri/instrument/instrumented_service.go:507 +0x1db
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_StartContainer_Handler.func1({0x58c98a692b78, 0xc001a8a270}, {0x58c98a5cdd40?, 0xc00209c0a8})
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/k8s.io/cri-api/pkg/apis/runtime/v1/api.pb.go:10863 +0x75
github.com/containerd/containerd/services/server.unaryNamespaceInterceptor({0x58c98a692b78, 0xc001a8a270}, {0x58c98a5cdd40, 0xc00209c0a8}, 0xc000124478?, 0xc00209c0c0)
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/services/server/namespace.go:31 +0x65
github.com/containerd/containerd/services/server.New.ChainUnaryServer.func5.1.1({0x58c98a692b78?, 0xc001a8a270?}, {0x58c98a5cdd40?, 0xc00209c0a8?})
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x37
github.com/grpc-ecosystem/go-grpc-prometheus.init.(*ServerMetrics).UnaryServerInterceptor.func3({0x58c98a692b78, 0xc001a8a270}, {0x58c98a5cdd40, 0xc00209c0a8}, 0xc0017e95b0?, 0xc00228c0c0)
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server_metrics.go:107 +0x83
github.com/containerd/containerd/services/server.New.ChainUnaryServer.func5.1.1({0x58c98a692b78?, 0xc001a8a270?}, {0x58c98a5cdd40?, 0xc00209c0a8?})
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x37
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1({0x58c98a692b78, 0xc001a8a1b0}, {0x58c98a5cdd40, 0xc00209c0a8}, 0xc00228c0a0, 0xc00228c0e0)
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go:376 +0x5cd
github.com/containerd/containerd/services/server.New.ChainUnaryServer.func5.1.1({0x58c98a692b78?, 0xc001a8a1b0?}, {0x58c98a5cdd40?, 0xc00209c0a8?})
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x37
github.com/containerd/containerd/services/server.New.ChainUnaryServer.func5({0x58c98a692b78, 0xc001a8a1b0}, {0x58c98a5cdd40, 0xc00209c0a8}, 0xc000f56a38?, 0x58c98a3f4400?)
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34 +0xb5
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_StartContainer_Handler({0x58c98a63a3a0?, 0xc0004f7330}, {0x58c98a692b78, 0xc001a8a1b0}, 0xc00149c070, 0xc0002000c0)
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/k8s.io/cri-api/pkg/apis/runtime/v1/api.pb.go:10865 +0x135
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00044a000, {0x58c98a69bb00, 0xc0018cc000}, 0xc001608000, 0xc000200d50, 0x58c98b3e2b28, 0x0)
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:1374 +0xde7
google.golang.org/grpc.(*Server).handleStream(0xc00044a000, {0x58c98a69bb00, 0xc0018cc000}, 0xc001608000, 0x0)
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:1751 +0x9e7
google.golang.org/grpc.(*Server).serveStreams.func1.1()
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:986 +0xbb
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 861
/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:997 +0x145
The effects of this bug are that the plugin is stuck (its Synchronize callback is never invoked) and containerd cannot process certain events (such as StartContainer). The only remedy appears to be restarting containerd.
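For illustration only: a common, generic way to remove this class of deadlock is to make every path acquire the two locks in one fixed order. The sketch below is not the upstream fix and does not model NRI's synchronization semantics; the names locks, nriMu, adaptMu and withBoth are hypothetical. Because both the registration-style and the StartContainer-style goroutines take nriMu before adaptMu, no goroutine can hold one lock while waiting for the other.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical model of the two locks: nriMu stands in for the nri.go lock,
// adaptMu for the adaptation.go lock.
type locks struct {
	nriMu   sync.Mutex
	adaptMu sync.Mutex
}

// withBoth always acquires nriMu before adaptMu, so two callers can never
// each hold one lock while waiting for the other.
func (l *locks) withBoth(fn func()) {
	l.nriMu.Lock()
	defer l.nriMu.Unlock()
	l.adaptMu.Lock()
	defer l.adaptMu.Unlock()
	fn()
}

// run starts n registration-style and n StartContainer-style goroutines that
// both go through withBoth, and returns how many of them completed.
func run(n int) int {
	var l locks
	var wg sync.WaitGroup
	completed := 0 // only modified while both locks are held
	for i := 0; i < n; i++ {
		wg.Add(2)
		go func() { // plugin registration followed by syncFn
			defer wg.Done()
			l.withBoth(func() { completed++ })
		}()
		go func() { // StartContainer followed by StateChange
			defer wg.Done()
			l.withBoth(func() { completed++ })
		}()
	}
	wg.Wait()
	return completed
}

func main() {
	fmt.Println("completed:", run(100)) // prints "completed: 200"
}
```

In the real code the registration path lives in the nri package while syncFn lives in containerd, so enforcing a single order there would likely need changes on both sides; this sketch only shows the principle.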