NRI plugin registration can trigger a deadlock #10085

@acurtiz

Description

We have a plugin that is registered to containerd externally (as opposed to being pre-registered). This plugin is deployed as a k8s DaemonSet.

We've detected a deadlock in containerd v1.7.13 (which uses containerd/nri 0.4.0). As far as I can tell, it is still unfixed.

There are two involved locks: the adaptation.go lock, and the nri.go lock.

The deadlock can happen because two independent goroutines acquire the locks in opposite order:

  1. During plugin registration, the adaptation.go lock is acquired and then syncFn is invoked. In this case syncFn is (*local).syncPlugin in pkg/nri/nri.go, which immediately tries to acquire the nri.go lock.
  2. Independently, a StartContainer request acquires the nri.go lock in (*local).StartContainer and then calls (*Adaptation).StartContainer, whose StateChange call tries to acquire the adaptation.go lock. Other events take the same path, so this is not limited to StartContainer.

The stack traces that confirm this are below.

The plugin registration stack trace:

goroutine 2650 [sync.Mutex.Lock, 1129 minutes]:
sync.runtime_SemacquireMutex(0xc001b82600?, 0x26?, 0xc0014c9af8?)
	/usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc0000a1600)
	/usr/lib64/go/x86_64-cros-linux-gnu/src/sync/mutex.go:171 +0x15d
sync.(*Mutex).Lock(...)
	/usr/lib64/go/x86_64-cros-linux-gnu/src/sync/mutex.go:90
github.com/containerd/containerd/pkg/nri.(*local).syncPlugin(0xc0000a1600, {0x58c98a692b40, 0x58c98b49b600}, 0xc001c96930)
	/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/pkg/nri/nri.go:440 +0x74
github.com/containerd/nri/pkg/adaptation.(*Adaptation).acceptPluginConnections.func1()
	/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/containerd/nri/pkg/adaptation/adaptation.go:424 +0x1c4
created by github.com/containerd/nri/pkg/adaptation.(*Adaptation).acceptPluginConnections in goroutine 358
	/build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/containerd/nri/pkg/adaptation/adaptation.go:403 +0xcd

The StartContainer stack trace:

goroutine 2636 [sync.Mutex.Lock, 1129 minutes]:
sync.runtime_SemacquireMutex(0x7ba912937f18?, 0x80?, 0xc0012e4c00?)
        /usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc0002f8a00)
        /usr/lib64/go/x86_64-cros-linux-gnu/src/sync/mutex.go:171 +0x15d
sync.(*Mutex).Lock(...)
        /usr/lib64/go/x86_64-cros-linux-gnu/src/sync/mutex.go:90
github.com/containerd/nri/pkg/adaptation.(*Adaptation).StateChange(0x58c98a69de58?, {0x58c98a692b78, 0xc001a8a4b0}, 0xc001a21bc0)
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/containerd/nri/pkg/adaptation/adaptation.go:285 +0x85
github.com/containerd/nri/pkg/adaptation.(*Adaptation).StartContainer(...)
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/containerd/nri/pkg/adaptation/adaptation.go:216
github.com/containerd/containerd/pkg/nri.(*local).StartContainer(0xc0000a1600, {0x58c98a692b78, 0xc001a8a4b0}, {0x58c98a69ce30?, 0xc001c408d0?}, {0x58c98a69de58, 0xc00269b2f0})
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/pkg/nri/nri.go:290 +0x19f
github.com/containerd/containerd/pkg/cri/nri.(*API).StartContainer(0xc0000a1760, {0x58c98a692b78, 0xc001a8a4b0}, 0x6?, 0x0?)
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/pkg/cri/nri/nri_api_linux.go:156 +0xdc
github.com/containerd/containerd/pkg/cri/server.(*criService).StartContainer(0xc0001e3b00, {0x58c98a692b78?, 0xc001a8a4b0}, 0xc00209c0a8)
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/pkg/cri/server/container_start.go:158 +0x150b
github.com/containerd/containerd/pkg/cri/instrument.(*instrumentedService).StartContainer(0xc0004f7330, {0x58c98a692b78?, 0xc001a8a270}, 0xc00209c0a8)
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/pkg/cri/instrument/instrumented_service.go:507 +0x1db
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_StartContainer_Handler.func1({0x58c98a692b78, 0xc001a8a270}, {0x58c98a5cdd40?, 0xc00209c0a8})
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/k8s.io/cri-api/pkg/apis/runtime/v1/api.pb.go:10863 +0x75
github.com/containerd/containerd/services/server.unaryNamespaceInterceptor({0x58c98a692b78, 0xc001a8a270}, {0x58c98a5cdd40, 0xc00209c0a8}, 0xc000124478?, 0xc00209c0c0)
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/services/server/namespace.go:31 +0x65
github.com/containerd/containerd/services/server.New.ChainUnaryServer.func5.1.1({0x58c98a692b78?, 0xc001a8a270?}, {0x58c98a5cdd40?, 0xc00209c0a8?})
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x37
github.com/grpc-ecosystem/go-grpc-prometheus.init.(*ServerMetrics).UnaryServerInterceptor.func3({0x58c98a692b78, 0xc001a8a270}, {0x58c98a5cdd40, 0xc00209c0a8}, 0xc0017e95b0?, 0xc00228c0c0)
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server_metrics.go:107 +0x83
github.com/containerd/containerd/services/server.New.ChainUnaryServer.func5.1.1({0x58c98a692b78?, 0xc001a8a270?}, {0x58c98a5cdd40?, 0xc00209c0a8?})
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x37
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1({0x58c98a692b78, 0xc001a8a1b0}, {0x58c98a5cdd40, 0xc00209c0a8}, 0xc00228c0a0, 0xc00228c0e0)
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go:376 +0x5cd
github.com/containerd/containerd/services/server.New.ChainUnaryServer.func5.1.1({0x58c98a692b78?, 0xc001a8a1b0?}, {0x58c98a5cdd40?, 0xc00209c0a8?})
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x37
github.com/containerd/containerd/services/server.New.ChainUnaryServer.func5({0x58c98a692b78, 0xc001a8a1b0}, {0x58c98a5cdd40, 0xc00209c0a8}, 0xc000f56a38?, 0x58c98a3f4400?)
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34 +0xb5
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_StartContainer_Handler({0x58c98a63a3a0?, 0xc0004f7330}, {0x58c98a692b78, 0xc001a8a1b0}, 0xc00149c070, 0xc0002000c0)
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/k8s.io/cri-api/pkg/apis/runtime/v1/api.pb.go:10865 +0x135
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00044a000, {0x58c98a69bb00, 0xc0018cc000}, 0xc001608000, 0xc000200d50, 0x58c98b3e2b28, 0x0)
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:1374 +0xde7
google.golang.org/grpc.(*Server).handleStream(0xc00044a000, {0x58c98a69bb00, 0xc0018cc000}, 0xc001608000, 0x0)
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:1751 +0x9e7
google.golang.org/grpc.(*Server).serveStreams.func1.1()
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:986 +0xbb
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 861
        /build/lakitu/tmp/portage/app-containers/containerd-1.7.13-r1/work/containerd-1.7.13/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:997 +0x145

This bug has two effects: the plugin is stuck (its Synchronize callback is never invoked), and containerd cannot process certain events (such as StartContainer). The only remedy appears to be restarting containerd.
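One general way to break this kind of cycle is to avoid holding the first lock while invoking a callback that takes the second. The sketch below illustrates only that idea; it is not the NRI code and not a proposed upstream patch, and the `registry` and `owner` types and their methods are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for the component that accepts plugins.
type registry struct {
	mu      sync.Mutex
	plugins []string
	syncFn  func(name string) error
}

// register records the plugin under the lock, then releases the lock
// before calling syncFn, so syncFn is free to take other locks.
func (r *registry) register(name string) error {
	r.mu.Lock()
	r.plugins = append(r.plugins, name)
	fn := r.syncFn
	r.mu.Unlock()

	return fn(name)
}

// count takes the registry lock, like an event path would.
func (r *registry) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.plugins)
}

// owner is a hypothetical stand-in for the component that provides syncFn.
type owner struct {
	mu sync.Mutex
	r  *registry
}

// syncPlugin takes the owner lock and then calls back into the registry.
// This would self-deadlock if register still held the registry lock.
func (o *owner) syncPlugin(name string) error {
	o.mu.Lock()
	defer o.mu.Unlock()
	fmt.Printf("synced %s, %d plugin(s) registered\n", name, o.r.count())
	return nil
}

func main() {
	r := &registry{}
	o := &owner{r: r}
	r.syncFn = o.syncPlugin

	if err := r.register("example"); err != nil {
		fmt.Println("sync failed:", err)
	}
}
```

The trade-off is that events can now be delivered between registration and synchronization, so a real fix would also have to decide how a not-yet-synchronized plugin is treated. Agreeing on a single acquisition order for both locks is the other common option.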

Labels: area/nri, dependencies, kind/bug
