Description
Whenever a process is exec'd inside an existing container, we see an eventfd file descriptor leak:
# lsof -p 7039
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
container 7039 root cwd DIR 0,23 220 810 /run/containerd/io.containerd.runtime.v2.task/default/redis
container 7039 root rtd DIR 259,1 4096 2 /
container 7039 root txt REG 259,1 12168626 257197 /usr/local/bin/containerd-shim-runc-v1
container 7039 root 0r CHR 1,3 0t0 6 /dev/null
container 7039 root 1w CHR 1,3 0t0 6 /dev/null
container 7039 root 2w CHR 1,3 0t0 6 /dev/null
...
container 7039 root 29u a_inode 0,13 0 10502 [eventfd]
container 7039 root 34u a_inode 0,13 0 10502 [eventfd]
container 7039 root 35u a_inode 0,13 0 10502 [eventfd]
container 7039 root 36u a_inode 0,13 0 10502 [eventfd]
container 7039 root 37u a_inode 0,13 0 10502 [eventfd]
...
This issue is especially pronounced when using containerd to run Kubernetes pods that define an exec-based liveness or readiness probe. We have one such pod whose probe runs at a one-second interval, meaning we leak one file descriptor per second on that machine.
Steps to reproduce the issue:
ctr run -d --runtime io.containerd.runc.v1 docker.io/library/redis:latest redis
ctr t exec --exec-id foo redis echo
Output of containerd --version:
containerd github.com/containerd/containerd v1.3.0 36cf5b690dcc00ff0f34ff7799209050c3d0c59a
Any other relevant information:
We used the excellent kubectl-trace project to execute a small bpftrace program[1] to observe eventfd allocations from within the shim process:
$ kubectl trace run NODE_NAME -e 'tracepoint:syscalls:sys_enter_eventfd* /pid==7039/ { printf("%s\n", ustack(perf)); }'
trace 0a46a4b0-3273-11ea-babe-784f43872f68 created
$ k trace logs 0a46a4b0-3273-11ea-babe-784f43872f68
if your program has maps to print, send a SIGINT using Ctrl-C, if you want to interrupt the execution send SIGINT two times
Attaching 2 probes...
475c2b syscall.RawSyscall+43 (/usr/local/bin/containerd-shim-runc-v1)
672c80 github.com/containerd/containerd/vendor/github.com/containerd/cgroups.(*cgroup).OOMEventFD+336 (/usr/local/bin/containerd-shim-runc-v1)
7e07d1 github.com/containerd/containerd/pkg/oom.(*Epoller).Add+161 (/usr/local/bin/containerd-shim-runc-v1)
804ce2 github.com/containerd/containerd/runtime/v2/runc/v1.(*service).Start+434 (/usr/local/bin/containerd-shim-runc-v1)
7c3bc5 github.com/containerd/containerd/runtime/v2/task.RegisterTaskService.func3+197 (/usr/local/bin/containerd-shim-runc-v1)
787e74 github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.defaultServerInterceptor+68 (/usr/local/bin/containerd-shim-runc-v1)
78ae48 github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serviceSet).dispatch+520 (/usr/local/bin/containerd-shim-runc-v1)
78aa95 github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serviceSet).call+181 (/usr/local/bin/containerd-shim-runc-v1)
78d0d0 github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run.func2+288 (/usr/local/bin/containerd-shim-runc-v1)
45ac31 runtime.goexit+1 (/usr/local/bin/containerd-shim-runc-v1)
That call trace appears to indicate that exec'ing the process re-establishes the OOM monitor for the cgroup on every call. We believe this issue was introduced in 6bcbf88; previously, the code looked like this:
// Start a process
func (s *service) Start(ctx context.Context, r *taskAPI.StartRequest) (*taskAPI.StartResponse, error) {
	...
	if s.getCgroup() == nil && p.Pid() > 0 {
		cg, err := cgroups.Load(cgroups.V1, cgroups.PidPath(p.Pid()))
		if err != nil {
			logrus.WithError(err).Errorf("loading cgroup for %d", p.Pid())
		}
		s.setCgroup(cg)
	}
	...

func (s *service) setCgroup(cg cgroups.Cgroup) {
	s.mu.Lock()
	s.cg = cg
	s.mu.Unlock()
	if err := s.ep.add(s.id, cg); err != nil {
		logrus.WithError(err).Error("add cg to OOM monitor")
	}
}
The net effect of the if s.getCgroup() == nil check was that only the first call to Start created an OOM monitor eventfd. Now, every call to Start re-adds the cgroup to the monitor:
// Start a process
func (s *service) Start(ctx context.Context, r *taskAPI.StartRequest) (*taskAPI.StartResponse, error) {
	...
	if err := s.ep.Add(container.ID, container.Cgroup()); err != nil {
		logrus.WithError(err).Error("add cg to OOM monitor")
	}
	...
Since the ep (Epoller) only cleans up those OOM monitors when the cgroup is deleted, this appears to be what is causing the leak.
[1] Collected with a custom-built v1.3.0 containerd-shim produced by GOOS=linux GOARCH=amd64 make bin/containerd-shim-runc-v1 with GODEBUG=1 to preserve symbols