
containerd 1.3.0+ leaks an eventfd on every exec #3949

@sethp-nr

Description


Whenever a process is exec'd inside an existing container, we see an eventfd file descriptor leak:

# lsof -p 7039
COMMAND    PID USER   FD      TYPE             DEVICE SIZE/OFF    NODE NAME
container 7039 root  cwd       DIR               0,23      220     810 /run/containerd/io.containerd.runtime.v2.task/default/redis
container 7039 root  rtd       DIR              259,1     4096       2 /
container 7039 root  txt       REG              259,1 12168626  257197 /usr/local/bin/containerd-shim-runc-v1
container 7039 root    0r      CHR                1,3      0t0       6 /dev/null
container 7039 root    1w      CHR                1,3      0t0       6 /dev/null
container 7039 root    2w      CHR                1,3      0t0       6 /dev/null
...
container 7039 root   29u  a_inode               0,13        0   10502 [eventfd]
container 7039 root   34u  a_inode               0,13        0   10502 [eventfd]
container 7039 root   35u  a_inode               0,13        0   10502 [eventfd]
container 7039 root   36u  a_inode               0,13        0   10502 [eventfd]
container 7039 root   37u  a_inode               0,13        0   10502 [eventfd]
...

This issue is especially pronounced when using containerd to run Kubernetes pods that define an exec-based liveness or readiness probe. We have one such pod whose probe runs at a one-second interval, meaning we leak one file descriptor per second on that machine.

Steps to reproduce the issue:

  1. ctr run -d --runtime io.containerd.runc.v1 docker.io/library/redis:latest redis
  2. ctr t exec --exec-id foo redis echo

Output of containerd --version:

containerd github.com/containerd/containerd v1.3.0 36cf5b690dcc00ff0f34ff7799209050c3d0c59a

Any other relevant information:

We used the excellent kubectl-trace project to execute a small bpftrace program[1] to observe eventfd allocations from within the shim process:

$ kubectl trace run NODE_NAME -e 'tracepoint:syscalls:sys_enter_eventfd* /pid==7039/ { printf("%s\n", ustack(perf)); }'
trace 0a46a4b0-3273-11ea-babe-784f43872f68 created
$ k trace logs 0a46a4b0-3273-11ea-babe-784f43872f68
if your program has maps to print, send a SIGINT using Ctrl-C, if you want to interrupt the execution send SIGINT two times
Attaching 2 probes...

	475c2b syscall.RawSyscall+43 (/usr/local/bin/containerd-shim-runc-v1)
	672c80 github.com/containerd/containerd/vendor/github.com/containerd/cgroups.(*cgroup).OOMEventFD+336 (/usr/local/bin/containerd-shim-runc-v1)
	7e07d1 github.com/containerd/containerd/pkg/oom.(*Epoller).Add+161 (/usr/local/bin/containerd-shim-runc-v1)
	804ce2 github.com/containerd/containerd/runtime/v2/runc/v1.(*service).Start+434 (/usr/local/bin/containerd-shim-runc-v1)
	7c3bc5 github.com/containerd/containerd/runtime/v2/task.RegisterTaskService.func3+197 (/usr/local/bin/containerd-shim-runc-v1)
	787e74 github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.defaultServerInterceptor+68 (/usr/local/bin/containerd-shim-runc-v1)
	78ae48 github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serviceSet).dispatch+520 (/usr/local/bin/containerd-shim-runc-v1)
	78aa95 github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serviceSet).call+181 (/usr/local/bin/containerd-shim-runc-v1)
	78d0d0 github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run.func2+288 (/usr/local/bin/containerd-shim-runc-v1)
	45ac31 runtime.goexit+1 (/usr/local/bin/containerd-shim-runc-v1)

That call trace appears to indicate that exec'ing a process re-establishes the OOM monitor for the cgroup on every call. We believe this issue was introduced in 6bcbf88; previously, the code looked like this:

// Start a process
func (s *service) Start(ctx context.Context, r *taskAPI.StartRequest) (*taskAPI.StartResponse, error) {
...
	if s.getCgroup() == nil && p.Pid() > 0 {
		cg, err := cgroups.Load(cgroups.V1, cgroups.PidPath(p.Pid()))
		if err != nil {
			logrus.WithError(err).Errorf("loading cgroup for %d", p.Pid())
		}
		s.setCgroup(cg)
...
func (s *service) setCgroup(cg cgroups.Cgroup) {
	s.mu.Lock()
	s.cg = cg
	s.mu.Unlock()
	if err := s.ep.add(s.id, cg); err != nil {
		logrus.WithError(err).Error("add cg to OOM monitor")
	}
}

The net effect of the if s.getCgroup() == nil check was that only the first call to Start created an OOM monitor eventfd. Now, every call to Start re-adds the cgroup to the monitor:

// Start a process
func (s *service) Start(ctx context.Context, r *taskAPI.StartRequest) (*taskAPI.StartResponse, error) {
...
	if err := s.ep.Add(container.ID, container.Cgroup()); err != nil {
		logrus.WithError(err).Error("add cg to OOM monitor")
	}
...

Since the ep (the Epoller in the call trace above) only cleans up those OOM monitors when the cgroup is deleted, each extra Add leaks an eventfd for the lifetime of the container, which appears to be the cause of the leak.
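One way to address this would be to make the monitor registration idempotent per cgroup id, so a Start for an exec'd process becomes a no-op. This is a hedged sketch of that idea, not containerd's actual code: epoller and the open callback are hypothetical stand-ins for oom.Epoller and cgroup.OOMEventFD:

```go
package main

import (
	"fmt"
	"sync"
)

// epoller is a hypothetical stand-in for the shim's oom.Epoller, with an
// Add that is idempotent per cgroup id: a second Start for the same
// container (e.g. an exec) does not open a second OOM eventfd.
type epoller struct {
	mu  sync.Mutex
	ids map[string]bool // cgroup ids already being monitored
}

// Add registers the cgroup identified by id, invoking open (standing in
// for cgroup.OOMEventFD) only the first time an id is seen.
func (e *epoller) Add(id string, open func() (int, error)) error {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.ids[id] {
		return nil // already monitored: do not allocate another eventfd
	}
	fd, err := open()
	if err != nil {
		return err
	}
	_ = fd // the real Epoller would register fd with its epoll instance here
	e.ids[id] = true
	return nil
}

func main() {
	e := &epoller{ids: map[string]bool{}}
	opens := 0
	open := func() (int, error) { opens++; return 42, nil }
	// Simulate the initial Start plus two exec Starts for one container.
	for i := 0; i < 3; i++ {
		if err := e.Add("redis", open); err != nil {
			panic(err)
		}
	}
	fmt.Println("eventfds opened:", opens) // prints 1, not 3
}
```

An equivalent guard at the call site, such as only calling s.ep.Add when the request starts the init process rather than an exec, would have the same effect.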

[1] With a custom-built v1.3.0 containerd-shim produced by GOOS=linux GOARCH=amd64 make bin/containerd-shim-runc-v1 GODEBUG=1 to preserve symbols
