
containerd 1.3.0+ leaks an eventfd on every exec #3949

@sethp-nr

Description


Whenever a process is exec'd inside an existing container, we see an eventfd file descriptor leak:

# lsof -p 7039
COMMAND    PID USER   FD      TYPE             DEVICE SIZE/OFF    NODE NAME
container 7039 root  cwd       DIR               0,23      220     810 /run/containerd/io.containerd.runtime.v2.task/default/redis
container 7039 root  rtd       DIR              259,1     4096       2 /
container 7039 root  txt       REG              259,1 12168626  257197 /usr/local/bin/containerd-shim-runc-v1
container 7039 root    0r      CHR                1,3      0t0       6 /dev/null
container 7039 root    1w      CHR                1,3      0t0       6 /dev/null
container 7039 root    2w      CHR                1,3      0t0       6 /dev/null
...
container 7039 root   29u  a_inode               0,13        0   10502 [eventfd]
container 7039 root   34u  a_inode               0,13        0   10502 [eventfd]
container 7039 root   35u  a_inode               0,13        0   10502 [eventfd]
container 7039 root   36u  a_inode               0,13        0   10502 [eventfd]
container 7039 root   37u  a_inode               0,13        0   10502 [eventfd]
...

This issue is especially pronounced when using containerd to run Kubernetes pods that define an exec-based liveness or readiness probe. We have one such pod whose probe runs at a one-second interval, meaning we leak one file descriptor per second on that machine.

Steps to reproduce the issue:

  1. ctr run -d --runtime io.containerd.runc.v1 docker.io/library/redis:latest redis
  2. ctr t exec --exec-id foo redis echo

Output of containerd --version:

containerd github.com/containerd/containerd v1.3.0 36cf5b690dcc00ff0f34ff7799209050c3d0c59a

Any other relevant information:

We used the excellent kubectl-trace project to execute a small bpftrace program[1] to observe eventfd allocations from within the shim process:

$ kubectl trace run NODE_NAME -e 'tracepoint:syscalls:sys_enter_eventfd* /pid==7039/ { printf("%s\n", ustack(perf)); }'
trace 0a46a4b0-3273-11ea-babe-784f43872f68 created
$ k trace logs 0a46a4b0-3273-11ea-babe-784f43872f68
if your program has maps to print, send a SIGINT using Ctrl-C, if you want to interrupt the execution send SIGINT two times
Attaching 2 probes...

	475c2b syscall.RawSyscall+43 (/usr/local/bin/containerd-shim-runc-v1)
	672c80 github.com/containerd/containerd/vendor/github.com/containerd/cgroups.(*cgroup).OOMEventFD+336 (/usr/local/bin/containerd-shim-runc-v1)
	7e07d1 github.com/containerd/containerd/pkg/oom.(*Epoller).Add+161 (/usr/local/bin/containerd-shim-runc-v1)
	804ce2 github.com/containerd/containerd/runtime/v2/runc/v1.(*service).Start+434 (/usr/local/bin/containerd-shim-runc-v1)
	7c3bc5 github.com/containerd/containerd/runtime/v2/task.RegisterTaskService.func3+197 (/usr/local/bin/containerd-shim-runc-v1)
	787e74 github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.defaultServerInterceptor+68 (/usr/local/bin/containerd-shim-runc-v1)
	78ae48 github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serviceSet).dispatch+520 (/usr/local/bin/containerd-shim-runc-v1)
	78aa95 github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serviceSet).call+181 (/usr/local/bin/containerd-shim-runc-v1)
	78d0d0 github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*serverConn).run.func2+288 (/usr/local/bin/containerd-shim-runc-v1)
	45ac31 runtime.goexit+1 (/usr/local/bin/containerd-shim-runc-v1)

That call trace appears to indicate that exec'ing a process re-establishes the OOM monitor for the cgroup on every call. We believe this issue was introduced in 6bcbf88; previously, the code looked like this:

// Start a process
func (s *service) Start(ctx context.Context, r *taskAPI.StartRequest) (*taskAPI.StartResponse, error) {
...
	if s.getCgroup() == nil && p.Pid() > 0 {
		cg, err := cgroups.Load(cgroups.V1, cgroups.PidPath(p.Pid()))
		if err != nil {
			logrus.WithError(err).Errorf("loading cgroup for %d", p.Pid())
		}
		s.setCgroup(cg)
...
func (s *service) setCgroup(cg cgroups.Cgroup) {
	s.mu.Lock()
	s.cg = cg
	s.mu.Unlock()
	if err := s.ep.add(s.id, cg); err != nil {
		logrus.WithError(err).Error("add cg to OOM monitor")
	}
}

The net effect of the if s.getCgroup() == nil check was that only the first call to Start created an OOM monitor eventfd. Now, every call to Start re-adds the cgroup to the monitor:

// Start a process
func (s *service) Start(ctx context.Context, r *taskAPI.StartRequest) (*taskAPI.StartResponse, error) {
...
	if err := s.ep.Add(container.ID, container.Cgroup()); err != nil {
		logrus.WithError(err).Error("add cg to OOM monitor")
	}
...

Since the ep (the Epoller in the call trace above) only cleans up those OOM monitors when the cgroup is deleted, each extra Add leaks an eventfd for the lifetime of the container, which appears to be the cause of the leak.
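One way to address this would be to make the monitor registration idempotent per cgroup id, so a Start for an exec'd process becomes a no-op. This is a hedged sketch of that idea, not containerd's actual code: epoller and the open callback are hypothetical stand-ins for oom.Epoller and cgroup.OOMEventFD:

```go
package main

import (
	"fmt"
	"sync"
)

// epoller is a hypothetical stand-in for the shim's oom.Epoller, with an
// Add that is idempotent per cgroup id: a second Start for the same
// container (e.g. an exec) does not open a second OOM eventfd.
type epoller struct {
	mu  sync.Mutex
	ids map[string]bool // cgroup ids already being monitored
}

// Add registers the cgroup identified by id, invoking open (standing in
// for cgroup.OOMEventFD) only the first time an id is seen.
func (e *epoller) Add(id string, open func() (int, error)) error {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.ids[id] {
		return nil // already monitored: do not allocate another eventfd
	}
	fd, err := open()
	if err != nil {
		return err
	}
	_ = fd // the real Epoller would register fd with its epoll instance here
	e.ids[id] = true
	return nil
}

func main() {
	e := &epoller{ids: map[string]bool{}}
	opens := 0
	open := func() (int, error) { opens++; return 42, nil }
	// Simulate the initial Start plus two exec Starts for one container.
	for i := 0; i < 3; i++ {
		if err := e.Add("redis", open); err != nil {
			panic(err)
		}
	}
	fmt.Println("eventfds opened:", opens) // prints 1, not 3
}
```

An equivalent guard at the call site, such as only calling s.ep.Add when the request starts the init process rather than an exec, would have the same effect.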

[1] With a custom-built v1.3.0 containerd-shim produced by GOOS=linux GOARCH=amd64 make bin/containerd-shim-runc-v1 GODEBUG=1 to preserve symbols
