Inconsistent state on pod termination

**Description**

We just had an issue with containerd: an application was killed several times by the OOM killer because it reached its cgroup memory limit. Containers on the host are now in a really weird state:
- they look fine according to `crictl ps`
- `crictl exec` fails with `cannot exec in a stopped state: unknown`
- `ctr -n k8s.io t ls` hangs without any output
- `ps auxf` shows many `containerd-shim` processes without any child process (or sometimes with only the pause container)
- `runc --root /run/containerd/runc/k8s.io list` shows some containers in `stopped` state
- the `containerd-shim` process associated with each of those stopped containers is still running without any child
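
For anyone trying to confirm the same symptom on another host, here is a rough sketch (not part of the original investigation) that lists `containerd-shim` processes with no child process. The function names `childless_shims` and `read_process_table` are made up for illustration. It assumes a Linux `ps` that accepts `-eo pid=,ppid=,comm=` and that the shim shows up under the command name `containerd-shim`.

```python
import subprocess


def childless_shims(rows, shim_name="containerd-shim"):
    """Return the sorted PIDs of shim processes that are nobody's parent.

    rows is an iterable of (pid, ppid, comm) tuples.
    """
    rows = list(rows)
    # Every PID that appears as a parent has at least one child.
    parents = {ppid for _, ppid, _ in rows}
    return sorted(
        pid for pid, _, comm in rows
        if comm == shim_name and pid not in parents
    )


def read_process_table():
    """Read (pid, ppid, comm) for every process from ps (assumed available)."""
    out = subprocess.run(
        ["ps", "-eo", "pid=,ppid=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = []
    for line in out.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3:
            rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    return rows

# On the affected host: print(childless_shims(read_process_table()))
```

A shim whose only child is the pause container still has a child, so it is not reported by this sketch; it only catches the fully orphaned shims.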

It seems that sometimes, when a container process is OOM-killed because it has reached its cgroup memory limit, the containerd state becomes inconsistent. Once this has happened, it is no longer possible to delete containers. When trying to delete a pod, the containerd logs show:
- containerd tries to stop it (StopContainer)
- stop container xx timed out
- then `error="an error occurs during waiting for container xxx to stop: wait container xxx is cancelled"`
- the container is stopped but not removed

**Steps to reproduce the issue:**
1. Run Kubernetes with containerd as the CRI runtime
2. Create a pod with a memory limit
3. In that pod, allocate more memory than the limit
4. After several OOM kills, it should no longer be possible to interact with containerd
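
Step 3 can be done with any program that keeps allocating. A minimal sketch, assuming a container image with Python available; the function name `allocate` and its parameters are hypothetical and this is not the application that triggered the original problem:

```python
def allocate(chunk_mib=10, max_chunks=None):
    """Allocate memory in chunks and keep every chunk referenced.

    With max_chunks=None the loop never ends on its own, so inside a
    container with a memory limit it runs until the OOM killer stops it.
    Returns the total number of bytes held when max_chunks is reached.
    """
    chunks = []
    while max_chunks is None or len(chunks) < max_chunks:
        # Non-zero fill, so the pages should actually be written and
        # counted against the cgroup memory limit.
        chunks.append(b"x" * (chunk_mib * 1024 * 1024))
    return sum(len(c) for c in chunks)

# Inside the pod: allocate()  # runs until the container is OOM-killed
```

If the pod's restart policy restarts the container, each restart should end in another OOM kill, which matches the "several OOM kills" in step 4.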

**Describe the results you received:**
containerd seems to be stuck in an inconsistent state and is no longer able to fulfill CRI requests

**Describe the results you expected:**
containerd should clean up OOM-killed containers and remain consistent

**Output of `containerd --version`:**
```
containerd --version
containerd github.com/containerd/containerd v1.1.0 209a7fc3e4a32ef71a8c7b50c68fc8398415badf
```



Inconsistent state on pod termination #2438
