Description
We just had an issue with containerd: an application was killed several times by the OOM killer because it reached its cgroup memory limit. Containers on the host are now in a really weird state:
- ok according to `crictl ps`
- `crictl exec` fails with `cannot exec in a stopped state: unknown`
- `ctr -n k8s.io t ls` hangs without any output
- `ps auxf` shows many `containerd-shim` processes without any child process (or sometimes only the pause container)
- `runc --root /run/containerd/runc/k8s.io list` shows some containers in `stopped` state
- the associated `containerd-shim` process is still running without any child
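The `ps auxf` symptom can be checked mechanically. The following is only a rough sketch, not part of containerd or its tooling: the helper names `childless_shims` and `read_ps` are made up for illustration, and it assumes a Linux `ps` that accepts `-o pid= -o ppid= -o comm=`. It lists the shim processes that currently have no child process.

```python
import subprocess


def childless_shims(ps_rows, shim_name="containerd-shim"):
    """Return sorted PIDs of shim processes that are not the parent of any process.

    ps_rows is an iterable of (pid, ppid, comm) tuples.
    shim_name is matched as a prefix of the command name.
    """
    rows = list(ps_rows)
    # Every PID that appears as a parent has at least one child.
    parents = {ppid for _, ppid, _ in rows}
    return sorted(
        pid for pid, _, comm in rows
        if comm.startswith(shim_name) and pid not in parents
    )


def read_ps():
    """Collect (pid, ppid, comm) for every process on the host using ps."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = []
    for line in out.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3:
            rows.append((int(parts[0]), int(parts[1]), parts[2]))
    return rows
```

Calling `childless_shims(read_ps())` on an affected host should return the PIDs of the orphaned shims; shims whose only child is the pause container are not reported by this sketch.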
It seems that sometimes, when a container process is OOM-killed because it has reached its cgroup memory limit, the containerd state becomes inconsistent. Once this has happened, it is no longer possible to delete containers. When trying to delete a pod, the containerd logs show:
- containerd tries to stop it (StopContainer)
- stop container xx timed out
- then error="an error occurs during waiting for container xxx to stop: wait container xxx is cancelled"
- the container is stopped but not removed
Steps to reproduce the issue:
- Run Kubernetes using containerd as the CRI runtime
- Create a pod with a memory limit
- Allocate more memory than the limit
- After several OOM kills, it should no longer be possible to interact with containerd
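For the "allocate more memory than the limit" step, something like the sketch below could be run inside the pod. It is not the application that triggered the issue; `allocate` and its parameters are hypothetical names chosen for illustration.

```python
def allocate(chunk_mib, max_chunks):
    """Hold on to max_chunks buffers of chunk_mib MiB each; return total MiB held.

    Each buffer is filled with a non-zero byte so the pages are actually
    written and count toward the container's cgroup memory usage.
    """
    held = []
    for _ in range(max_chunks):
        held.append(b"\x01" * (chunk_mib * 1024 * 1024))
    return len(held) * chunk_mib
```

If `chunk_mib * max_chunks` exceeds the pod's memory limit, the process is expected to be OOM-killed before the function returns; with a restart policy that restarts the container, this repeats the kill several times.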
Describe the results you received:
containerd seems to be stuck in an inconsistent state and is no longer able to fulfill CRI requests
Describe the results you expected:
containerd should clean up OOM-killed containers and remain consistent
Output of containerd --version:
```
$ containerd --version
containerd github.com/containerd/containerd v1.1.0 209a7fc3e4a32ef71a8c7b50c68fc8398415badf
```