```
# ctr version
Client:
  Version:  1.3.3
  Revision: d76c121f76a5fc8a462dc64594aea72fe18e1178
Server:
  Version:  1.3.3
  Revision: d76c121f76a5fc8a462dc64594aea72fe18e1178
  UUID: 32b2cc77-d405-4038-81f6-daf6752ab018

# kubelet --version
Kubernetes v1.17.
```
We've been seeing an issue where a single node gets stuck in a state in which pods from CronJobs that should have terminated long ago still show as "Running" in Kubernetes. When I look on the machine, the processes are dead and the containerd task for the container is STOPPED.
We were finally able to catch this happening and take time to investigate, and we're seeing lots of log messages like this from containerd:
```
time="2020-03-27T21:45:13.132058945Z" level=error msg="StopContainer for \"94ade1615964afc99b3135f6aaefb4593da0ebc3357f674ce3181ccc569cd515\" failed" error="rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"94ade1615964afc99b3135f6aaefb4593da0ebc3357f674ce3181ccc569cd515\" to be killed: wait container \"94ade1615964afc99b3135f6aaefb4593da0ebc3357f674ce3181ccc569cd515\": context deadline exceeded"
```
It seems that some of our users noticed their jobs appeared stuck (as far as they could tell) and tried to kill them. containerd then went into a loop trying to stop the containers for those jobs, timing out every time.
I think the reason these stops fail is that, for some reason, the CRI service is no longer receiving events from containerd. If I search my logs for TaskExit, I can see lots of messages right up until the issue started, at which point they drop off completely. Without those events, the CRI service can't update the status of the container, and stopping a container relies on waiting for that status to update.
Happy to provide more information if you need it! This is really frustrating for our users, as it's hard for us to notice the problem before it has caused downstream effects for them.