```
# ctr version
Client:
  Version:  1.3.3
  Revision: d76c121f76a5fc8a462dc64594aea72fe18e1178
Server:
  Version:  1.3.3
  Revision: d76c121f76a5fc8a462dc64594aea72fe18e1178
  UUID: 32b2cc77-d405-4038-81f6-daf6752ab018

# kubelet --version
Kubernetes v1.17.
```
We've been seeing an issue where a single node gets stuck in a state in which pods from CronJobs that should have terminated long ago still show as "Running" in Kubernetes. When I look on the machine, the processes are dead and the containerd task for the container is STOPPED.
We were finally able to catch this happening and take time to investigate, and we're seeing lots of log messages like this from containerd:
```
time="2020-03-27T21:45:13.132058945Z" level=error msg="StopContainer for \"94ade1615964afc99b3135f6aaefb4593da0ebc3357f674ce3181ccc569cd515\" failed" error="rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"94ade1615964afc99b3135f6aaefb4593da0ebc3357f674ce3181ccc569cd515\" to be killed: wait container \"94ade1615964afc99b3135f6aaefb4593da0ebc3357f674ce3181ccc569cd515\": context deadline exceeded"
```
It seems that some of our users noticed their jobs appeared stuck (as far as they could tell) and tried to kill them. containerd then went into a loop trying to stop the containers for those jobs, timing out every time.
I think the reason these stops fail is that, for some reason, the CRI service is no longer receiving events from containerd. If I search my logs for TaskExit, I can see lots of messages right up until the issue started, at which point they drop off completely. Without those events, the CRI service can't update the status of the container, and stopping a container relies on waiting for that status to update.
Happy to provide more information if you need it! This is really frustrating for our users, as it's hard for us to notice the problem before it has caused downstream effects for them.