This repository was archived by the owner on Mar 9, 2022. It is now read-only.

CRI stops receiving events, causes timeouts in StopContainer and StopSandboxContainer #1427

@mmoriarity-stripe

# ctr version
Client:
  Version:  1.3.3
  Revision: d76c121f76a5fc8a462dc64594aea72fe18e1178

Server:
  Version:  1.3.3
  Revision: d76c121f76a5fc8a462dc64594aea72fe18e1178
  UUID: 32b2cc77-d405-4038-81f6-daf6752ab018
# kubelet --version
Kubernetes v1.17.

We've been seeing an issue where a single node gets stuck: pods from CronJobs that should have terminated long ago still show as "Running" in Kubernetes. When I look on the machine, the processes are dead, and the containerd task for the container is STOPPED.

We were finally able to catch this happening and take time to investigate, and we're seeing lots of log messages like this from containerd:

time="2020-03-27T21:45:13.132058945Z" level=error msg="StopContainer for "94ade1615964afc99b3135f6aaefb4593da0ebc3357f674ce3181ccc569cd515" failed" error="rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container "94ade1615964afc99b3135f6aaefb4593da0ebc3357f674ce3181ccc569cd515" to be killed: wait container "94ade1615964afc99b3135f6aaefb4593da0ebc3357f674ce3181ccc569cd515": context deadline exceeded"

It seems like some of our users noticed that their jobs still appeared to be running (and, as far as they could tell, were stuck) and tried to kill them. containerd then seems to have gone into a loop of trying to stop the containers for those jobs and timing out every time.

I think these calls start failing because, for some reason, the CRI service is no longer receiving events from containerd. If I search my logs for TaskExit, I see lots of messages right up until the issue started, at which point they drop off completely. Without those events, the CRI service can't update the status of the container, and stopping the container relies on waiting for that status to update.
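To make the suspected failure mode concrete, here is a minimal Go sketch of the pattern I'm describing. It is not the actual CRI plugin code: the `container` type, `handleTaskExit`, `waitStopped`, `stopContainer`, and `demo` are all hypothetical names made up for illustration. The only assumption it encodes is the one above, namely that the stop path waits (with a deadline) for a status change that only an exit event can trigger.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// container is a hypothetical stand-in for the CRI service's view of a container.
type container struct {
	id      string
	once    sync.Once
	stopped chan struct{} // closed once an exit event has been handled
}

func newContainer(id string) *container {
	return &container{id: id, stopped: make(chan struct{})}
}

// handleTaskExit models the event handler. In this sketch it is the only
// code path that ever marks the container as stopped.
func (c *container) handleTaskExit() {
	c.once.Do(func() { close(c.stopped) })
}

// waitStopped blocks until the status is updated or the context expires.
func (c *container) waitStopped(ctx context.Context) error {
	select {
	case <-c.stopped:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("wait container %q: %w", c.id, ctx.Err())
	}
}

// stopContainer models a stop request. Killing the process is assumed to
// have succeeded already; all that remains is waiting for the status update.
func stopContainer(c *container, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return c.waitStopped(ctx)
}

// isDeadline reports whether err was caused by an expired deadline.
func isDeadline(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

// demo stops one container whose exit event is lost and one whose exit
// event is delivered, and returns both results.
func demo() (lost error, delivered error) {
	stuck := newContainer("stuck")
	lost = stopContainer(stuck, 50*time.Millisecond) // no event ever arrives

	healthy := newContainer("healthy")
	go func() {
		time.Sleep(10 * time.Millisecond)
		healthy.handleTaskExit() // event arrives before the deadline
	}()
	delivered = stopContainer(healthy, time.Second)
	return lost, delivered
}
```

In this sketch, the container whose event never arrives fails with a wrapped `context deadline exceeded` error no matter how many times the stop is retried, even though nothing is actually left to kill. That matches the shape of the error in the log line above, but the real code path may differ.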

Happy to provide more information if you need it! This can be really frustrating for our users, as it's hard for us to notice before it's caused downstream negative effects for them.
