Description
This seems to be new in 17.12.
docker-runc sometimes hangs and has to be killed before the rest of the system works properly again. Because we run Kubernetes, the most noticeable symptom for us is that kubelet on the host starts reporting PLEG timeouts. It appears that the docker-shim responsible for that runc process and its container stops responding.
We can still interact with docker for the most part, and I don't believe we see issues other than kubelet being unable to report container events. docker ps -a shows the container ID with status created (docker ps does not list it), but docker inspect on the container hangs. docker-containerd-ctr shows the container in the stopped state.
Another strange thing is that, so far, the affected container has always been Kubernetes' infra container. I have yet to see any other container left in this state.
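For anyone who wants to detect this state automatically, here is a minimal sketch of the check described above: list containers in the created state and see whether docker inspect returns within a timeout. The function names and the 10-second timeout are my own assumptions, not anything docker or kubelet provides, and a slow inspect is not necessarily the same hang.

```python
import subprocess


def command_hangs(argv, timeout_s=10.0):
    """Return True if the command does not exit within timeout_s seconds."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return True
    return False


def created_container_ids(timeout_s=10.0):
    """IDs of containers that `docker ps -a` reports with status created."""
    result = subprocess.run(
        ["docker", "ps", "-a", "-q", "--no-trunc", "--filter", "status=created"],
        stdout=subprocess.PIPE,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return result.stdout.split()


def hung_containers(timeout_s=10.0):
    """Created containers for which `docker inspect` does not return in time."""
    return [
        cid
        for cid in created_container_ids(timeout_s)
        if command_hangs(["docker", "inspect", cid], timeout_s)
    ]
```

Calling hung_containers() on an affected host should return the IDs that need a closer look; on a healthy host it is expected to return an empty list.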
Steps to reproduce the issue:
Can't reproduce it reliably, but it happens several times a day across 150 or so docker installations.
Describe the results you received:
docker-runc does not terminate, and docker inspect <cid> hangs.
Describe the results you expected:
docker-runc should terminate, and docker inspect <cid> should not hang.
Additional information you deem important (e.g. issue happens only occasionally):
Issue happens occasionally, but killing the hung docker-runc process restores the system.
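Since killing the hung process is the workaround, a rough way to find candidates is to list docker-runc processes that have been alive for longer than a normal runc invocation should take. The sketch below only reports them and does not kill anything. The 300-second threshold and the function names are assumptions for illustration, and a long-lived docker-runc is not proof of a hang.

```python
import subprocess


def parse_ps(output, name="docker-runc", min_age_s=300):
    """Return (pid, elapsed_seconds) for processes called `name` that are
    at least min_age_s seconds old, given `ps` output with the columns
    pid, etimes, comm and no header line."""
    found = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        pid, etimes, comm = parts
        if comm.strip() == name and int(etimes) >= min_age_s:
            found.append((int(pid), int(etimes)))
    return found


def long_running_runc(min_age_s=300):
    """Run ps and report docker-runc processes older than min_age_s."""
    # Separate -o options with a trailing "=" suppress the header line.
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "comm="],
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
    return parse_ps(result.stdout, min_age_s=min_age_s)
```

Each reported PID can then be checked with strace before deciding whether to kill it.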
Here is what strace of the stuck docker-runc looks like:
# strace -p 43620
strace: Process 43620 attached
openat(AT_FDCWD, "/var/run/docker/runtime-runc/moby/fc2da052c68f4f0a120184ad8eea49fec3e903dd6da9db848bb722f94fb25ba4/exec.fifo", O_RDONLY|O_CLOEXEC
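The trace stops inside openat() on exec.fifo with O_RDONLY. On Linux, a blocking read-only open of a FIFO does not return until some process opens the same FIFO for writing, so the trace is consistent with the writer side never showing up. Why the writer is missing is not something the trace tells us. A minimal sketch of that blocking behaviour, using a throwaway FIFO in a temporary directory (the function name and the delay are made up for illustration; this is not runc code):

```python
import os
import tempfile
import threading
import time


def fifo_open_blocks_until_writer(delay_s=0.5):
    """Open a FIFO read-only while a writer shows up delay_s seconds later.

    Returns (seconds spent blocked in open, byte read from the FIFO).
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "exec.fifo")
        os.mkfifo(path)

        def writer():
            time.sleep(delay_s)
            fd = os.open(path, os.O_WRONLY)  # unblocks the reader's open
            os.write(fd, b"0")
            os.close(fd)

        thread = threading.Thread(target=writer)
        thread.start()

        start = time.monotonic()
        # Same flags as in the strace output; blocks until a writer opens.
        fd = os.open(path, os.O_RDONLY | os.O_CLOEXEC)
        waited = time.monotonic() - start

        data = os.read(fd, 1)
        os.close(fd)
        thread.join()
        return waited, data
```

If the writer thread were never started, the os.open call would block indefinitely, which matches what the stuck process shows.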
Output of docker version:
Client:
 Version: 17.12.0-ce
 API version: 1.35
 Go version: go1.9.2
 Git commit: c97c6d6
 Built: Wed Dec 27 20:10:14 2017
 OS/Arch: linux/amd64
Server:
 Engine:
  Version: 17.12.0-ce
  API version: 1.35 (minimum version 1.12)
  Go version: go1.9.2
  Git commit: c97c6d6
  Built: Wed Dec 27 20:12:46 2017
  OS/Arch: linux/amd64
  Experimental: false
Output of docker info:
Containers: 15
Running: 13
Paused: 0
Stopped: 2
Images: 13
Server Version: 17.12.0-ce
Storage Driver: devicemapper
 Pool Name: docker-images
 Pool Blocksize: 65.54kB
 Base Device Size: 10.74GB
 Backing Filesystem: xfs
 Udev Sync Supported: true
 Data Space Used: 3.437GB
 Data Space Total: 797.8GB
 Data Space Available: 794.4GB
 Metadata Space Used: 8.884MB
 Metadata Space Total: 566.2MB
 Metadata Space Available: 557.3MB
 Thin Pool Minimum Free Space: 79.78GB
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.140-RHEL7 (2017-05-03)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 89623f28b87a6004d4b785663257362d1658a729
runc version: b2567b37d7b75eb4cf325b77297b140ea686ce8f
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
 selinux
Kernel Version: 4.14.13-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 56
Total Memory: 125.8GiB
Name: bnbvxk2
ID: L3JD:VTHD:IQKY:E5IJ:TA3F:PBKH:XYYQ:APCB:LWL3:SHVZ:QKMU:ACJ3
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
 techops.site=slukd1
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: true
Additional environment details (AWS, VirtualBox, physical, etc.):
Servers are all bare-metal running CentOS 7.4.1708 with kernel 4.14.11 to 4.14.13.