Description
I have created a Docker Swarm cluster consisting of 3 manager nodes. Some of the services did not start, with their containers stuck in the (health: starting) state:
[root@node-10-9-4-56 ~]# docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
rerbagvwsvlu xxx replicated 0/1 xxx
[root@node-10-9-4-56 ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a934cbeb39f1 xxx "/bin/sh -c bin/proc…" 2 hours ago Up 2 hours (health: starting) xxx.1.1fjc1l1h2r0jyedt8rp3jb9m9
Steps to reproduce the issue:
This has happened only once thus far, but what I did was:
- Create 3-node swarm (all managers)
- Deploy services
Describe the results you received:
One of the deployed services did not start successfully.
Describe the results you expected:
All services to start.
Additional information you deem important (e.g. issue happens only occasionally):
The container left some named pipes open in the /var/run/docker/containerd/[CONTAINER_ID] directory. When I read from the [SOME_ID]-stderr pipe, I got output that looks like the output of a failing healthcheck:
[root@node-10-9-4-56 ~]# ls -l /var/run/docker/containerd/a934cbeb39f114d4d79bbcec4f84c206420e9a9caf9907cc05b16013011329da/
total 0
prwx------ 1 root root 0 May 15 06:56 dd1b89ddd2bb9c3954e23345f5e3239bfa2447a89f311b1d6f20b6cd4c026485-stderr
prwx------ 1 root root 0 May 15 06:56 dd1b89ddd2bb9c3954e23345f5e3239bfa2447a89f311b1d6f20b6cd4c026485-stdout
prwx------ 1 root root 0 May 15 06:58 init-stderr
prwx------ 1 root root 0 May 15 06:58 init-stdout
[root@node-10-9-4-56 ~]# cat /var/run/docker/containerd/a934cbeb39f114d4d79bbcec4f84c206420e9a9caf9907cc05b16013011329da/dd1b89ddd2bb9c3954e23345f5e3239bfa2447a89f311b1d6f20b6cd4c026485-stderr
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port 8080: Connection refused
(this service uses curl for healthchecks). After reading from the pipe, the container started, the pipes disappeared, and the service was marked as running (it was then restarted on a different node due to an unrelated issue).
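For context, a healthcheck of roughly this shape would produce the curl output captured above. This is a reconstruction, not the actual service definition: only the probe against localhost port 8080 is taken from the error message; the interval, timeout, and retry values are placeholder assumptions:

```dockerfile
# Hypothetical healthcheck; only the localhost:8080 probe is known from
# the captured stderr, the timing options are illustrative defaults.
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD curl -f http://localhost:8080/ || exit 1
```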
Output of docker version:
Client:
Version: 17.12.1-ce
API version: 1.35
Go version: go1.9.4
Git commit: 7390fc6
Built: Tue Feb 27 22:15:20 2018
OS/Arch: linux/amd64
Server:
Engine:
Version: 17.12.1-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.4
Git commit: 7390fc6
Built: Tue Feb 27 22:17:54 2018
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
Containers: 12
Running: 8
Paused: 0
Stopped: 4
Images: 13
Server Version: 17.12.1-ce
Storage Driver: devicemapper
Pool Name: vg00-docker--pool
Pool Blocksize: 524.3kB
Base Device Size: 10.74GB
Backing Filesystem: xfs
Udev Sync Supported: true
Data Space Used: 7.73GB
Data Space Total: 21.47GB
Data Space Available: 13.75GB
Metadata Space Used: 1.704MB
Metadata Space Total: 218.1MB
Metadata Space Available: 216.4MB
Thin Pool Minimum Free Space: 2.147GB
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Library Version: 1.02.107-RHEL7 (2015-10-14)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: xxx
Is Manager: true
ClusterID: xxx
Managers: 3
Nodes: 3
Orchestration:
Task History Retention Limit: 2
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 10.9.4.180
Manager Addresses:
10.9.4.56:2377
10.9.4.180:2377
10.9.4.184:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-327.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 3
Total Memory: 5.671GiB
Name: node-10-9-4-157
ID: KL7C:T2D7:OO2D:UKII:4NDF:KM7X:IEMK:WOCF:NTRY:XT44:6554:ADUR
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 168
Goroutines: 337
System Time: 2018-05-15T09:07:01.265600845Z
EventsListeners: 9
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
docker-registry:5000
127.0.0.0/8
Registry Mirrors:
http://docker-registry:5001/
Live Restore Enabled: false
WARNING: bridge-nf-call-ip6tables is disabled
Additional environment details (AWS, VirtualBox, physical, etc.):
KVM virtual machines running with CentOS 7.2
From syslog:
May 15 06:54:03 localhost dockerd: time="2018-05-15T06:54:03.529245194Z" level=error msg="pulling image failed" error="pull access denied for web_ui, repository does not exist or may require 'docker login'" module=node/agent/taskmanager no
May 15 06:54:51 localhost dockerd: time="2018-05-15T06:54:51.835534703Z" level=warning msg="Health check for container a934cbeb39f114d4d79bbcec4f84c206420e9a9caf9907cc05b16013011329da error: context cancelled"
May 15 06:55:32 localhost dockerd: time="2018-05-15T06:55:32.837265478Z" level=error msg="stream copy error: reading from a closed fifo"
May 15 06:55:32 localhost dockerd: time="2018-05-15T06:55:32.837332467Z" level=error msg="stream copy error: reading from a closed fifo"
May 15 06:57:01 localhost dockerd: time="2018-05-15T06:57:01.230018663Z" level=error msg="stream copy error: reading from a closed fifo"
May 15 06:57:01 localhost dockerd: time="2018-05-15T06:57:01.230041441Z" level=error msg="stream copy error: reading from a closed fifo"
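The "stream copy error: reading from a closed fifo" lines, together with the fact that the container only proceeded once someone read the leftover *-stderr pipe, match ordinary named-pipe semantics: opening a FIFO for writing blocks until a reader opens the other end. A minimal sketch of that blocking behavior with a throwaway FIFO (no Docker involved; all paths here are temporary and hypothetical):

```shell
tmp=$(mktemp -d)
mkfifo "$tmp/demo-stderr"

# Writer subshell: the open() for writing blocks until a reader appears,
# so the echo cannot complete yet.
( echo "failing healthcheck output" > "$tmp/demo-stderr" ) &
writer=$!

sleep 1
# After a second, the writer is still alive, i.e. still blocked on open().
if kill -0 "$writer" 2>/dev/null; then blocked=yes; else blocked=no; fi

# Draining the pipe (what the manual cat on the node did) unblocks the writer.
out=$(cat "$tmp/demo-stderr")
wait "$writer"
rm -r "$tmp"
echo "writer was blocked: $blocked; pipe contents: $out"
```

If dockerd's reader side fails or is torn down (as the "closed fifo" errors suggest), whatever process is writing healthcheck output stays blocked the same way until some other reader drains the pipe.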