docker container stuck in starting state #37067

@kostrzewa9ld

I created a Docker Swarm cluster consisting of 3 manager nodes. Some of the services did not start; their containers were stuck in the (health: starting) state:

[root@node-10-9-4-56 ~]# docker service ls
ID                  NAME                      MODE                REPLICAS          IMAGE          PORTS
rerbagvwsvlu        xxx                       replicated          0/1               xxx

[root@node-10-9-4-56 ~]# docker ps
CONTAINER ID        IMAGE              COMMAND                  CREATED             STATUS                           PORTS          NAMES
a934cbeb39f1        xxx                "/bin/sh -c bin/proc…"   2 hours ago         Up 2 hours (health: starting)                   xxx.1.1fjc1l1h2r0jyedt8rp3jb9m9
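
For anyone triaging a similar case, the health state that `docker ps` summarizes can also be queried directly with `docker inspect`. This is a minimal sketch: the function names `health_status` and `health_log` are hypothetical helpers, and it assumes the container defines a healthcheck (otherwise the `.State.Health` field is empty and the template fails).

```shell
# Hypothetical helpers wrapping `docker inspect`; only the docker CLI is assumed.

# health_status CONTAINER: print the recorded health status
# (one of: starting, healthy, unhealthy).
health_status() {
  docker inspect --format '{{.State.Health.Status}}' "$1"
}

# health_log CONTAINER: print the recorded healthcheck probe results as JSON.
health_log() {
  docker inspect --format '{{json .State.Health.Log}}' "$1"
}
```

Calling `health_status a934cbeb39f1` on the affected node would show whether Docker ever recorded a completed probe for the stuck container.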

Steps to reproduce the issue:
This has happened only once so far, but what I did was:

  1. Create 3-node swarm (all managers)
  2. Deploy services

Describe the results you received:
One of the deployed services did not start successfully.

Describe the results you expected:
All services to start.

Additional information you deem important (e.g. issue happens only occasionally):
The container left some named pipes open in the /var/run/docker/containerd/[CONTAINER_ID] directory. When I read from the [SOME_ID]-stderr pipe, I got what looks like the output of a failing healthcheck:

[root@node-10-9-4-56 ~]# ls -l /var/run/docker/containerd/a934cbeb39f114d4d79bbcec4f84c206420e9a9caf9907cc05b16013011329da/
total 0
prwx------ 1 root root 0 May 15 06:56 dd1b89ddd2bb9c3954e23345f5e3239bfa2447a89f311b1d6f20b6cd4c026485-stderr
prwx------ 1 root root 0 May 15 06:56 dd1b89ddd2bb9c3954e23345f5e3239bfa2447a89f311b1d6f20b6cd4c026485-stdout
prwx------ 1 root root 0 May 15 06:58 init-stderr
prwx------ 1 root root 0 May 15 06:58 init-stdout

[root@node-10-9-4-56 ~]# cat /var/run/docker/containerd/a934cbeb39f114d4d79bbcec4f84c206420e9a9caf9907cc05b16013011329da/dd1b89ddd2bb9c3954e23345f5e3239bfa2447a89f311b1d6f20b6cd4c026485-stderr
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to localhost port 8080: Connection refused

(This service uses curl for its healthchecks.) After I read from the pipe, the container started, the pipes disappeared, and the service was marked as running. It was later restarted on a different node because of an unrelated issue.
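
The fact that reading from the pipe unblocked the container matches general named-pipe (FIFO) behavior: a process that opens a FIFO for writing blocks until some reader opens the other end. The sketch below shows only that general mechanism. It is not a confirmed diagnosis of what dockerd or containerd was doing, and the pipe name `demo-stderr` is a hypothetical stand-in for the `-stderr` pipes listed above.

```shell
#!/bin/sh
# Illustration of FIFO blocking: the writer cannot proceed until a reader appears.
dir=$(mktemp -d)
fifo="$dir/demo-stderr"   # hypothetical name, mimicking the "-stderr" pipes
mkfifo "$fifo"

# Background writer: opening the FIFO for writing blocks until it is opened for reading.
( echo "healthcheck output" > "$fifo"; echo "writer finished" > "$dir/done" ) &

# Give the writer time to start; it should still be blocked with no reader attached.
sleep 1
if [ -e "$dir/done" ]; then before="finished"; else before="blocked"; fi
echo "before read: writer is $before"

# Reading from the pipe (like the `cat` in the report) lets the writer continue.
content=$(cat "$fifo")
wait
echo "read from pipe: $content"

if [ -e "$dir/done" ]; then after="finished"; else after="blocked"; fi
echo "after read: writer is $after"

rm -rf "$dir"
```

The writer is reported as blocked before the `cat` and finished after it, which is the same pattern as the stuck healthcheck resuming once its stderr pipe was read.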

Output of docker version:

Client:
 Version:	17.12.1-ce
 API version:	1.35
 Go version:	go1.9.4
 Git commit:	7390fc6
 Built:	Tue Feb 27 22:15:20 2018
 OS/Arch:	linux/amd64

Server:
 Engine:
  Version:	17.12.1-ce
  API version:	1.35 (minimum version 1.12)
  Go version:	go1.9.4
  Git commit:	7390fc6
  Built:	Tue Feb 27 22:17:54 2018
  OS/Arch:	linux/amd64
  Experimental:	false

Output of docker info:

Containers: 12
 Running: 8
 Paused: 0
 Stopped: 4
Images: 13
Server Version: 17.12.1-ce
Storage Driver: devicemapper
 Pool Name: vg00-docker--pool
 Pool Blocksize: 524.3kB
 Base Device Size: 10.74GB
 Backing Filesystem: xfs
 Udev Sync Supported: true
 Data Space Used: 7.73GB
 Data Space Total: 21.47GB
 Data Space Available: 13.75GB
 Metadata Space Used: 1.704MB
 Metadata Space Total: 218.1MB
 Metadata Space Available: 216.4MB
 Thin Pool Minimum Free Space: 2.147GB
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2015-10-14)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: xxx
 Is Manager: true
 ClusterID: xxx
 Managers: 3
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 2
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.9.4.180
 Manager Addresses:
  10.9.4.56:2377
  10.9.4.180:2377
  10.9.4.184:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-327.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 3
Total Memory: 5.671GiB
Name: node-10-9-4-157
ID: KL7C:T2D7:OO2D:UKII:4NDF:KM7X:IEMK:WOCF:NTRY:XT44:6554:ADUR
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 168
 Goroutines: 337
 System Time: 2018-05-15T09:07:01.265600845Z
 EventsListeners: 9
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 docker-registry:5000
 127.0.0.0/8
Registry Mirrors:
 http://docker-registry:5001/
Live Restore Enabled: false

WARNING: bridge-nf-call-ip6tables is disabled

Additional environment details (AWS, VirtualBox, physical, etc.):
KVM virtual machines running with CentOS 7.2

From syslog:

May 15 06:54:03 localhost dockerd: time="2018-05-15T06:54:03.529245194Z" level=error msg="pulling image failed" error="pull access denied for web_ui, repository does not exist or may require 'docker login'" module=node/agent/taskmanager no
May 15 06:54:51 localhost dockerd: time="2018-05-15T06:54:51.835534703Z" level=warning msg="Health check for container a934cbeb39f114d4d79bbcec4f84c206420e9a9caf9907cc05b16013011329da error: context cancelled"
May 15 06:55:32 localhost dockerd: time="2018-05-15T06:55:32.837265478Z" level=error msg="stream copy error: reading from a closed fifo"
May 15 06:55:32 localhost dockerd: time="2018-05-15T06:55:32.837332467Z" level=error msg="stream copy error: reading from a closed fifo"
May 15 06:57:01 localhost dockerd: time="2018-05-15T06:57:01.230018663Z" level=error msg="stream copy error: reading from a closed fifo"
May 15 06:57:01 localhost dockerd: time="2018-05-15T06:57:01.230041441Z" level=error msg="stream copy error: reading from a closed fifo"
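
To see how often each warning or error occurs in a longer log, the dockerd lines can be grouped by their `msg` field. This is a rough sketch: `summarize_dockerd_log` is a hypothetical helper name, and it assumes the `level=... msg="..."` line format shown above.

```shell
# summarize_dockerd_log: read dockerd log lines on stdin and print a count
# per distinct msg="..." value, for warning and error lines only.
summarize_dockerd_log() {
  grep -E 'level=(warning|error)' \
    | sed -E 's/.*msg="([^"]*)".*/\1/' \
    | sort \
    | uniq -c \
    | sort -rn
}
```

On a CentOS 7 host where dockerd logs to syslog, this could be run as `summarize_dockerd_log < /var/log/messages`; the log path is an assumption and depends on the syslog configuration.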
