Zombie tasks on 17.11 #35594

@zigmund

Description

On service update, restart, etc., some old tasks stay semi-alive:

zigmund@docker-m1.alahd.kz.dev:~$ docker service ps stack-iagent-bugfix-iag-808-landing_app-cli 
ID                  NAME                                                IMAGE                                                                                                                                  NODE                     DESIRED STATE       CURRENT STATE            ERROR                              PORTS
nmup4588wtny        stack-iagent-bugfix-iag-808-landing_app-cli.1       registry:5000/stack/iagent/app@sha256:29255041db3e95d8159447a4fdbe3a111b3857309296365c427e303b3de81726   docker-w4.alahd.kz.dev   Running             Running 9 minutes ago                                       
yut8tf0hyfat         \_ stack-iagent-bugfix-iag-808-landing_app-cli.1   registry:5000/stack/iagent/app@sha256:29255041db3e95d8159447a4fdbe3a111b3857309296365c427e303b3de81726   docker-w1.alahd.kz.dev   Shutdown            Rejected 9 minutes ago   "Failed joining stack-iagent-b…"   
i2ogh9fo27ou         \_ stack-iagent-bugfix-iag-808-landing_app-cli.1   registry:5000/stack/iagent/app@sha256:29255041db3e95d8159447a4fdbe3a111b3857309296365c427e303b3de81726   docker-w2.alahd.kz.dev   Shutdown            Shutdown 9 minutes ago                                      
ruj69i7cqpi5         \_ stack-iagent-bugfix-iag-808-landing_app-cli.1   registry:5000/stack/iagent/app@sha256:29255041db3e95d8159447a4fdbe3a111b3857309296365c427e303b3de81726   docker-w4.alahd.kz.dev   Shutdown            Rejected 3 hours ago     "Failed joining stack-iagent-b…"

Note the task ID i2ogh9fo27ou and look for its container on the worker:

zigmund@docker-w2.alahd.kz.dev:~$ docker ps | grep i2ogh9fo27ou
c28c0b79670c        registry:5000/stack/iagent/app                                "/bin/sh -c ${ENTRYP…"   12 minutes ago      Up 11 minutes                                           stack-iagent-bugfix-iag-808-landing_app-cli.1.i2ogh9fo27ouxxjdqw5gom21c
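
The lookup above can be scripted. This is a minimal sketch, not something from the original report: the function name `container_for_task` is hypothetical, and it assumes swarm container names keep the `<service>.<slot>.<task-id>` shape seen in the `docker ps` output above.

```shell
# Hypothetical helper: print the container ID whose swarm task ID starts
# with the given prefix.
# stdin: one "<container-id> <container-name>" pair per line, e.g. from
#   docker ps --format '{{.ID}} {{.Names}}'
# $1:    full or truncated task ID as printed by `docker service ps`
container_for_task() {
  awk -v task="$1" '{
    # Assumption: the task ID is the last dot-separated part of the name.
    n = split($2, parts, ".")
    if (n >= 3 && index(parts[n], task) == 1) print $1
  }'
}
```

On a worker it would be fed from the daemon, for example `docker ps --format '{{.ID}} {{.Names}}' | container_for_task i2ogh9fo27ou`.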

The zombie task's container cannot be killed or removed by hand:

zigmund@docker-w2.alahd.kz.dev:~$ docker rm -f c28c0b79670c
Error response from daemon: Could not kill running container c28c0b79670c6aa998bb560cd6cf4251365187feb46453f6cb49819d86dfeede, cannot remove - Cannot kill container c28c0b79670c6aa998bb560cd6cf4251365187feb46453f6cb49819d86dfeede: process c28c0b79670c6aa998bb560cd6cf4251365187feb46453f6cb49819d86dfeede not found: not found

A lot of our services failed to update:

zigmund@docker-m1.alahd.kz.dev:~$ docker service ps stack-iagent-bugfix-iag-808-landing_app-web --no-trunc 
ID                          NAME                                                IMAGE                                                                                                                                  NODE                     DESIRED STATE       CURRENT STATE             ERROR                                                                                                                                                                                                                                                                                                                       PORTS
javaqsim2m9ujw7j6go5wcneu   stack-iagent-bugfix-iag-808-landing_app-web.1       registry:5000/stack/iagent/app@sha256:29255041db3e95d8159447a4fdbe3a111b3857309296365c427e303b3de81726   docker-w2.alahd.kz.dev   Shutdown            Rejected 14 minutes ago   "Failed joining stack-iagent-bugfix-iag-808-landing_internal-endpoint to sandbox stack-iagent-bugfix-iag-808-landing_internal-sbox: container stack-iagent-bugfix-iag-808-landing_internal-sbox: endpoint create on GW Network failed: endpoint with name gateway_stack-iagent already exists in network docker_gwbridge"   
ygpnrkf9ncjaq9yi2trxmdpvv    \_ stack-iagent-bugfix-iag-808-landing_app-web.1   registry:5000/stack/iagent/app@sha256:29255041db3e95d8159447a4fdbe3a111b3857309296365c427e303b3de81726   docker-w2.alahd.kz.dev   Shutdown            Rejected 14 minutes ago   "Failed joining stack-iagent-bugfix-iag-808-landing_internal-endpoint to sandbox stack-iagent-bugfix-iag-808-landing_internal-sbox: container stack-iagent-bugfix-iag-808-landing_internal-sbox: endpoint create on GW Network failed: endpoint with name gateway_stack-iagent already exists in network docker_gwbridge"   
xjysl4jnpanz7390mo3iq3ty9    \_ stack-iagent-bugfix-iag-808-landing_app-web.1   registry:5000/stack/iagent/app@sha256:29255041db3e95d8159447a4fdbe3a111b3857309296365c427e303b3de81726   docker-w2.alahd.kz.dev   Shutdown            Shutdown 3 hours ago                                                                                                                                                                                                                                                                                                                                  
z89ryyydrezpov6fg3d7kha2k    \_ stack-iagent-bugfix-iag-808-landing_app-web.1   registry:5000/stack/iagent/app@sha256:29255041db3e95d8159447a4fdbe3a111b3857309296365c427e303b3de81726   docker-w2.alahd.kz.dev   Shutdown            Failed 3 hours ago        "task: non-zero exit (1)"

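Sometimes swarm itself reports the zombie tasks (note the 3/1 replica count and the tasks with desired state Shutdown that are still Running):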

zigmund@docker-m1.alahd.kz.dev:~$ docker service ls | grep stack-iagent-iag-1119_nginx
ngn4p1knm40g        stack-iagent-iag-1119_nginx                           replicated          3/1                 registry:5000/stack/iagent/nginx@sha256:2859e18e47b45211cf0cc062c4e9bc136cb01339e446a484726de310219db09d       
zigmund@docker-m1.alahd.kz.dev:~$ docker service ps stack-iagent-iag-1119_nginx
ID                  NAME                                IMAGE                                                                                                                                    NODE                     DESIRED STATE       CURRENT STATE             ERROR                              PORTS
jelsqcbrvd56        stack-iagent-iag-1119_nginx.1       registry:5000/stack/iagent/nginx@sha256:2859e18e47b45211cf0cc062c4e9bc136cb01339e446a484726de310219db09d   docker-w2.alahd.kz.dev   Running             Running 24 minutes ago                                       
jpz79v1808pu         \_ stack-iagent-iag-1119_nginx.1   registry:5000/stack/iagent/nginx@sha256:2859e18e47b45211cf0cc062c4e9bc136cb01339e446a484726de310219db09d   docker-w2.alahd.kz.dev   Shutdown            Running 25 minutes ago                                       
znar2fbopy71         \_ stack-iagent-iag-1119_nginx.1   registry:5000/stack/iagent/nginx@sha256:2859e18e47b45211cf0cc062c4e9bc136cb01339e446a484726de310219db09d   docker-w2.alahd.kz.dev   Shutdown            Rejected 26 minutes ago   "Failed creating stack-iagent-…"   
i0yowxygqid0         \_ stack-iagent-iag-1119_nginx.1   registry:5000/stack/iagent/nginx@sha256:2859e18e47b45211cf0cc062c4e9bc136cb01339e446a484726de310219db09d   docker-w3.alahd.kz.dev   Shutdown            Running 2 hours ago                                          
fuhh6f8gg6bg         \_ stack-iagent-iag-1119_nginx.1   registry:5000/stack/iagent/nginx@sha256:2859e18e47b45211cf0cc062c4e9bc136cb01339e446a484726de310219db09d   docker-w1.alahd.kz.dev   Shutdown            Running 3 hours ago
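
The two symptoms in the output above (more running replicas than desired, and tasks with desired state Shutdown that still report Running) can be picked out mechanically. This is a rough sketch under stated assumptions: the function names `find_over_replicated` and `find_zombie_tasks` are hypothetical, and the input is expected in the `--format` layouts named in the comments rather than in the default table layout.

```shell
# Hypothetical helper: list services reporting more running replicas
# than desired.
# stdin: "<service-name> <running>/<desired>" per line, e.g. from
#   docker service ls --format '{{.Name}} {{.Replicas}}'
find_over_replicated() {
  awk '{
    split($2, r, "/")
    if (r[1] + 0 > r[2] + 0) print $1
  }'
}

# Hypothetical helper: list tasks that should be shut down but still
# report a Running state. Prints "<task-id> <node>".
# stdin: "<task-id>|<node>|<desired-state>|<current-state>" per line, e.g. from
#   docker service ps --format '{{.ID}}|{{.Node}}|{{.DesiredState}}|{{.CurrentState}}' SERVICE
find_zombie_tasks() {
  awk -F '|' '$3 == "Shutdown" && $4 ~ /^Running/ { print $1, $2 }'
}
```

The node column tells you which worker to inspect for the leftover container.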

We have to update services over and over. At this point swarm on 17.11 is completely useless.

Steps to reproduce the issue:

  1. Form a swarm cluster of a few Docker 17.11 nodes.
  2. Deploy some services (~50).
  3. Deploy/update/restart services, as you would in normal swarm use.
  4. Observe zombie tasks and task allocation failures.
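
Step 3 can be approximated with a loop that repeatedly forces service updates. This is only a sketch of one way to generate churn, not the exact workload from the report: the function name `churn_services` and the `DOCKER` override variable are hypothetical, while `docker service update --force` is the standard CLI flag that redeploys a service even when its spec is unchanged.

```shell
# Assumption: DOCKER can be overridden (e.g. DOCKER=echo) for a dry run.
DOCKER="${DOCKER:-docker}"

# Hypothetical helper: force-update each named service, repeated for a
# number of rounds.
# $1:   number of rounds
# rest: service names
churn_services() {
  rounds="$1"
  shift
  i=1
  while [ "$i" -le "$rounds" ]; do
    for svc in "$@"; do
      # --force restarts the service's tasks even without config changes.
      "$DOCKER" service update --force "$svc" ||
        echo "update failed: $svc (round $i)" >&2
    done
    i=$((i + 1))
  done
}
```

For example, `churn_services 10 $(docker service ls --format '{{.Name}}')` would cycle every service ten times; checking `docker service ps` afterwards shows whether any old tasks were left behind.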

Describe the results you received:
Broken service updates.

Describe the results you expected:
Working service updates.

Additional information you deem important (e.g. issue happens only occasionally):
Recreated the swarm from scratch twice and tried different kernel versions from 4.4 to 4.13 (read somewhere that the instability might be kernel-related); no luck.

Not sure whether this is related to the overall network instability: #35592

Output of docker version:

Client:
 Version:      17.11.0-ce
 API version:  1.34
 Go version:   go1.8.3
 Git commit:   1caf76c
 Built:        Mon Nov 20 18:37:39 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.11.0-ce
 API version:  1.34 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   1caf76c
 Built:        Mon Nov 20 18:36:09 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 17.11.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: gelf
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: moelv396xsep51hmkqx18ozqj
 Is Manager: true
 ClusterID: os8oryt2yez9mxqy83jn8cv7s
 Managers: 3
 Nodes: 7
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.9.243.1
 Manager Addresses:
  10.9.243.1:2377
  10.9.243.2:2377
  10.9.243.3:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 992280e8e265f491f7a624ab82f3e238be086e49
runc version: 0351df1c5a66838d0c392b4ac4cf9450de844e2d
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.8.0-58-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.953GiB
Name: docker-m1.alahd.kz.dev
ID: F3CI:R4DV:UPQB:JYXY:XAOT:R7FJ:7L7K:X7OG:GKTK:RS2L:IJXT:THVN
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):
Ubuntu 16.04.3 LTS 4.8.0-58-generic, Proxmox VM.
