Service update rollback with "start-first" replaces existing healthy tasks #34111

@sirlatrom

Description

In Docker 17.05.0-ce (#30261), the --update-failure-action=rollback option was introduced. If an update with the options --update-order=start-first --update-failure-action=rollback --update-parallelism=0 --rollback-order=start-first fails, the previously running, healthy containers are replaced with new ones as part of the rollback, instead of keeping the old ones. That causes a window of downtime if an update fails and is rolled back.

Steps to reproduce the issue:

  1. Create two images which sleep for 10 seconds before running nginx. One has 1 health check retry, the other has 15. The health check interval is 1 second for each.
cat > main.sh << EOF
#!/bin/sh -ex
sleep 10
touch /health
nginx -g "daemon off;"
EOF
chmod +x main.sh
cat > Dockerfile.1retry << EOF
FROM nginx:alpine
HEALTHCHECK --interval=1s --retries=1 CMD test -r /health || exit 1
COPY main.sh /main.sh
ENTRYPOINT /main.sh
CMD ["nginx", "-g", "daemon off;"]
EOF
docker build -t 10s:1retry -f Dockerfile.1retry .
cat > Dockerfile.15retries << EOF
FROM 10s:1retry
HEALTHCHECK --interval=1s --retries=15 CMD test -r /health || exit 1
EOF
docker build -t 10s:15retries -f Dockerfile.15retries .
  2. Create the service and wait for it to be healthy:
docker service create --detach=false --name=10s --replicas=2 --update-monitor=1s --update-failure-action=rollback --update-order=start-first --rollback-order=start-first --rollback-monitor=1s --publish 80:80 10s:15retries nginx -g "daemon off;"
  3. Start separate shells for watch -n 1 docker ps, watch -n 1 docker service ps 10s, watch -n 1 docker service inspect --pretty 10s, watch -n 1 curl -sS localhost, and docker service logs -f 10s to see what's going on.
  4. Update the service with the 1 retry image and observe downtime:
docker service update --detach=false --image 10s:1retry --rollback-order=start-first --rollback-parallelism=1 --update-parallelism=0 10s

Describe the results you received:

  1. The containers/tasks running prior to the update are stopped and removed as part of the rollback
  2. There is an extended window of downtime as experienced when curling the ingress port

I attached a GIF of what I experienced (image: "fail").
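For anyone who wants a number rather than a GIF, here is a rough sketch of how the downtime window could be measured. The function names `probe` and `longest_outage` are made up for this example, and it assumes the service is published on port 80 of localhost as in the steps above. With one sample per second, the reported count is only an approximation of the outage in seconds.

```shell
#!/bin/sh
# Poll the published port about once per second and print "<epoch> <status>".
# curl reports status 000 when it gets no HTTP response at all.
# (Hypothetical helper; the URL default matches --publish 80:80 above.)
probe() {
  url=${1:-http://localhost/}
  while :; do
    code=$(curl -s -o /dev/null -m 1 -w '%{http_code}' "$url")
    echo "$(date +%s) ${code:-000}"
    sleep 1
  done
}

# Read probe output on stdin and print the longest run of consecutive
# samples whose status was not 200.
longest_outage() {
  awk '$2 != "200" { run++; if (run > max) max = run; next }
       { run = 0 }
       END { print max + 0 }'
}

# Example use (not run here):
#   probe > probe.log &      # start before "docker service update"
#   ...                      # run the update and wait for the rollback
#   kill %1; longest_outage < probe.log
```

A result of 0 would correspond to the expected behavior; anything larger is the window described in item 2.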

Describe the results you expected:

  1. The healthy containers running prior to the update should simply stay in place
  2. I should not experience any downtime when curling on the ingress port
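Expectation 1 could be checked mechanically by comparing the running task IDs before the update and after the rollback. This is only a sketch: `tasks_preserved` is a hypothetical helper, and it assumes the ID lists were captured with `docker service ps --filter desired-state=running -q 10s`, one ID per line.

```shell
#!/bin/sh
# Succeed only if every task ID listed in file $1 also appears in file $2.
# (Hypothetical helper; both files hold one task ID per line.)
tasks_preserved() {
  # grep -vxFf prints the lines of $1 that match no whole line of $2.
  missing=$(grep -vxFf "$2" "$1" || true)
  [ -z "$missing" ]
}

# Example use (not run here):
#   docker service ps --filter desired-state=running -q 10s > before.txt
#   ...  # run the failing update and wait for the rollback to finish
#   docker service ps --filter desired-state=running -q 10s > after.txt
#   tasks_preserved before.txt after.txt && echo kept || echo replaced
```

With the behavior reported here, the check would be expected to print "replaced", since the rollback starts new tasks instead of keeping the old ones.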

Additional information you deem important (e.g. issue happens only occasionally):
Nope.

Output of docker version:

Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:23:31 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:19:04 2017
 OS/Arch:      linux/amd64
 Experimental: true

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 53
Server Version: 17.06.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: j3bnlf3fdmrfmtth4cmsmfbrf
 Is Manager: true
 ClusterID: vaiauv39cxsnfq0r14014lwi5
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Root Rotation In Progress: false
 Node Address: 10.108.103.132
 Manager Addresses:
  10.108.103.132:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfb82a876ecc11b5ca0977d1733adbe58599088a
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.12.1-041201-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.55GiB
Name: lx64pc0265
ID: 3FQU:UDEZ:SNTQ:AE34:UABY:5FO2:DRHR:LC2U:RY7E:NMNL:PM65:DYH5
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Http Proxy: http://localhost:3128/
Https Proxy: http://localhost:3128/
No Proxy: localhost,127.0.0.0/8
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
Physical bare metal workstation. Yes I know I'm on Linux 4.12.1, but that's hardly likely to be the issue here.
