Service update rollback with "start-first" replaces existing healthy tasks #34111

**Description**
In Docker 17.05.0-ce (#30261), the `--update-failure-action=rollback` option was introduced. If an update run with `--update-order=start-first --update-failure-action=rollback --update-parallelism=0 --rollback-order=start-first` fails, the rollback replaces the previously running, healthy containers with new ones instead of keeping the old ones. This causes a window of downtime whenever an update fails and is rolled back.

**Steps to reproduce the issue:**
1. Create two images which sleep for 10 seconds before running nginx. One has 1 health check retry, the other has 15. The health check interval is 1 second for each.
```bash
cat > main.sh << EOF
#!/bin/sh -ex
sleep 10
touch /health
nginx -g "daemon off;"
EOF
chmod +x main.sh
cat > Dockerfile.1retry << EOF
FROM nginx:alpine
HEALTHCHECK --interval=1s --retries=1 CMD test -r /health || exit 1
COPY main.sh /main.sh
ENTRYPOINT /main.sh
CMD ["nginx", "-g", "daemon off;"]
EOF
docker build -t 10s:1retry -f Dockerfile.1retry .
cat > Dockerfile.15retries << EOF
FROM 10s:1retry
HEALTHCHECK --interval=1s --retries=15 CMD test -r /health || exit 1
EOF
docker build -t 10s:15retries -f Dockerfile.15retries .
```
2. Create the service and wait for it to be healthy:
```bash
docker service create --detach=false --name=10s --replicas=2 --update-monitor=1s --update-failure-action=rollback --update-order=start-first --rollback-order=start-first --rollback-monitor=1s --publish 80:80 10s:15retries nginx -g "daemon off;"
```
3. In separate shells, run `watch -n 1 docker ps`, `watch -n 1 docker service ps 10s`, `watch -n 1 docker service inspect --pretty 10s`, `watch -n 1 curl -sS localhost`, and `docker service logs -f 10s` to see what's going on.
4. Update the service with the 1 retry image and observe downtime:
```bash
docker service update --detach=false --image 10s:1retry --rollback-order=start-first --rollback-parallelism=1 --update-parallelism=0 10s
```
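
To put a number on the downtime instead of eyeballing `watch`, something like the following could be run alongside step 4. It is only a sketch: the `probe` and `monitor` function names are made up for illustration, and it assumes the service is published on `localhost` port 80 as in step 2.

```shell
#!/bin/sh
# probe URL: make one request and print a timestamped UP or DOWN line.
# -f makes curl fail on HTTP errors, --max-time bounds each attempt to 1s.
probe() {
    if curl -fsS -o /dev/null --max-time 1 "$1" 2>/dev/null; then
        echo "$(date +%T) UP"
    else
        echo "$(date +%T) DOWN"
    fi
}

# monitor URL COUNT: call probe roughly once per second, COUNT times.
monitor() {
    url=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        probe "$url"
        i=$((i + 1))
        sleep 1
    done
}

# Example (start it just before the update in step 4):
#   monitor http://localhost 120 | tee probe.log
#   grep -c DOWN probe.log   # rough number of seconds without a response
```

The `DOWN` count is only approximate, since each failed probe can itself take up to a second before the next one starts.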

**Describe the results you received:**
1. The containers/tasks running prior to the update are stopped and removed as part of the rollback
2. There is an extended window of downtime when `curl`ing the ingress port

I attached a GIF of what I experienced.
![fail](https://user-images.githubusercontent.com/425633/28209462-e4ebc972-6892-11e7-9504-9b9a6e6916ba.gif)

**Describe the results you expected:**
1. The healthy containers running prior to the update should simply stay in place
2. I should not experience any downtime when `curl`ing the ingress port
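
One way to check the first expectation mechanically is to compare the task IDs before and after the failed update. This is a rough sketch, not a definitive test: `running_tasks` and `tasks_unchanged` are hypothetical helper names, and it assumes that filtering on `desired-state=running` is a good enough proxy for "the healthy tasks that were serving traffic".

```shell
#!/bin/sh
# running_tasks SERVICE: print the sorted IDs of tasks whose desired state is "running".
running_tasks() {
    docker service ps --quiet --filter desired-state=running "$1" | sort
}

# tasks_unchanged BEFORE AFTER: succeed only if both ID lists are identical.
tasks_unchanged() {
    [ "$1" = "$2" ]
}

# Usage around the failing update from step 4:
#   before=$(running_tasks 10s)
#   docker service update --detach=false --image 10s:1retry 10s
#   after=$(running_tasks 10s)
#   if tasks_unchanged "$before" "$after"; then echo "tasks kept"; else echo "tasks replaced"; fi
```

With the behavior described in this report, the comparison would be expected to print `tasks replaced` once the rollback has finished.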

**Additional information you deem important (e.g. issue happens only occasionally):**
Nope.

**Output of `docker version`:**

```yaml
Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:23:31 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:19:04 2017
 OS/Arch:      linux/amd64
 Experimental: true
```

**Output of `docker info`:**

```yaml
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 53
Server Version: 17.06.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: j3bnlf3fdmrfmtth4cmsmfbrf
 Is Manager: true
 ClusterID: vaiauv39cxsnfq0r14014lwi5
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Root Rotation In Progress: false
 Node Address: 10.108.103.132
 Manager Addresses:
  10.108.103.132:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfb82a876ecc11b5ca0977d1733adbe58599088a
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.12.1-041201-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.55GiB
Name: lx64pc0265
ID: 3FQU:UDEZ:SNTQ:AE34:UABY:5FO2:DRHR:LC2U:RY7E:NMNL:PM65:DYH5
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Http Proxy: http://localhost:3128/
Https Proxy: http://localhost:3128/
No Proxy: localhost,127.0.0.0/8
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
```

**Additional environment details (AWS, VirtualBox, physical, etc.):**
Physical bare metal workstation. Yes I know I'm on Linux 4.12.1, but that's hardly likely to be the issue here.
