Issue with containerStop and long running tasks prematurely exiting #45731

@srisch

Description

It seems that behavior has changed in Docker 23+ with regard to containerStop and stopping long-running containers via the API.

The current behavior appears to be that when sending a POST to /stop?t=3600, the connection does not return a 204 until the container has fully stopped, so the connection can in theory stay open for an hour. If that connection is dropped, the container is killed immediately.

Here's an example log of initiating the request and then cancelling the curl:

Jun 12 15:24:41 dg11 dockerd[1263909]: time="2023-06-12T15:24:41.686151349Z" level=debug msg="Calling POST /v1.41/containers/1e70fb6382cb/stop?t=3600"
Jun 12 15:24:41 dg11 dockerd[1263909]: time="2023-06-12T15:24:41.686276060Z" level=debug msg="Sending kill signal 15 to container 1e70fb6382cb3d323f92ba2477d9430ea870f341e61af5623d02dc5056d6af7e"
Jun 12 15:30:52 dg11 dockerd[1263909]: time="2023-06-12T15:30:52.095812601Z" level=info msg="Container failed to exit within 1h0m0s of signal 15 - using the force" container=1e70fb6382cb3d323f92ba2477d9430ea870f341e61af5623d02dc5056d6af7e
Jun 12 15:30:52 dg11 dockerd[1263909]: time="2023-06-12T15:30:52.095878637Z" level=debug msg="Sending kill signal 9 to container 1e70fb6382cb3d323f92ba2477d9430ea870f341e61af5623d02dc5056d6af7e"

Docker 20.10 appears to return the 204 status code before the container has stopped; on Docker 23+, our Nomad client times out while sending the POST request.

Reproduce

Start a long-running container that emulates something like HAProxy draining open connections, then tell it to stop:

curl -v --unix-socket /var/run/docker.sock -X POST http://localhost/v1.43/containers/1e70fb6382cb/stop?t=3600

If you cancel the curl request, the container is killed immediately.

This is the client code we initially noticed the bug with: on Docker 23+ the request times out, while on 20.10 the stop behaves as before. https://github.com/fsouza/go-dockerclient/blob/main/container_stop.go

### Expected behavior

Is this the expected behavior of this endpoint?

I plan to open an issue with the Nomad team as well, since that is the client I originally noticed the issue with, but I wanted to check whether this is expected behavior, as it seems to have changed between Docker versions.

### docker version

```bash
Client: Docker Engine - Community
 Version:           24.0.2
 API version:       1.43
 Go version:        go1.20.4
 Git commit:        cb74dfc
 Built:             Thu May 25 21:52:22 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.2
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.4
  Git commit:       659604f
  Built:            Thu May 25 21:52:22 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.21
  GitCommit:        3dce8eb055cbb6872793272b4f20ed16117344f8
 nvidia:
  Version:          1.1.7
  GitCommit:        v1.1.7-0-g860f061
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

```

### docker info

```bash

Client: Docker Engine - Community
 Version:    24.0.2
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.5
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.18.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 46
  Running: 27
  Paused: 0
  Stopped: 19
 Images: 56
 Server Version: 24.0.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: 3lm7d688ghjrw9zr0sqpmkdrx
  Is Manager: true
  ClusterID: 87z2x0y7sdhccy5ch5wzepop7
  Managers: 1
  Nodes: 1
  Default Address Pool: 10.0.0.0/8  
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.0.5.10
  Manager Addresses:
   10.0.5.10:2377
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
 runc version: v1.1.7-0-g860f061
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
 Kernel Version: 5.4.0-137-generic
 Operating System: Ubuntu 20.04.6 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 24
 Total Memory: 125.8GiB
 Name: dg11
 ID: b1793812-e978-432b-a86d-5f548d7da15f
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 295
  Goroutines: 535
  System Time: 2023-06-12T15:32:00.856319114Z
  EventsListeners: 1
 Username: 
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
 Live Restore Enabled: false
```

Additional Info

No response
