Pauses/delays with overlay network on swarm #31746

@ev0rtex

Description

With a simple Node.js web app (about as simple as it can get) I was using siege to test what kind of throughput I could get on a swarm (a 7-node swarm set up on my company's VMware cluster). The results seemed sporadic, so I kept digging and narrowed things down quite a bit by creating a single-node swarm on my personal computer and running siege with a concurrency of 1 for 15 seconds (siege -b -t 15S -c 1). A good portion of the time I get consistently fast responses, but periodically the throughput drops to exactly 1 request per second for a short time.

Steps to reproduce the issue:

  1. Bring up a single-node swarm and create an overlay network for service:
    docker swarm init
    docker network create swarmtest_network --driver overlay
    
  2. Build the docker image:
    server.js
    let http = require('http')
    
    let server = http.createServer((req, res) => {
      res.end('cool beans', 'utf-8')
    })
    
    server.listen(3001, '0.0.0.0', err => {
      if (err) {
        console.error(err.message)
        process.exit(1)
      }
    
      console.log('Server is running on port 3001')
    })
    Dockerfile
    FROM node:7-slim
    COPY server.js /server.js
    CMD node server.js
    
    docker build -t myregistry/swarm-test:latest .
    
  3. Start the service on the swarm with the overlay network we created:
    docker service create --replicas 1 --name swarm-test --network swarmtest_network --publish 3001:3001 myregistry/swarm-test:latest
    
  4. Try to load test the service using siege:
    siege -b -t 15S -c 1 http://localhost:3001
    

Describe the results you received:
At some points I get lots of requests through, just as I do when running the node process on my host outside of Docker:

Transactions:                   5079 hits
Availability:                 100.00 %
Elapsed time:                  14.91 secs
Data transferred:               0.05 MB
Response time:                  0.00 secs
Transaction rate:             340.64 trans/sec
Throughput:                     0.00 MB/sec
Concurrency:                    0.98
Successful transactions:        5079
Failed transactions:               0
Longest transaction:            0.10

Then something happens that causes responses to come back at about 1 per second:

The server is now under siege...
HTTP/1.1 200     1.08 secs:      10 bytes ==> GET  /
HTTP/1.1 200     1.04 secs:      10 bytes ==> GET  /
HTTP/1.1 200     1.04 secs:      10 bytes ==> GET  /
HTTP/1.1 200     1.04 secs:      10 bytes ==> GET  /
HTTP/1.1 200     1.04 secs:      10 bytes ==> GET  /
HTTP/1.1 200     1.04 secs:      10 bytes ==> GET  /
HTTP/1.1 200     1.04 secs:      10 bytes ==> GET  /
HTTP/1.1 200     1.04 secs:      10 bytes ==> GET  /
HTTP/1.1 200     1.04 secs:      10 bytes ==> GET  /
HTTP/1.1 200     1.04 secs:      10 bytes ==> GET  /
HTTP/1.1 200     1.04 secs:      10 bytes ==> GET  /
HTTP/1.1 200     1.04 secs:      10 bytes ==> GET  /
HTTP/1.1 200     1.04 secs:      10 bytes ==> GET  /

Lifting the server siege...
Transactions:                     13 hits
Availability:                 100.00 %
Elapsed time:                  14.21 secs
Data transferred:               0.00 MB
Response time:                  1.04 secs
Transaction rate:               0.91 trans/sec
Throughput:                     0.00 MB/sec
Concurrency:                    0.95
Successful transactions:          13
Failed transactions:               0
Longest transaction:            1.08
Shortest transaction:           1.04

Describe the results you expected:
I would expect the performance to be consistent on the swarm. Periodic slowdowns like this are not expected.

Additional information you deem important (e.g. issue happens only occasionally):
When I watch the CPU usage of the process inside the container, it drops completely during the slow response times, so it is not a CPU-bound issue (at least not with regard to the containerized process). I have tried this on a swarm with 3 managers and 4 workers, on a smaller 3-node swarm, and here on a single-node swarm locally on my system. The same issue occurs in all cases.

I tried running an instance using the bridge network instead of running it as a service on the swarm, and that works perfectly: after running siege for quite a while I could not get it to slow down like I'm seeing with the swarm/overlay network.

Output of docker version:

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 10:40:59 2017
 OS/Arch:      darwin/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   3a232c8
 Built:        Tue Feb 28 07:52:04 2017
 OS/Arch:      linux/amd64
 Experimental: true

Output of docker info:

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 17.03.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
Swarm: active
 NodeID: 3imbeve09kqpiy9lx2gcz5qat
 Is Manager: true
 ClusterID: ku0y69l21tz1edsgex49a2yao
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 192.168.65.2
 Manager Addresses:
  192.168.65.2:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.12-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 1.952 GiB
Name: moby
ID: 73SZ:2PB7:FGPJ:PD3L:ZANW:6G63:A7WF:HJB5:C5JM:RHED:TSCI:FJLQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 47
 Goroutines: 154
 System Time: 2017-03-10T19:47:51.213052386Z
 EventsListeners: 2
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
MacBook Pro 2016/macOS Sierra (for single-node swarm as used for these results)
Running latest Mac version of Docker: 17.03.0-ce-mac2

On the multi-node swarms I was using Debian Jessie VMs on our company's VMware cluster with the same version of Docker.
