
connections get "stuck" in swarm between wildfly and postgres #37466

@yaskoo

Description

I'm not sure whether this is actually a bug or just a general question, but the behavior is strange.
We're deploying a compose file to a swarm with three nodes: one manager and two workers.
The swarm was initialized with docker swarm init --advertise-addr eth0, and the workers joined using docker swarm join --token=worker-join-token manager-address.

The compose file contains 19 services, all using the default network. Two of the services are a WildFly 9.0.2.Final application service and a postgres 9.5 service. The WildFly service is deployed on the manager and postgres on one of the workers.
The application seems to work fine for a while, but as far as we can tell, once more people start using it (not that many, actually: 3, maybe 4) the application gets "stuck".

Running netstat inside the WildFly container, we see some connections in the ESTABLISHED state whose Send-Q shows 52 bytes. These bytes never get sent until eventually something kills the connection. I read somewhere that these could be ACK packets, but I'm not sure.

tcp        0     52 10.0.0.72:59338         10.0.0.37:5432          ESTABLISHED -
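A quick way to spot such connections from inside the container is to filter the netstat output for established connections to the PostgreSQL port with a non-empty send queue (a sketch; the field positions assume the netstat output format shown above):

```shell
# List ESTABLISHED connections to port 5432 whose Send-Q is non-zero.
# Fields in `netstat -ant` output: 1=proto, 2=Recv-Q, 3=Send-Q,
# 4=local address, 5=foreign address, 6=state.
netstat -ant | awk '$6 == "ESTABLISHED" && $5 ~ /:5432$/ && $3 > 0 {print $4, "->", $5, "Send-Q="$3}'
```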

On the postgres side the same connection is also established, but its send and receive queues are empty. The problem seems to be that WildFly doesn't realize these connections are stuck and keeps using them.
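As a possible mitigation (a sketch, not something we've verified), WildFly's datasource can be configured to validate pooled connections in the background so that dead ones are evicted from the pool instead of being handed back out. The datasource names and URL below are hypothetical:

```xml
<!-- Sketch for standalone.xml; jndi-name, pool-name and connection-url are made up -->
<datasource jndi-name="java:jboss/datasources/AppDS" pool-name="AppDS">
  <connection-url>jdbc:postgresql://postgres:5432/app</connection-url>
  <driver>postgresql</driver>
  <validation>
    <background-validation>true</background-validation>
    <background-validation-millis>60000</background-validation-millis>
    <valid-connection-checker class-name="org.jboss.jca.adapters.jdbc.extensions.postgres.PostgreSQLValidConnectionChecker"/>
  </validation>
</datasource>
```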

We found two ways to work around this issue, which is why I'm creating an issue here.

  1. move the database service to the same node where WildFly is deployed (the manager)
  2. declare a separate network for the database and attach it to postgres and all the other services that use it
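Workaround 2 looks roughly like this in the compose file (the service names, image names, and network name below are illustrative, not our actual file):

```yaml
version: "3.3"
services:
  wildfly:
    image: example/wildfly-app:latest   # illustrative image name
    networks:
      - default
      - db          # wildfly also joins the dedicated database network
  postgres:
    image: postgres:9.5
    networks:
      - db          # postgres only joins the dedicated network
networks:
  db:
    driver: overlay
```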

Is it possible that the default network gets overwhelmed? But if that's the case, why does it work when the containers are on the same node?
Has anyone seen anything like this, or could someone provide pointers on how to debug it further?

Additional information you deem important (e.g. issue happens only occasionally):
Services connect to each other using their service name.

Output of docker version:

Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:20:16 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:23:58 2018
  OS/Arch:      linux/amd64
  Experimental: true

Output of docker info:

Containers: 55
 Running: 18
 Paused: 0
 Stopped: 37
Images: 20
Server Version: 18.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: ee4plb1trccarzirq6bhvh7n0
 Is Manager: true
 ClusterID: 8k53sehy82d0z38tu9xb87at8
 Managers: 1
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 172.16.2.114
 Manager Addresses:
  172.16.2.114:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-862.3.3.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 29.45GiB
Name: ...
ID: WW5X:QZQX:PR4B:LCNS:CUE7:DVPZ:NM65:5DPP:ORKJ:RIPQ:EN7K:PDGD
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
All nodes are running CentOS 7 with kernel 3.10.0. We also tried updating the kernel to 4.17.5, but that didn't fix the problem.

We also have some sysctl tunables set:

net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_timestamps = 0

The first three were needed because of this.
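For reference, with the keepalive values above a dead idle peer should be detected after roughly tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl:

```shell
# Worst-case keepalive detection time with the tunables above
echo $(( 600 + 3 * 60 ))   # 780 seconds, ~13 minutes
```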

We're also experiencing this in a similar environment deployed in Azure.
