Description
I'm not sure whether this is actually a bug or just a general question, but the behavior is strange.
We're deploying a compose file to a swarm with three nodes: one manager and two workers.
The swarm was initialized with `docker swarm init --advertise-addr eth0`, and the workers joined using `docker swarm join --token=worker-join-token manager-address`.
The compose file contains 19 services, all using the default network. Two of the services are a WildFly 9.0.2.Final application service and a PostgreSQL 9.5 service. The WildFly service is deployed on the manager and postgres on one of the workers.
The application seems to work fine for a while, but as far as we can tell, once more people start using it (not that many, actually: 3, maybe 4) the application gets "stuck".
Running netstat inside the wildfly container, we see some connections in the ESTABLISHED state whose Send-Q shows 52 bytes. Those bytes never get sent, until eventually something kills the connection. I read somewhere that these could be ACK packets, but I'm not sure.
```
tcp        0     52 10.0.0.72:59338         10.0.0.37:5432          ESTABLISHED -
```
On the postgres side the same connection is also established, but both the send and receive queues are empty. The problem seems to be that WildFly doesn't realize that these connections are stuck and keeps using them.
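For reference, the suspect sockets can be spotted by filtering netstat output for ESTABLISHED connections with a non-zero Send-Q. A minimal awk sketch; the sample below is illustrative data mirroring the output above, and in practice you would pipe the live `netstat -tn` output into the filter instead:

```shell
# Hypothetical capture of `netstat -tn` from inside the wildfly container.
netstat_sample='Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0     52 10.0.0.72:59338         10.0.0.37:5432          ESTABLISHED
tcp        0      0 10.0.0.72:59340         10.0.0.37:5432          ESTABLISHED'

# Keep only ESTABLISHED TCP sockets whose Send-Q (field 3) is non-zero.
stuck=$(printf '%s\n' "$netstat_sample" |
  awk '$1 == "tcp" && $3 > 0 && $6 == "ESTABLISHED" { print $4, "->", $5, "Send-Q:", $3 }')
echo "$stuck"
```

A connection that stays in this filtered list across repeated runs is a candidate for being wedged, since a healthy peer should drain the send queue quickly.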
We found two workarounds for this issue, which is why I'm creating an issue here:
- move the database service to the same node where WildFly is deployed (on the manager)
- declare a separate network for the database and attach it to postgres and all the other services that use it
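The second workaround can be sketched as a compose fragment; the service names, images, and network name below are illustrative, not our actual file:

```yaml
version: "3.5"

services:
  wildfly:
    image: example/wildfly:9.0.2.Final   # placeholder image name
    networks:
      - default
      - dbnet
  postgres:
    image: postgres:9.5
    networks:
      - dbnet

networks:
  dbnet:
    driver: overlay
```

With this layout, database traffic goes over the dedicated `dbnet` overlay instead of the default network, which is what made the problem disappear for us.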
Is it possible that the default network gets overwhelmed? But if that's the case, why does it work when the containers are on the same node?
Has anyone seen anything like that or could provide pointers on how to debug this further?
Additional information you deem important (e.g. issue happens only occasionally):
Services connect to each other using their service name.
Output of `docker version`:

```
Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:20:16 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:23:58 2018
  OS/Arch:      linux/amd64
  Experimental: true
```
Output of `docker info`:

```
Containers: 55
 Running: 18
 Paused: 0
 Stopped: 37
Images: 20
Server Version: 18.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: ee4plb1trccarzirq6bhvh7n0
 Is Manager: true
 ClusterID: 8k53sehy82d0z38tu9xb87at8
 Managers: 1
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 172.16.2.114
 Manager Addresses:
  172.16.2.114:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-862.3.3.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 29.45GiB
Name: ...
ID: WW5X:QZQX:PR4B:LCNS:CUE7:DVPZ:NM65:5DPP:ORKJ:RIPQ:EN7K:PDGD
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
```
Additional environment details (AWS, VirtualBox, physical, etc.):
All nodes are running CentOS 7 with kernel 3.10.0. We also tried upgrading the kernel to 4.17.5, but that didn't fix the problem.
We also have some sysctl tunables set:

```
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_timestamps = 0
```
The first three were needed because of this.
We're also experiencing this in a similar environment deployed in Azure.
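To keep the tunables above across reboots, they can go into a sysctl drop-in file; the file name below is arbitrary, and the values are the ones listed above:

```
# /etc/sysctl.d/99-tcp-keepalive.conf  (hypothetical file name)
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_timestamps = 0
```

The file can be applied without a reboot via `sysctl --system`.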