
Overlay vxlanid_list differs across workers, causing communication problems between tasks on the overlay network #31559

@yuval-p

Description

I have a Docker swarm cluster running version 1.12.3, with 3 managers (in drain mode) and 10 workers.
I created two overlay networks with the following commands:
docker network create -d overlay network1
docker network create -d overlay network2

and deployed several services, some attached to network1 and some to network2.

One service (serviceA) on network1 communicates with another service (serviceB) on the same network by service name, through the internal service load balancer.

After a couple of days, we scaled serviceA and noticed that some of the serviceB instances (tasks) were unreachable.

After investigating the problem, we found that the network1 parameter
"com.docker.network.driver.overlay.vxlanid_list"
is not consistent across the cluster: on some workers it was 258, and on others it was 257.

On the managers and most of the workers, the value for network1 is 258 and
the value for network2 is 257, so on the affected workers network1 was using the VXLAN ID that the rest of the cluster assigns to network2.
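A quick way to find the inconsistent hosts is to read the option on every node and compare it against the value most nodes report. The following is a minimal sketch under stated assumptions, not an official tool: it assumes passwordless SSH from an admin machine to each node, assumes the network already exists on each node queried, and the function names collect_vxlan_ids and find_vxlan_mismatch (as well as any host names you pass in) are hypothetical. The value itself is read with the --format template option of docker network inspect, indexing the Options map by the key quoted above.

```shell
#!/bin/sh
# Sketch: find hosts whose overlay VXLAN ID for a network differs from the
# value most hosts report. Function names are hypothetical, not Docker tools.

# Print "<host> <vxlanid_list>" for network $1 on every host listed after it.
# Assumes passwordless SSH to each host. If the network does not exist on a
# host (or the command fails), only the host name is printed for that host.
collect_vxlan_ids() {
  net=$1
  shift
  for h in "$@"; do
    id=$(ssh -n "$h" "docker network inspect -f '{{index .Options \"com.docker.network.driver.overlay.vxlanid_list\"}}' $net" 2>/dev/null)
    printf '%s %s\n' "$h" "$id"
  done
}

# Read "<host> <vxlanid_list>" lines on stdin and print each host whose value
# differs from the most common one. Lines without a value are skipped.
# Ties for "most common" are broken arbitrarily.
find_vxlan_mismatch() {
  awk '
    NF >= 2 { host[NR] = $1; val[NR] = $2; count[$2]++ }
    END {
      best = ""; max = 0
      for (v in count)
        if (count[v] > max) { max = count[v]; best = v }
      for (i = 1; i <= NR; i++)
        if ((i in val) && val[i] != best) print host[i], val[i]
    }
  '
}
```

For example, piping collect_vxlan_ids network1 followed by the node names into find_vxlan_mismatch would print only the nodes whose network1 VXLAN ID disagrees with the majority (in this report, the workers showing 257). Keeping collection and comparison separate means the comparison can also be run on values gathered by hand.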

Steps to reproduce the issue:

  1. docker network create -d overlay network1
  2. docker network create -d overlay network2
  3. docker service create ... serviceA
  4. docker service create ... serviceB
  5. docker service scale serviceA=4

Describe the results you received:
Communication from serviceA to serviceB sometimes works and sometimes does not:

telnet serviceB 8080
telnet: connect to address serviceB : Connection refused

Describe the results you expected:

telnet serviceB 8080
Connected to ServiceB

Additional information you deem important (e.g. issue happens only occasionally):
There is no firewall between the servers.
We originally installed 1.12.1 and later upgraded to 1.12.3.
The hosts run Red Hat 7.3.
As mentioned above, the problem occurs only occasionally, depending on the value of "com.docker.network.driver.overlay.vxlanid_list" on the worker host where the destination task is running.

Output of docker version:

Client:
Version: 1.12.3
API version: 1.24
Go version: go1.6.3
Git commit: 6b644ec
Built:
OS/Arch: linux/amd64

Server:
Version: 1.12.3
API version: 1.24
Go version: go1.6.3
Git commit: 6b644ec
Built:
OS/Arch: linux/amd64

Output of docker info:

Unfortunately, I do not have the output available right now.

Additional environment details (AWS, VirtualBox, physical, etc.):
All hosts are virtual machines running on VMware.
