NetworkDB does not always reliably converge #47728

@s4ke

Description

In our setups, we keep running into Docker DNS resolution problems whenever we:

  1. restart Docker nodes in quick succession
  2. update Docker nodes (and therefore restart them in quick succession)
  3. have network issues

At first we suspected the MTU settings of the networking control plane, but the issue also occurred with an MTU that fits our setup (we used 1350 instead of the default 1500).

It seems that during these times, the gossip network of networkdb does not get synced up properly. We debugged this with the built-in debugging tooling of libnetwork (which is really helpful, btw) and found that dockerd has no mechanism for resyncing the endpoint_table of networkdb (or the overlay_peer_table, for that matter, but that does not seem to be as much of a problem). We double-checked this by going through the code of libnetwork/agent.go:

The only places that update anything in networkdb are the calls to addServiceInfoToCluster (CreateEntry), addDriverInfoToCluster (CreateEntry), deleteDriverInfoFromCluster (DeleteEntry), deleteServiceInfoFromCluster (DeleteEntry), and disableServiceInNetworkDB (UpdateEntry).

While we might have missed something, here is our proposal: there should be an opt-in Docker daemon config option that enables a background job in the daemon which periodically resyncs all DNS entries to networkdb. I have not thought much about the design, but I imagine an interval of 1-5 minutes should be fine for most clusters. This way, whenever a DNS entry is out of sync, things should fix themselves within a somewhat acceptable time.
