libnetwork: fix flaky Swarm service DNS #50229

Merged
thaJeztah merged 1 commit into moby:master from corhere:libn/fix-networkdb-dns-update-delete on Jun 19, 2025
Contributor

@corhere corhere commented Jun 18, 2025

- What I did
- How I did it

When libnetwork receives a watch event for a driver table entry from NetworkDB, it passes the event along to the interested driver. This code contains a subtle bug: update events from NetworkDB are passed along to the driver as Delete events! The bug lay dormant because driver-table entries can only be added by the driver, not updated. NetworkDB now broadcasts an UpdateEvent to watchers whenever the entry is already known to the local NetworkDB, irrespective of whether the event received from the remote peer was a CREATE or an UPDATE event, so the bug has started causing problems. Whenever a remote node replaces an entry in the overlay_peer_table and the local node never receives the intermediate delete state, NetworkDB translates the new CREATE event into an UpdateEvent, and the overlay driver then handles it as if the entry had been deleted!

Bubble table UPDATE events up to the network driver as Update events.
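
The shape of the fix can be sketched as a change to a single switch arm. The following is a simplified, self-contained illustration, not the libnetwork source: `watchKind`, `driverKind`, `buggyTranslate`, and `fixedTranslate` are hypothetical names standing in for the watch-event and driver-event types.

```go
package main

import "fmt"

// watchKind models the kind of table event received from the database.
// Hypothetical type for illustration only.
type watchKind int

const (
	watchCreate watchKind = iota
	watchUpdate
	watchDelete
)

// driverKind models the kind of event handed to the network driver.
// Hypothetical type for illustration only.
type driverKind int

const (
	driverCreate driverKind = iota
	driverUpdate
	driverDelete
)

func (k driverKind) String() string {
	switch k {
	case driverCreate:
		return "Create"
	case driverUpdate:
		return "Update"
	default:
		return "Delete"
	}
}

// buggyTranslate mirrors the mistake described above: updates are
// forwarded to the driver as deletes.
func buggyTranslate(k watchKind) driverKind {
	switch k {
	case watchCreate:
		return driverCreate
	case watchUpdate:
		return driverDelete // bug: an update is not a delete
	default:
		return driverDelete
	}
}

// fixedTranslate forwards updates as updates.
func fixedTranslate(k watchKind) driverKind {
	switch k {
	case watchCreate:
		return driverCreate
	case watchUpdate:
		return driverUpdate
	default:
		return driverDelete
	}
}

func main() {
	fmt.Println("buggy:", buggyTranslate(watchUpdate))
	fmt.Println("fixed:", fixedTranslate(watchUpdate))
}
```

Create and delete events map the same way in both versions; only the update arm differs.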

- How to verify it
By inspection? I can't find a way to reproducibly trigger the buggy behaviour.

- Human readable description for the release notes

- A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Cory Snider <csnider@mirantis.com>
Member

@thaJeztah thaJeztah left a comment

🙈

LGTM

@thaJeztah thaJeztah merged commit bb858f3 into moby:master Jun 19, 2025
325 of 372 checks passed
@corhere corhere deleted the libn/fix-networkdb-dns-update-delete branch June 19, 2025 22:11
Development

Successfully merging this pull request may close these issues.

28.2.2 Some Swarm services are not discoverable over DNS