Skip to content

libnetwork/d/overlay: properly model peer db#50106

Merged
robmry merged 3 commits intomoby:masterfrom
corhere:libn/overlay-peerdb-soundness
Jun 30, 2025
Merged

libnetwork/d/overlay: properly model peer db#50106
robmry merged 3 commits intomoby:masterfrom
corhere:libn/overlay-peerdb-soundness

Conversation

@corhere
Copy link
Contributor

@corhere corhere commented May 29, 2025

- What I did
The overlay network driver's peer database is responsible for configuring the data plane in response to control-plane updates to the NetworkDB overlay_peer_table. It maintains the Forwarding Database (FDB) for the VXLAN interface which maps destination MAC addresses to the VXLAN Tunnel Endpoint (VTEP) IP:port of the remote host which encapsulated packets should be forwarded onto, and the neighbour tables which map IP addresses to MAC addresses. Note that when a userland process sends a packet addressed to the IP address of a remote endpoint, the kernel first resolves the MAC address for the IP when the packet enters the network stack, then later resolves the VTEP from the MAC address when forwarding the packet out the VXLAN interface. The kernel is therefore capable of handling n:1 mappings of IP->MAC and n:1 mappings of MAC->VTEP.

The overlay peer db receives control-plane updates in the form of (Endpoint-ID, IP, MAC, VTEP) tuples. The trouble is that it assumes the overlay_peer_table will eventually converge to a state where each entry has a unique IP address and MAC address. The peer db deletes both the IP->MAC neighbour entry and the MAC->VTEP FDB entry when a peer entry is deleted from overlay_peer_table, without checking whether any other neighbour entries still need the FDB entry. Consequently, the data plane can fall out of sync with the control plane if the assumption is violated.

I modified the overlay driver's peer database to correctly handle cases where multiple endpoint IP addresses map to the same MAC address, or multiple MAC addresses map to the same VTEP. This change lifts one of the major limitations which has been blocking our ability to implement support for IPv6 endpoints on overlay networks.

- How I did it
With reference counting. See the individual commit messages for more detail.

- How to verify it

  1. Create a multi-node Swarm cluster.
  2. Create a user-defined encrypted overlay network.
  3. docker service create --mode global --network my-overlay nginxdemos/hello:plain-text
  4. Exec into one of the service containers and verify that the other containers can be curl'ed by the IP address of its user-defined overlay network endpoint. Repeat with the other replicas.
  5. curl port 8080 on each Swarm node. Verify that the service is reachable. Repeat to verify that the requests are load-balanced among the replicas.

- Human readable description for the release notes

- A picture of a cute animal (not mandatory but encouraged)

@corhere corhere added kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. area/networking Networking kind/bugfix PR's that fix bugs area/networking/d/overlay Networking labels May 29, 2025
@corhere corhere force-pushed the libn/overlay-peerdb-soundness branch from ec20909 to 4a99260 Compare June 23, 2025 20:06
@corhere corhere marked this pull request as ready for review June 23, 2025 20:06
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a generic reference‐counting map and refactors the overlay driver to use it for VXLAN FDB entries, while also updating the encryption map to track SPI counts.

  • Add countmap.Map[T] with Add semantics and unit tests
  • Refactor peerdb to key by netip.Prefix and maintain FDB entries via refcounts
  • Change encryption map to encrNode{spi []spi, count} and update related methods

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
internal/countmap/countmap.go Add generic Map[T] type with Add method for reference counting
internal/countmap/countmap_test.go Unit tests for countmap.Map behavior
drivers/overlay/peerdb.go Switch peer keys to netip.Prefix, implement FDB refcount logic
drivers/overlay/ov_network.go Initialize per‐network fdbCnt with countmap.Map[ipmac]
drivers/overlay/overlay.go Change secMap.nodes to map[netip.Addr]encrNode
drivers/overlay/encryption.go Introduce encrNode with count, update setup/remove/walk logic
Comments suppressed due to low confidence (1)

libnetwork/drivers/overlay/encryption.go:169

  • [nitpick] Local variable spi shadows the spi type and can be confusing. Consider renaming the variable to spis or spiList for clarity.
	spi := func() []spi {

return nil
// Delete neighbor entry for the peer IP
if err := sbox.DeleteNeighbor(peerIP.Addr().AsSlice(), peerMac, osl.WithLinkName(s.vxlanName), osl.WithFamily(syscall.AF_BRIDGE)); err != nil {
return fmt.Errorf("could not delete neighbor entry in the sandbox:%v", err)
Copy link

Copilot AI Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The formatting is missing a space before %v and doesn’t wrap the error. Use "could not delete neighbor entry in the sandbox: %w" to improve readability and error chaining.

Suggested change
return fmt.Errorf("could not delete neighbor entry in the sandbox:%v", err)
return fmt.Errorf("could not delete neighbor entry in the sandbox: %w", err)

Copilot uses AI. Check for mistakes.
corhere added 3 commits June 24, 2025 13:28
The peer db implementation is more complex than it needs to be.
Notably, the peerCRUD / peerCRUDOp function split is a vestige of its
evolution from a worker goroutine receiving commands over a channel.

Refactor the peer db operations to be easier to read, understand and
modify. Factor the kernel-programming operations out into dedicated
addNeighbor and deleteNeighbor functions. Inline the rest of the
peerCRUDOp functions into their respective peerCRUD wrappers.

Signed-off-by: Cory Snider <csnider@mirantis.com>
The overlay driver assumes that the peer table in NetworkDB will always
converge to a 1:1:1 mapping from peer endpoint IP address to MAC address
to VTEP. While this currently holds true in practice most of the time,
it is not an invariant and there are ways that users can violate this
assumption.

The driver detects whether peer entries conflict with each other by
matching up (IP, MAC) tuples. In the common case this works out fine as
the MAC address for an endpoint is generally derived from the assigned
IP address. If an IP address gets reassigned to a container on another
node the MAC address will follow, so the driver's conflict resolution
logic will behave as intended. However users may explicitly configure
the MAC address for a container's network endpoints. If an IP address
gets reassigned from a container with an auto-generated MAC address to a
container with a manually-configured MAC, or vice versa, the driver
would not detect the conflict as the (IP, MAC) tuples won't match up. It
would attempt to program the kernel's neighbor table with two
conflicting MAC addresses for one IP, which will fail. And since it
does not realize that there is a conflict, the driver won't reprogram
the kernel from the remaining entry when the other entry is deleted.

The assumption that only one IP address may resolve to a given MAC
address is violated if multiple IP addresses are assigned to an
endpoint. This rarely comes up in practice today as the overlay driver
only supports IPv4 single-stack connectivity for endpoints. If multiple
distinct peer entries exist with the same MAC address, the driver will
delete the MAC->VTEP mapping from the kernel's forwarding database when
any entry is deleted, even if other entries remain active. This
limitation is one of the biggest obstacles in the way of supporting IPv6
and dual-stack connectivity for endpoints attached to overlay networks.

Modify the peer db logic to correctly handle the cases where peer
entries have non-unique MAC or VTEP values. Treat any set of entries
with non-unique IP addresses as a conflict, irrespective of the entries'
MAC addresses. Maintain a reference count of forwarding database entries
and only delete the MAC->VTEP mapping from the kernel when there are no
longer any neighbor entries which resolve to that MAC.

Signed-off-by: Cory Snider <csnider@mirantis.com>
The IPsec encryption parameters (Security Association Database and
Security Policy Database entries) for a particular overlay network peer
(VTEP) are shared global state as they have to be programmed into the
root network namespace. The same parameters are used when encrypting
VXLAN traffic to a particular VTEP for all overlay networks. Deleting
the entries for a VTEP will break encryption to that VTEP across all
encrypted overlay networks, therefore the decision of when to delete the
entries must take the state of all overlay networks into account.
Unfortunately this is not the case.

The overlay driver uses local per-network state to decide when to
program and delete the parameters for a VTEP. In practice, the
parameters for all VTEPs participating in an encrypted overlay network
are deleted when the network is deleted. Encryption to that VTEP over
all other active encrypted overlay networks would be broken until some
other incidental peerDB event triggered a re-programming of the
parameters for that VTEP.

Change the setupEncryption and removeEncryption functions to be
reference-counted. The removeEncryption function needs to be called the
same number of times as addEncryption before the parameters are deleted
from the kernel.

Signed-off-by: Cory Snider <csnider@mirantis.com>
@corhere corhere force-pushed the libn/overlay-peerdb-soundness branch from 3c69539 to 057e35d Compare June 24, 2025 17:31
@corhere corhere added this to the 29.0.0 milestone Jun 25, 2025
@dmcgowan dmcgowan moved this from New to Complete in 🔦 Maintainer spotlight Jun 26, 2025
@robmry robmry merged commit 9cb179d into moby:master Jun 30, 2025
228 of 229 checks passed
@robmry
Copy link
Contributor

robmry commented Jun 30, 2025

I merged this - but wondered if there's an integration test that covers the test steps? Or, if it'd be possible to add one?

@corhere
Copy link
Contributor Author

corhere commented Jul 2, 2025

I merged this - but wondered if there's an integration test that covers the test steps? Or, if it'd be possible to add one?

It's possible to write an integration test in theory, but the daemons under test would need to be running in separate network namespaces to give each instance of the overlay driver its own sandbox to play in. We don't have any test harnesses to start multiple daemons in separate network namespaces, making it non-trivial to write end-to-end tests for the overlay driver.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/networking/d/overlay Networking area/networking Networking kind/bugfix PR's that fix bugs kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. process/cherry-pick/25.0

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

4 participants