libnetwork/d/overlay: properly model peer db#50106
Conversation
ec20909 to
4a99260
Compare
There was a problem hiding this comment.
Pull Request Overview
This PR introduces a generic reference‐counting map and refactors the overlay driver to use it for VXLAN FDB entries, while also updating the encryption map to track SPI counts.
- Add
countmap.Map[T]withAddsemantics and unit tests - Refactor
peerdbto key bynetip.Prefixand maintain FDB entries via refcounts - Change encryption map to
encrNode{spi []spi, count}and update related methods
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/countmap/countmap.go | Add generic Map[T] type with Add method for reference counting |
| internal/countmap/countmap_test.go | Unit tests for countmap.Map behavior |
| drivers/overlay/peerdb.go | Switch peer keys to netip.Prefix, implement FDB refcount logic |
| drivers/overlay/ov_network.go | Initialize per‐network fdbCnt with countmap.Map[ipmac] |
| drivers/overlay/overlay.go | Change secMap.nodes to map[netip.Addr]encrNode |
| drivers/overlay/encryption.go | Introduce encrNode with count, update setup/remove/walk logic |
Comments suppressed due to low confidence (1)
libnetwork/drivers/overlay/encryption.go:169
- [nitpick] Local variable
spishadows thespitype and can be confusing. Consider renaming the variable tospisorspiListfor clarity.
spi := func() []spi {
| return nil | ||
| // Delete neighbor entry for the peer IP | ||
| if err := sbox.DeleteNeighbor(peerIP.Addr().AsSlice(), peerMac, osl.WithLinkName(s.vxlanName), osl.WithFamily(syscall.AF_BRIDGE)); err != nil { | ||
| return fmt.Errorf("could not delete neighbor entry in the sandbox:%v", err) |
There was a problem hiding this comment.
The formatting is missing a space before %v and doesn’t wrap the error. Use "could not delete neighbor entry in the sandbox: %w" to improve readability and error chaining.
| return fmt.Errorf("could not delete neighbor entry in the sandbox:%v", err) | |
| return fmt.Errorf("could not delete neighbor entry in the sandbox: %w", err) |
4a99260 to
3c69539
Compare
The peer db implementation is more complex than it needs to be. Notably, the peerCRUD / peerCRUDOp function split is a vestige of its evolution from a worker goroutine receiving commands over a channel. Refactor the peer db operations to be easier to read, understand and modify. Factor the kernel-programming operations out into dedicated addNeighbor and deleteNeighbor functions. Inline the rest of the peerCRUDOp functions into their respective peerCRUD wrappers. Signed-off-by: Cory Snider <csnider@mirantis.com>
The overlay driver assumes that the peer table in NetworkDB will always converge to a 1:1:1 mapping from peer endpoint IP address to MAC address to VTEP. While this currently holds true in practice most of the time, it is not an invariant and there are ways that users can violate this assumption. The driver detects whether peer entries conflict with each other by matching up (IP, MAC) tuples. In the common case this works out fine as the MAC address for an endpoint is generally derived from the assigned IP address. If an IP address gets reassigned to a container on another node the MAC address will follow, so the driver's conflict resolution logic will behave as intended. However users may explicitly configure the MAC address for a container's network endpoints. If an IP address gets reassigned from a container with an auto-generated MAC address to a container with a manually-configured MAC, or vice versa, the driver would not detect the conflict as the (IP, MAC) tuples won't match up. It would attempt to program the kernel's neighbor table with two conflicting MAC addresses for one IP, which will fail. And since it does not realize that there is a conflict, the driver won't reprogram the kernel from the remaining entry when the other entry is deleted. The assumption that only one IP address may resolve to a given MAC address is violated if multiple IP addresses are assigned to an endpoint. This rarely comes up in practice today as the overlay driver only supports IPv4 single-stack connectivity for endpoints. If multiple distinct peer entries exist with the same MAC address, the driver will delete the MAC->VTEP mapping from the kernel's forwarding database when any entry is deleted, even if other entries remain active. This limitation is one of the biggest obstacles in the way of supporting IPv6 and dual-stack connectivity for endpoints attached to overlay networks. Modify the peer db logic to correctly handle the cases where peer entries have non-unique MAC or VTEP values. Treat any set of entries with non-unique IP addresses as a conflict, irrespective of the entries' MAC addresses. Maintain a reference count of forwarding database entries and only delete the MAC->VTEP mapping from the kernel when there are no longer any neighbor entries which resolve to that MAC. Signed-off-by: Cory Snider <csnider@mirantis.com>
The IPsec encryption parameters (Security Association Database and Security Policy Database entries) for a particular overlay network peer (VTEP) are shared global state as they have to be programmed into the root network namespace. The same parameters are used when encrypting VXLAN traffic to a particular VTEP for all overlay networks. Deleting the entries for a VTEP will break encryption to that VTEP across all encrypted overlay networks, therefore the decision of when to delete the entries must take the state of all overlay networks into account. Unfortunately this is not the case. The overlay driver uses local per-network state to decide when to program and delete the parameters for a VTEP. In practice, the parameters for all VTEPs participating in an encrypted overlay network are deleted when the network is deleted. Encryption to that VTEP over all other active encrypted overlay networks would be broken until some other incidental peerDB event triggered a re-programming of the parameters for that VTEP. Change the setupEncryption and removeEncryption functions to be reference-counted. The removeEncryption function needs to be called the same number of times as addEncryption before the parameters are deleted from the kernel. Signed-off-by: Cory Snider <csnider@mirantis.com>
3c69539 to
057e35d
Compare
|
I merged this - but wondered if there's an integration test that covers the test steps? Or, if it'd be possible to add one? |
It's possible to write an integration test in theory, but the daemons under test would need to be running in separate network namespaces to give each instance of the overlay driver its own sandbox to play in. We don't have any test harnesses to start multiple daemons in separate network namespaces, making it non-trivial to write end-to-end tests for the overlay driver. |
- What I did
The overlay network driver's peer database is responsible for configuring the data plane in response to control-plane updates to the NetworkDB
overlay_peer_table. It maintains the Forwarding Database (FDB) for the VXLAN interface which maps destination MAC addresses to the VXLAN Tunnel Endpoint (VTEP) IP:port of the remote host which encapsulated packets should be forwarded onto, and the neighbour tables which map IP addresses to MAC addresses. Note that when a userland process sends a packet addressed to the IP address of a remote endpoint, the kernel first resolves the MAC address for the IP when the packet enters the network stack, then later resolves the VTEP from the MAC address when forwarding the packet out the VXLAN interface. The kernel is therefore capable of handling n:1 mappings of IP->MAC and n:1 mappings of MAC->VTEP.The overlay peer db receives control-plane updates in the form of (Endpoint-ID, IP, MAC, VTEP) tuples. The trouble is that it assumes the
overlay_peer_tablewill eventually converge to a state where each entry has a unique IP address and MAC address. The peer db deletes both the IP->MAC neighbour entry and the MAC->VTEP FDB entry when a peer entry is deleted fromoverlay_peer_table, without checking whether any other neighbour entries still need the FDB entry. Consequently, the data plane can fall out of sync with the control plane if the assumption is violated.I modified the overlay driver's peer database to correctly handle cases where multiple endpoint IP addresses map to the same MAC address, or multiple MAC addresses map to the same VTEP. This change lifts one of the major limitations which has been blocking our ability to implement support for IPv6 endpoints on overlay networks.
- How I did it
With reference counting. See the individual commit messages for more detail.
- How to verify it
docker service create --mode global --network my-overlay nginxdemos/hello:plain-text- Human readable description for the release notes
- A picture of a cute animal (not mandatory but encouraged)