libnetwork/d/overlay: fix logical races & other soundness bugs#50081
libnetwork/d/overlay: fix logical races & other soundness bugs#50081corhere wants to merge 12 commits intomoby:masterfrom
Conversation
c5e4c58 to
d0d4ae8
Compare
There was a problem hiding this comment.
Pull Request Overview
This PR refactors the overlay network driver to eliminate logical races and soundness bugs by introducing clearer lock hierarchies, parameterizing internal data structures with generics, and simplifying neighbor entry management.
- Convert
setmatrixand related maps to use Go generics for key/value types. - Rewrite neighbor add/delete to use a common helper (
nlNeigh) and improve error reporting. - Restructure locking in the overlay driver and network to prevent deadlocks and races, and rework the encryption map to track reference counts.
Reviewed Changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| service.go | Parameterize ipToEndpoint with generics |
| libnetwork/osl/neigh_linux.go | Simplify neighbor logic and error formatting |
| libnetwork/osl/namespace_linux.go | Remove obsolete neighbors cache |
| libnetwork/osl/interface_linux.go | Use direct nlHandle and streamline lock usage |
| libnetwork/network.go | Update setmatrix usage for service maps |
| libnetwork/internal/setmatrix/*.go | Convert SetMatrix to generic SetMatrix[K,V] |
| libnetwork/internal/countmap/*.go | Add countmap.Map type and tests |
| libnetwork/drivers/overlay/*.go & *.test.go | Major refactor: locking, encryption map, peer operations |
Comments suppressed due to low confidence (1)
libnetwork/drivers/overlay/peerdb_test.go:1
- [nitpick] The
peerdb_test.gofile was removed, potentially leaving peer database code untested. Consider reintroducing equivalent tests to maintain coverage for peer marshalling/unmarshalling behavior.
//go:build linux
| } | ||
|
|
||
| func insertDeleteRotuine(ctx context.Context, endCh chan int, s *SetMatrix[string], key, value string) { | ||
| func insertDeleteRotuine(ctx context.Context, endCh chan int, s *SetMatrix[string, string], key, value string) { |
There was a problem hiding this comment.
[nitpick] The function name insertDeleteRotuine has a typo; it should be insertDeleteRoutine for clarity.
| func insertDeleteRotuine(ctx context.Context, endCh chan int, s *SetMatrix[string, string], key, value string) { | |
| func insertDeleteRoutine(ctx context.Context, endCh chan int, s *SetMatrix[string, string], key, value string) { |
349cd45 to
552d04a
Compare
thaJeztah
left a comment
There was a problem hiding this comment.
made a first pass, but there's a LOT in this PR, so will need to find time to make another pass.
Perhaps splitting off some of the first few commits (which are fairly trivial) to get those merged already?
libnetwork/drivers/overlay/peerdb.go
Outdated
| s := n.getSubnetforIP(IP) | ||
| if s == nil { | ||
| return fmt.Errorf("couldn't find the subnet %q in network %q", IP.String(), n.id) |
There was a problem hiding this comment.
Nits / thoughts; looks like we use could not in most places - worth considering for consistency.
Also contemplating if (not for this PR, which is large enough already) should have getSubnetforIP (there's multiple implementations though) return an error. It looks like there's multiple places where nil is converted to an error;
return fmt.Errorf("could not find subnet for endpoint %s", eid)
fmt.Errorf("no matching subnet for IP %q in network %q", ep.addr, nid)
return fmt.Errorf("couldn't find the subnet %q in network %q", IP.String(), n.id)And indirectly;
fmt.Errorf("could not find a valid ipv4 subnet for endpoint %s", eid)
fmt.Errorf("could not find a valid ipv6 subnet for endpoint %s", eid)| ) | ||
|
|
||
| // NeighborSearchError indicates that the neighbor is already present | ||
| type NeighborSearchError struct { |
There was a problem hiding this comment.
For those wondering if this must be an exported error; it looks like this is indeed used as sentinel error currently;
moby/libnetwork/drivers/overlay/peerdb.go
Lines 325 to 330 in 0342985
moby/libnetwork/drivers/overlay/peerdb.go
Lines 380 to 384 in 0342985
| func (n NeighborSearchError) Error() string { | ||
| var b strings.Builder | ||
| b.WriteString("neighbor entry ") | ||
| if n.present { |
There was a problem hiding this comment.
Not a blocker, but was mostly looking here for trade-offs between performance-optimisation (strings.Builder) and discoverability (being able to grep for the (partial) error message); wasn't sure how much this error is in a "hot path".
libnetwork/osl/neigh_linux.go
Outdated
| log.G(context.TODO()).WithFields(log.Fields{ | ||
| "ip": dstIP, | ||
| "mac": dstMac, | ||
| "ifc": linkName, | ||
| }).WithError(err).Warn("error deleting neighbor entry") |
There was a problem hiding this comment.
Can probably roll up the WithError in the fields;
| log.G(context.TODO()).WithFields(log.Fields{ | |
| "ip": dstIP, | |
| "mac": dstMac, | |
| "ifc": linkName, | |
| }).WithError(err).Warn("error deleting neighbor entry") | |
| log.G(context.TODO()).WithFields(log.Fields{ | |
| "error": err, | |
| "ip": dstIP, | |
| "mac": dstMac, | |
| "ifc": linkName, | |
| }).Warn("error deleting neighbor entry") |
libnetwork/osl/neigh_linux.go
Outdated
| log.G(context.TODO()).WithFields(log.Fields{ | ||
| "ip": dstIP, | ||
| "mac": dstMac, | ||
| "ifc": linkName, | ||
| }).WithError(err).Warn("error deleting dynamic neighbor entry") |
There was a problem hiding this comment.
Same here;
| log.G(context.TODO()).WithFields(log.Fields{ | |
| "ip": dstIP, | |
| "mac": dstMac, | |
| "ifc": linkName, | |
| }).WithError(err).Warn("error deleting dynamic neighbor entry") | |
| log.G(context.TODO()).WithFields(log.Fields{ | |
| "error": err, | |
| "ip": dstIP, | |
| "mac": dstMac, | |
| "ifc": linkName, | |
| }).Warn("error deleting dynamic neighbor entry") |
libnetwork/osl/neigh_linux.go
Outdated
| "ip": dstIP, | ||
| "mac": dstMac, | ||
| "ifc": linkName, | ||
| }).Debugf("Neighbor entry deleted") |
There was a problem hiding this comment.
| }).Debugf("Neighbor entry deleted") | |
| }).Debug("Neighbor entry deleted") |
libnetwork/drivers/overlay/peerdb.go
Outdated
| log.G(context.TODO()).Warnf("Entry already present in db: nid:%s eid:%s peerIP:%v peerMac:%v vtep:%v", | ||
| nid, eid, peerIP, peerMac, vtep) |
There was a problem hiding this comment.
As we're updating; perhaps worth using WithFields for this one as well?
libnetwork/drivers/overlay/peerdb.go
Outdated
| // Note also that this method will atomically loop on the whole table of peers | ||
| // and will program their state in one single atomic operation. | ||
| // This is fundamental to guarantee consistency, and avoid that |
There was a problem hiding this comment.
Nit (but existing); avoid future tense;
| // Note also that this method will atomically loop on the whole table of peers | |
| // and will program their state in one single atomic operation. | |
| // This is fundamental to guarantee consistency, and avoid that | |
| // Note also that this method atomically loops the whole table of peers | |
| // and programs their state in one single atomic operation. | |
| // This is fundamental to guarantee consistency, and avoid that |
|
I'll slice and dice this down to more manageable chunks for review, and maybe rearrange some of the commits |
The overlay driver's checkEncryption function configures the IPSec parameters for the VXLAN tunnels to peer nodes. When called with isLocal=true, it configures encryption for all peer nodes with at least one peerDB entry. Since the local peers are also included in the peerDB, it needs to filter those entries out. It does so by filtering out any peer entries whose VTEP address is equal to the current local advertise address. Trouble is, the local advertise address is not necessarily constant. The driver tries to handle this case by calling peerDBUpdateSelf() when the advertise address changes. This function iterates through the peerDB and tries to update the VTEP address for all local peer entries, but it does not actually do anything: it mutates a temporary copy of the entry which is not persisted back into the peerDB. (It used to be functional, but was broken when the peerDB was extended to use SetMatrix.) So there may be cases where local peer entries are not filtered out properly, resulting in spurious encryption parameters being programmed into the kernel. Filter out local peers when walking the peerDB by filtering on whether the entry has the isLocal flag set. Remove the no-op code which attempts to update local entries in the peerDB. No other code takes any interest in the VTEP value for isLocal peer entries. Signed-off-by: Cory Snider <csnider@mirantis.com>
The VTEP value for a peer in peerDB is only accurate for a remote peer. The VTEP for a local peer would be the driver's advertise address, which is not necessarily constant for the lifetime of the driver instance. The VTEP values persisted in the peerDB entries for local peers could be stale or missing if not kept in sync with the advertise address. And the peerDB could get polluted with duplicate entries for local peers if the advertise address was to change, as entries which differ only by VTEP are considered distinct by SetMatrix. Persisting the advertise address as the VTEP for local peers creates lots of problems that are not easy to solve. Stop persisting the VTEP for local peers in peerDB. Any code that needs to know the VTEP for local peers can look that up from the source of truth: the driver's advertise address. Use the lack of a VTEP in peerDB entries to signify local peers, making the isLocal flag redundant. Signed-off-by: Cory Snider <csnider@mirantis.com>
Drop the isLocal boolean parameters from the peerDB functions. Local
peers have vtep == netip.Addr{}.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Since it is not meaningful to add or remove encryption between the local
node and itself, the isLocal parameter is redundant. Setting up
encryption for all network peers is now invoked by calling
checkEncryption(nid, netip.Addr{}, true)
Calling checkEncryption with isLocal=true, add=false is now more
explicitly a no-op. It always was effectively a no-op, but that was not
easy to spot by inspection. In the world with the isLocal flag,
calls to checkEncryption where isLocal=true and add=false would have rIP
set to d.advertiseAddr. In other words, it was a request to remove
encryption parameters between the local peer and itself if peerDB had no
remote-peer entries for the network. So either the call would do
nothing, or it would remove encryption parameters that aren't used for
anything. Now the equivalent call always does nothing.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The setupEncryption and removeEncryption functions take several parameters, but all call sites pass the same values for all the parameters aside from remoteIP: values taken from fields of the driver struct. Refactor these functions to be methods of the driver struct and drop the redundant parameters. Signed-off-by: Cory Snider <csnider@mirantis.com>
In addition to being three functions in a trenchcoat, the checkEncryption function has a very subtle implementation which is difficult to reason about. That is not a good property for security relevant code to have. Replace two of the three calls to checkEncryption with conditional calls to setupEncryption and removeEncryption, lifting the conditional logic which was hidden away in checkEncryption into the call sites to make it easier to reason about the code. Replace the third call with a call to a new initEncryption function. Signed-off-by: Cory Snider <csnider@mirantis.com>
The (*driver).Join function does many things to set up overlay networking. One of the first things it does is call (*network).joinSandbox, which in turn calls (*driver).initSandboxPeerDB. The initSandboxPeerDB function iterates through the peer db to add entries to the VXLAN FDB, neighbor table and IPsec security association database in the kernel for all known peers on the overlay network. One of the last things the (*driver).Join function does is call (*driver).initEncryption. The initEncryption function iterates through the peer db to add entries to the IPsec security association database in the kernel for all known peers on the overlay network. But the preceding initSandboxPeerDB call already did that! The initEncryption function is redundant and can safely be removed. Signed-off-by: Cory Snider <csnider@mirantis.com>
The peer db implementation is more complex than it needs to be. Notably, the peerCRUD / peerCRUDOp function split is a vestige of its evolution from a worker goroutine receiving commands over a channel. Refactor the peer db operations to be easier to read, understand and modify. Factor the kernel-programming operations out into dedicated addNeighbor and deleteNeighbor functions. Inline the rest of the peerCRUDOp functions into their respective peerCRUD wrappers. Signed-off-by: Cory Snider <csnider@mirantis.com>
The overlay driver assumes that the peer table in NetworkDB will always converge to a 1:1:1 mapping from peer endpoint IP address to MAC address to VTEP. While this currently holds true in practice most of the time, it is not an invariant and there are ways that users can violate this assumption. The driver detects whether peer entries conflict with each other by matching up (IP, MAC) tuples. In the common case this works out fine as the MAC address for an endpoint is generally derived from the assigned IP address. If an IP address gets reassigned to a container on another node the MAC address will follow, so the driver's conflict resolution logic will behave as intended. However users may explicitly configure the MAC address for a container's network endpoints. If an IP address gets reassigned from a container with an auto-generated MAC address to a container with a manually-configured MAC, or vice versa, the driver would not detect the conflict as the (IP, MAC) tuples won't match up. It would attempt to program the kernel's neighbor table with two conflicting MAC addresses for one IP, which will fail. And since it does not realize that there is a conflict, the driver won't reprogram the kernel from the remaining entry when the other entry is deleted. The assumption that only one IP address may resolve to a given MAC address is violated if multiple IP addresses are assigned to an endpoint. This rarely comes up in practice today as the overlay driver only supports IPv4 single-stack connectivity for endpoints. If multiple distinct peer entries exist with the same MAC address, the driver will delete the MAC->VTEP mapping from the kernel's forwarding database when any entry is deleted, even if other entries remain active. This limitation is one of the biggest obstacles in the way of supporting IPv6 and dual-stack connectivity for endpoints attached to overlay networks. Modify the peer db logic to correctly handle the cases where peer entries have non-unique MAC or VTEP values. Treat any set of entries with non-unique IP addresses as a conflict, irrespective of the entries' MAC addresses. Maintain a reference count of forwarding database entries and only delete the MAC->VTEP mapping from the kernel when there are no longer any neighbor entries which resolve to that MAC. Signed-off-by: Cory Snider <csnider@mirantis.com>
The IPsec encryption parameters (Security Association Database and Security Policy Database entries) for a particular overlay network peer (VTEP) are shared global state as they have to be programmed into the root network namespace. The same parameters are used when encrypting VXLAN traffic to a particular VTEP for all overlay networks. Deleting the entries for a VTEP will break encryption to that VTEP across all encrypted overlay networks, therefore the decision of when to delete the entries must take the state of all overlay networks into account. Unfortunately this is not the case. The overlay driver uses local per-network state to decide when to program and delete the parameters for a VTEP. In practice, the parameters for all VTEPs participating in an encrypted overlay network are deleted when the network is deleted. Encryption to that VTEP over all other active encrypted overlay networks would be broken until some other incidental peerDB event triggered a re-programming of the parameters for that VTEP. Change the setupEncryption and removeEncryption functions to be reference-counted. The removeEncryption function needs to be called the same number of times as addEncryption before the parameters are deleted from the kernel. Signed-off-by: Cory Snider <csnider@mirantis.com>
It is easier to find all references when they are struct fields rather than embedded structs. Signed-off-by: Cory Snider <csnider@mirantis.com>
The concurrency control in the overlay driver is logically unsound. While the use of mutexes is sufficient to prevent data races -- violations of the Go memory model -- many operations which need to be atomic are performed with unbounded concurrency. For instance, programming the encryption parameters for a peer can race with encryption keys being updated, which could lead to inconsistencies between the parameters programmed into the kernel and the desired state. Overhaul the use of locks in the overlay network driver. Implement sound locking at the network granularity: operations may proceed concurrently iff they are being applied to distinct networks. Push the responsibility of locking up to the code which calls methods or accesses struct fields to avoid deadlock situations like we had previously with d.initSandboxPeerDB() and to make the code easier to reason about. Each overlay network has a distinct peer db. The NetworkDB watch for the overlay peer table for the network will only start after (*driver).CreateNetwork returns and will be stopped before libnetwork calls (*driver).DeleteNetwork, therefore the lifetime of the peer db for a network is constrained to the lifetime of the network itself. Yet the peer db for a network is tracked in a dedicated map, separately from the network objects themselves. This has resulted in a parallel set of mutexes to manage concurrency of the peer db distinct from the mutexes for the driver and networks. Move the peer db for a network into a field of the network struct and guard it from concurrent access using the per-network lock. Move the methods for manipulating the peer db into the network struct so that the methods can only be called if the caller has a reference to the network object. Network creation and deletion is synchronized using the driver-scope mutex, but some of the kernel programming is performed outside of the critical section. It is possible for network deletion to race with recreating the network, interleaving the kernel programming for the network creation and deletion, resulting in inconsistent kernel state. Parallelize network creation and deletion soundly. Use a double-checked locking scheme to soundly handle the case of concurrent CreateNetwork and DeleteNetwork for the same network id without blocking operations on other networks. Synchronize operations on a network so that operations on the network such as adding a neighbor to the peer db are performed atomically, not interleaved with deleting the network. Handle encryption-key updates in a critical section so they can no longer be interleaved with kernel programming of encryption parameters. Signed-off-by: Cory Snider <csnider@mirantis.com>
552d04a to
a4a2a4b
Compare
|
Closing; superseded by the broken-out PRs |
- What I did
I fixed a pile of subtle issues in the overlay network driver while also refactoring it to get rid of unneeded complexity. Some of the issues fixed were also blockers for supporting IPv6 connectivity for containers! (There is still more to be done to enable IPv6, but this gets us quite a ways!)
- How I did it
Lock hierarchies, reference counts and refactoring. See the individual commit messages for more detail.
- How to verify it
docker service create --mode global --network my-overlay -p 80:8080 nginxdemos/hello:plain-text- Human readable description for the release notes
- Improve the reliability of the overlay network driver- A picture of a cute animal (not mandatory but encouraged)