Skip to content

libnetwork/d/overlay: fix logical races & other soundness bugs#50081

Closed
corhere wants to merge 12 commits intomoby:masterfrom
corhere:libn/overlay-soundness
Closed

libnetwork/d/overlay: fix logical races & other soundness bugs#50081
corhere wants to merge 12 commits intomoby:masterfrom
corhere:libn/overlay-soundness

Conversation

@corhere
Copy link
Contributor

@corhere corhere commented May 26, 2025

- What I did
I fixed a pile of subtle issues in the overlay network driver while also refactoring it to get rid of unneeded complexity. Some of the issues fixed were also blockers for supporting IPv6 connectivity for containers! (There is still more to be done to enable IPv6, but this gets us quite a ways!)

- How I did it
Lock hierarchies, reference counts and refactoring. See the individual commit messages for more detail.

- How to verify it

  1. Create a multi-node Swarm cluster.
  2. Create a user-defined overlay network.
  3. docker service create --mode global --network my-overlay -p 80:8080 nginxdemos/hello:plain-text
  4. Exec into one of the service containers and verify that the other containers can be curl'ed by the IP address of its user-defined overlay network endpoint. Repeat with the other replicas.
  5. curl port 8080 on each Swarm node. Verify that the service is reachable. Repeat to verify that the requests are load-balanced among the replicas.

- Human readable description for the release notes

- Improve the reliability of the overlay network driver

- A picture of a cute animal (not mandatory but encouraged)

@corhere corhere added kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. area/networking Networking impact/changelog kind/bugfix PR's that fix bugs kind/refactor PR's that refactor, or clean-up code area/networking/d/overlay Networking labels May 26, 2025
@corhere corhere force-pushed the libn/overlay-soundness branch from c5e4c58 to d0d4ae8 Compare May 26, 2025 18:19
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the overlay network driver to eliminate logical races and soundness bugs by introducing clearer lock hierarchies, parameterizing internal data structures with generics, and simplifying neighbor entry management.

  • Convert setmatrix and related maps to use Go generics for key/value types.
  • Rewrite neighbor add/delete to use a common helper (nlNeigh) and improve error reporting.
  • Restructure locking in the overlay driver and network to prevent deadlocks and races, and rework the encryption map to track reference counts.

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
service.go Parameterize ipToEndpoint with generics
libnetwork/osl/neigh_linux.go Simplify neighbor logic and error formatting
libnetwork/osl/namespace_linux.go Remove obsolete neighbors cache
libnetwork/osl/interface_linux.go Use direct nlHandle and streamline lock usage
libnetwork/network.go Update setmatrix usage for service maps
libnetwork/internal/setmatrix/*.go Convert SetMatrix to generic SetMatrix[K,V]
libnetwork/internal/countmap/*.go Add countmap.Map type and tests
libnetwork/drivers/overlay/*.go & *.test.go Major refactor: locking, encryption map, peer operations
Comments suppressed due to low confidence (1)

libnetwork/drivers/overlay/peerdb_test.go:1

  • [nitpick] The peerdb_test.go file was removed, potentially leaving peer database code untested. Consider reintroducing equivalent tests to maintain coverage for peer marshalling/unmarshalling behavior.
//go:build linux

}

func insertDeleteRotuine(ctx context.Context, endCh chan int, s *SetMatrix[string], key, value string) {
func insertDeleteRotuine(ctx context.Context, endCh chan int, s *SetMatrix[string, string], key, value string) {
Copy link

Copilot AI May 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The function name insertDeleteRotuine has a typo; it should be insertDeleteRoutine for clarity.

Suggested change
func insertDeleteRotuine(ctx context.Context, endCh chan int, s *SetMatrix[string, string], key, value string) {
func insertDeleteRoutine(ctx context.Context, endCh chan int, s *SetMatrix[string, string], key, value string) {

Copilot uses AI. Check for mistakes.
@corhere corhere force-pushed the libn/overlay-soundness branch 3 times, most recently from 349cd45 to 552d04a Compare May 26, 2025 19:35
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made a first pass, but there's a LOT in this PR, so will need to find time to make another pass.

Perhaps splitting off some of the first few commits (which are fairly trivial) to get those merged already?

Comment on lines +382 to +384
s := n.getSubnetforIP(IP)
if s == nil {
return fmt.Errorf("couldn't find the subnet %q in network %q", IP.String(), n.id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nits / thoughts; looks like we use could not in most places - worth considering for consistency.

Also contemplating if (not for this PR, which is large enough already) should have getSubnetforIP (there's multiple implementations though) return an error. It looks like there's multiple places where nil is converted to an error;

return fmt.Errorf("could not find subnet for endpoint %s", eid)

fmt.Errorf("no matching subnet for IP %q in network %q", ep.addr, nid)

return fmt.Errorf("couldn't find the subnet %q in network %q", IP.String(), n.id)

And indirectly;

fmt.Errorf("could not find a valid ipv4 subnet for endpoint %s", eid)

fmt.Errorf("could not find a valid ipv6 subnet for endpoint %s", eid)

)

// NeighborSearchError indicates that the neighbor is already present
type NeighborSearchError struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For those wondering if this must be an exported error; it looks like this is indeed used as sentinel error currently;

if _, ok := err.(osl.NeighborSearchError); ok && dbEntries > 1 {
// We are in the transient case so only the first configuration is programmed into the kernel
// Upon deletion if the active configuration is deleted the next one from the database will be restored
// Note we are skipping also the next configuration
return nil
}

if _, ok := err.(osl.NeighborSearchError); ok && dbEntries > 0 {
// We fall in here if there is a transient state and if the neighbor that is being deleted
// was never been configured into the kernel (we allow only 1 configuration at the time per <ip,mac> mapping)
return nil
}

Comment on lines +23 to +26
func (n NeighborSearchError) Error() string {
var b strings.Builder
b.WriteString("neighbor entry ")
if n.present {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker, but was mostly looking here for trade-offs between performance-optimisation (strings.Builder) and discoverability (being able to grep for the (partial) error message); wasn't sure how much this error is in a "hot path".

Comment on lines +53 to +57
log.G(context.TODO()).WithFields(log.Fields{
"ip": dstIP,
"mac": dstMac,
"ifc": linkName,
}).WithError(err).Warn("error deleting neighbor entry")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can probably roll up the WithError in the fields;

Suggested change
log.G(context.TODO()).WithFields(log.Fields{
"ip": dstIP,
"mac": dstMac,
"ifc": linkName,
}).WithError(err).Warn("error deleting neighbor entry")
log.G(context.TODO()).WithFields(log.Fields{
"error": err,
"ip": dstIP,
"mac": dstMac,
"ifc": linkName,
}).Warn("error deleting neighbor entry")

Comment on lines +68 to +72
log.G(context.TODO()).WithFields(log.Fields{
"ip": dstIP,
"mac": dstMac,
"ifc": linkName,
}).WithError(err).Warn("error deleting dynamic neighbor entry")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here;

Suggested change
log.G(context.TODO()).WithFields(log.Fields{
"ip": dstIP,
"mac": dstMac,
"ifc": linkName,
}).WithError(err).Warn("error deleting dynamic neighbor entry")
log.G(context.TODO()).WithFields(log.Fields{
"error": err,
"ip": dstIP,
"mac": dstMac,
"ifc": linkName,
}).Warn("error deleting dynamic neighbor entry")

"ip": dstIP,
"mac": dstMac,
"ifc": linkName,
}).Debugf("Neighbor entry deleted")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}).Debugf("Neighbor entry deleted")
}).Debug("Neighbor entry deleted")

Comment on lines +271 to +272
log.G(context.TODO()).Warnf("Entry already present in db: nid:%s eid:%s peerIP:%v peerMac:%v vtep:%v",
nid, eid, peerIP, peerMac, vtep)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we're updating; perhaps worth using WithFields for this one as well?

Comment on lines +225 to +227
// Note also that this method will atomically loop on the whole table of peers
// and will program their state in one single atomic operation.
// This is fundamental to guarantee consistency, and avoid that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (but existing); avoid future tense;

Suggested change
// Note also that this method will atomically loop on the whole table of peers
// and will program their state in one single atomic operation.
// This is fundamental to guarantee consistency, and avoid that
// Note also that this method atomically loops the whole table of peers
// and programs their state in one single atomic operation.
// This is fundamental to guarantee consistency, and avoid that

@corhere
Copy link
Contributor Author

corhere commented May 27, 2025

I'll slice and dice this down to more manageable chunks for review, and maybe rearrange some of the commits

@corhere corhere marked this pull request as draft May 27, 2025 19:26
corhere added 7 commits May 29, 2025 14:13
The overlay driver's checkEncryption function configures the IPSec
parameters for the VXLAN tunnels to peer nodes. When called with
isLocal=true, it configures encryption for all peer nodes with at least
one peerDB entry. Since the local peers are also included in the peerDB,
it needs to filter those entries out. It does so by filtering out any
peer entries whose VTEP address is equal to the current local advertise
address. Trouble is, the local advertise address is not necessarily
constant. The driver tries to handle this case by calling
peerDBUpdateSelf() when the advertise address changes. This function
iterates through the peerDB and tries to update the VTEP address for all
local peer entries, but it does not actually do anything: it mutates a
temporary copy of the entry which is not persisted back into the peerDB.
(It used to be functional, but was broken when the peerDB was extended
to use SetMatrix.) So there may be cases where local peer entries are
not filtered out properly, resulting in spurious encryption parameters
being programmed into the kernel.

Filter out local peers when walking the peerDB by filtering on whether
the entry has the isLocal flag set. Remove the no-op code which attempts
to update local entries in the peerDB. No other code takes any interest
in the VTEP value for isLocal peer entries.

Signed-off-by: Cory Snider <csnider@mirantis.com>
The VTEP value for a peer in peerDB is only accurate for a remote peer.
The VTEP for a local peer would be the driver's advertise address, which
is not necessarily constant for the lifetime of the driver instance.
The VTEP values persisted in the peerDB entries for local peers could be
stale or missing if not kept in sync with the advertise address. And the
peerDB could get polluted with duplicate entries for local peers if the
advertise address was to change, as entries which differ only by VTEP
are considered distinct by SetMatrix. Persisting the advertise address
as the VTEP for local peers creates lots of problems that are not easy
to solve.

Stop persisting the VTEP for local peers in peerDB. Any code that needs
to know the VTEP for local peers can look that up from the source of
truth: the driver's advertise address. Use the lack of a VTEP in peerDB
entries to signify local peers, making the isLocal flag redundant.

Signed-off-by: Cory Snider <csnider@mirantis.com>
Drop the isLocal boolean parameters from the peerDB functions. Local
peers have vtep == netip.Addr{}.

Signed-off-by: Cory Snider <csnider@mirantis.com>
Since it is not meaningful to add or remove encryption between the local
node and itself, the isLocal parameter is redundant. Setting up
encryption for all network peers is now invoked by calling

    checkEncryption(nid, netip.Addr{}, true)

Calling checkEncryption with isLocal=true, add=false is now more
explicitly a no-op. It always was effectively a no-op, but that was not
easy to spot by inspection. In the world with the isLocal flag,
calls to checkEncryption where isLocal=true and add=false would have rIP
set to d.advertiseAddr. In other words, it was a request to remove
encryption parameters between the local peer and itself if peerDB had no
remote-peer entries for the network. So either the call would do
nothing, or it would remove encryption parameters that aren't used for
anything. Now the equivalent call always does nothing.

Signed-off-by: Cory Snider <csnider@mirantis.com>
The setupEncryption and removeEncryption functions take several
parameters, but all call sites pass the same values for all the
parameters aside from remoteIP: values taken from fields of the driver
struct. Refactor these functions to be methods of the driver struct and
drop the redundant parameters.

Signed-off-by: Cory Snider <csnider@mirantis.com>
In addition to being three functions in a trenchcoat, the
checkEncryption function has a very subtle implementation which is
difficult to reason about. That is not a good property for security
relevant code to have.

Replace two of the three calls to checkEncryption with conditional calls
to setupEncryption and removeEncryption, lifting the conditional logic
which was hidden away in checkEncryption into the call sites to make it
easier to reason about the code. Replace the third call with a call to a
new initEncryption function.

Signed-off-by: Cory Snider <csnider@mirantis.com>
The (*driver).Join function does many things to set up overlay
networking. One of the first things it does is call
(*network).joinSandbox, which in turn calls (*driver).initSandboxPeerDB.
The initSandboxPeerDB function iterates through the peer db to add
entries to the VXLAN FDB, neighbor table and IPsec security association
database in the kernel for all known peers on the overlay network.

One of the last things the (*driver).Join function does is call
(*driver).initEncryption. The initEncryption function iterates through
the peer db to add entries to the IPsec security association database in
the kernel for all known peers on the overlay network. But the preceding
initSandboxPeerDB call already did that! The initEncryption function is
redundant and can safely be removed.

Signed-off-by: Cory Snider <csnider@mirantis.com>
corhere added 2 commits May 29, 2025 14:17
The peer db implementation is more complex than it needs to be.
Notably, the peerCRUD / peerCRUDOp function split is a vestige of its
evolution from a worker goroutine receiving commands over a channel.

Refactor the peer db operations to be easier to read, understand and
modify. Factor the kernel-programming operations out into dedicated
addNeighbor and deleteNeighbor functions. Inline the rest of the
peerCRUDOp functions into their respective peerCRUD wrappers.

Signed-off-by: Cory Snider <csnider@mirantis.com>
The overlay driver assumes that the peer table in NetworkDB will always
converge to a 1:1:1 mapping from peer endpoint IP address to MAC address
to VTEP. While this currently holds true in practice most of the time,
it is not an invariant and there are ways that users can violate this
assumption.

The driver detects whether peer entries conflict with each other by
matching up (IP, MAC) tuples. In the common case this works out fine as
the MAC address for an endpoint is generally derived from the assigned
IP address. If an IP address gets reassigned to a container on another
node the MAC address will follow, so the driver's conflict resolution
logic will behave as intended. However users may explicitly configure
the MAC address for a container's network endpoints. If an IP address
gets reassigned from a container with an auto-generated MAC address to a
container with a manually-configured MAC, or vice versa, the driver
would not detect the conflict as the (IP, MAC) tuples won't match up. It
would attempt to program the kernel's neighbor table with two
conflicting MAC addresses for one IP, which will fail. And since it
does not realize that there is a conflict, the driver won't reprogram
the kernel from the remaining entry when the other entry is deleted.

The assumption that only one IP address may resolve to a given MAC
address is violated if multiple IP addresses are assigned to an
endpoint. This rarely comes up in practice today as the overlay driver
only supports IPv4 single-stack connectivity for endpoints. If multiple
distinct peer entries exist with the same MAC address, the driver will
delete the MAC->VTEP mapping from the kernel's forwarding database when
any entry is deleted, even if other entries remain active. This
limitation is one of the biggest obstacles in the way of supporting IPv6
and dual-stack connectivity for endpoints attached to overlay networks.

Modify the peer db logic to correctly handle the cases where peer
entries have non-unique MAC or VTEP values. Treat any set of entries
with non-unique IP addresses as a conflict, irrespective of the entries'
MAC addresses. Maintain a reference count of forwarding database entries
and only delete the MAC->VTEP mapping from the kernel when there are no
longer any neighbor entries which resolve to that MAC.

Signed-off-by: Cory Snider <csnider@mirantis.com>
The IPsec encryption parameters (Security Association Database and
Security Policy Database entries) for a particular overlay network peer
(VTEP) are shared global state as they have to be programmed into the
root network namespace. The same parameters are used when encrypting
VXLAN traffic to a particular VTEP for all overlay networks. Deleting
the entries for a VTEP will break encryption to that VTEP across all
encrypted overlay networks, therefore the decision of when to delete the
entries must take the state of all overlay networks into account.
Unfortunately this is not the case.

The overlay driver uses local per-network state to decide when to
program and delete the parameters for a VTEP. In practice, the
parameters for all VTEPs participating in an encrypted overlay network
are deleted when the network is deleted. Encryption to that VTEP over
all other active encrypted overlay networks would be broken until some
other incidental peerDB event triggered a re-programming of the
parameters for that VTEP.

Change the setupEncryption and removeEncryption functions to be
reference-counted. The removeEncryption function needs to be called the
same number of times as addEncryption before the parameters are deleted
from the kernel.

Signed-off-by: Cory Snider <csnider@mirantis.com>
corhere added 2 commits May 29, 2025 15:35
It is easier to find all references when they are struct fields rather
than embedded structs.

Signed-off-by: Cory Snider <csnider@mirantis.com>
The concurrency control in the overlay driver is logically unsound.
While the use of mutexes is sufficient to prevent data races --
violations of the Go memory model -- many operations which need to be
atomic are performed with unbounded concurrency. For instance,
programming the encryption parameters for a peer can race with
encryption keys being updated, which could lead to inconsistencies
between the parameters programmed into the kernel and the desired state.

Overhaul the use of locks in the overlay network driver. Implement sound
locking at the network granularity: operations may proceed concurrently
iff they are being applied to distinct networks. Push the responsibility
of locking up to the code which calls methods or accesses struct fields
to avoid deadlock situations like we had previously with
d.initSandboxPeerDB() and to make the code easier to reason about.

Each overlay network has a distinct peer db. The NetworkDB watch for the
overlay peer table for the network will only start after
(*driver).CreateNetwork returns and will be stopped before libnetwork
calls (*driver).DeleteNetwork, therefore the lifetime of the peer db for
a network is constrained to the lifetime of the network itself. Yet the
peer db for a network is tracked in a dedicated map, separately from the
network objects themselves. This has resulted in a parallel set of
mutexes to manage concurrency of the peer db distinct from the mutexes
for the driver and networks. Move the peer db for a network into a field
of the network struct and guard it from concurrent access using the
per-network lock. Move the methods for manipulating the peer db into the
network struct so that the methods can only be called if the caller has
a reference to the network object.

Network creation and deletion is synchronized using the driver-scope
mutex, but some of the kernel programming is performed outside of the
critical section. It is possible for network deletion to race with
recreating the network, interleaving the kernel programming for the
network creation and deletion, resulting in inconsistent kernel state.
Parallelize network creation and deletion soundly. Use a double-checked
locking scheme to soundly handle the case of concurrent CreateNetwork
and DeleteNetwork for the same network id without blocking operations
on other networks. Synchronize operations on a network so that
operations on the network such as adding a neighbor to the peer db are
performed atomically, not interleaved with deleting the network.

Handle encryption-key updates in a critical section so they can no
longer be interleaved with kernel programming of encryption parameters.

Signed-off-by: Cory Snider <csnider@mirantis.com>
@corhere corhere force-pushed the libn/overlay-soundness branch from 552d04a to a4a2a4b Compare May 29, 2025 19:37
@corhere
Copy link
Contributor Author

corhere commented Jun 23, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/networking/d/overlay Networking area/networking Networking impact/changelog kind/bugfix PR's that fix bugs kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. kind/refactor PR's that refactor, or clean-up code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants