Skip to content

libnetwork/d/overlay: fix logical race conditions#50260

Merged
corhere merged 4 commits intomoby:masterfrom
corhere:libn/overlay-fix-logical-races
Jul 10, 2025
Merged

libnetwork/d/overlay: fix logical race conditions#50260
corhere merged 4 commits intomoby:masterfrom
corhere:libn/overlay-fix-logical-races

Conversation

@corhere
Copy link
Contributor

@corhere corhere commented Jun 23, 2025

- What I did
- How I did it

The concurrency control in the overlay driver is logically unsound. While the use of mutexes is sufficient to prevent data races -- violations of the Go memory model -- many operations which need to be atomic are performed with unbounded concurrency. For instance, programming the encryption parameters for a peer can race with encryption keys being updated, which could lead to inconsistencies between the parameters programmed into the kernel and the desired state.

Overhaul the use of locks in the overlay network driver. Implement sound locking at the network granularity: operations may proceed concurrently iff they are being applied to distinct networks. Push the responsibility of locking up to the code which calls methods or accesses struct fields to avoid deadlock situations like we had previously with d.initSandboxPeerDB() and to make the code easier to reason about.

Each overlay network has a distinct peer db. The NetworkDB watch for the overlay peer table for the network will only start after (*driver).CreateNetwork returns and will be stopped before libnetwork calls (*driver).DeleteNetwork, therefore the lifetime of the peer db for a network is constrained to the lifetime of the network itself. Yet the peer db for a network is tracked in a dedicated map, separately from the network objects themselves. This has resulted in a parallel set of mutexes to manage concurrency of the peer db distinct from the mutexes for the driver and networks. Move the peer db for a network into a field of the network struct and guard it from concurrent access using the per-network lock. Move the methods for manipulating the peer db into the network struct so that the methods can only be called if the caller has a reference to the network object.

Network creation and deletion is synchronized using the driver-scope mutex, but some of the kernel programming is performed outside of the critical section. It is possible for network deletion to race with recreating the network, interleaving the kernel programming for the network creation and deletion, resulting in inconsistent kernel state. Parallelize network creation and deletion soundly. Use a double-checked locking scheme to soundly handle the case of concurrent CreateNetwork and DeleteNetwork for the same network id without blocking operations on other networks. Synchronize operations on a network so that operations on the network such as adding a neighbor to the peer db are performed atomically, not interleaved with deleting the network.

Handle encryption-key updates in a critical section so they can no longer be interleaved with kernel programming of encryption parameters.

- How to verify it

  1. Create a multi-node Swarm cluster.
  2. Create a user-defined overlay network.
  3. docker service create --mode global --network my-overlay -p 80:8080 nginxdemos/hello:plain-text
  4. Exec into one of the service containers and verify that the other containers can be curl'ed by the IP address of its user-defined overlay network endpoint. Repeat with the other replicas.
  5. curl port 8080 on each Swarm node. Verify that the service is reachable. Repeat to verify that the requests are load-balanced among the replicas.

- Human readable description for the release notes

- Improve the reliability of the overlay network driver

- A picture of a cute animal (not mandatory but encouraged)

@corhere corhere added this to the 29.0.0 milestone Jun 23, 2025
@corhere corhere added kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. area/networking Networking impact/changelog kind/bugfix PR's that fix bugs kind/refactor PR's that refactor, or clean-up code area/networking/d/overlay Networking labels Jun 23, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Improve overlay driver concurrency by introducing per-network locking, integrating peer databases into network structs, and synchronizing encryption state updates.

  • Added countmap utility for tracking and auto-removing zero counts, with unit tests.
  • Refactored peer database (peerdb.go), moving it into each network and replacing global locks with per-network mu.
  • Introduced driver.mu as the bottom-level lock, implemented lockNetwork for safe network operations, and synchronized encryption via encrMap.mu.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
libnetwork/internal/countmap/countmap.go New generic Map[T] with Add that auto-removes zero entries
libnetwork/internal/countmap/countmap_test.go Tests for Map.Add, verifying count changes and removal
libnetwork/drivers/overlay/peerdb.go Moved peer db into peerMap, removed global lock, refactored Walk, Get, Add, Delete
libnetwork/drivers/overlay/overlay.go Reorganized driver struct, introduced bottom-level mu, replaced old locks
libnetwork/drivers/overlay/ov_network.go Added network.mu, integrated peerdb and fdbCnt, implemented lockNetwork, DCL in creation
libnetwork/drivers/overlay/ov_endpoint.go Switched to lockNetwork for endpoint ops, removed manual lock helpers
libnetwork/drivers/overlay/joinleave.go Updated Join/Leave/EventNotify to use per-network lock and new peer db API
libnetwork/drivers/overlay/encryption.go Added encrMap.mu, refactored encryption key management, synchronized SA/Policy updates
Comments suppressed due to low confidence (3)

libnetwork/drivers/overlay/encryption.go:286

  • [nitpick] The parameter name spi shadows the type spi, which can be confusing; consider renaming the parameter to e.g. sp or idx.
func programSA(localIP, remoteIP net.IP, spi spi, k *key, dir int, add bool) (fSA *netlink.XfrmState, rSA *netlink.XfrmState, lastErr error) {

libnetwork/drivers/overlay/ov_network.go:243

  • Missing space after colon and use %w for wrapping: change to "could not delete neighbor entry in the sandbox: %w".
					"network_id": n.id,

libnetwork/drivers/overlay/peerdb.go:12

  • The import "syscall" is unused after the refactoring and will cause a compile error; please remove it.
	"syscall"

@corhere corhere force-pushed the libn/overlay-fix-logical-races branch from 5aef59b to b15a542 Compare June 24, 2025 17:31
@corhere corhere force-pushed the libn/overlay-fix-logical-races branch from b15a542 to d3c2d7b Compare July 2, 2025 22:11
@corhere corhere marked this pull request as ready for review July 2, 2025 22:11
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I left some thoughts / ideas; happy to hear your thoughts!

if n.secure && len(d.keys) == 0 {
return errors.New("cannot join secure network: encryption keys not present")
if n.secure {
d.secMap.mu.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess to some extent, secMap could follow the same pattern, but mostly to not have to "dive into it's internals to know about the lock"), so not a strong objection / issue.

sm, release := d.lockSecMap()
defer release()
sm.keys = append(....)
// other things

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

secMap doesn't have any methods (aside from implementing the Stringer interface) -- its internals and its interface are one and the same! There is no avoiding "diving into its internals" so I see little benefit to adding boilerplate around locking a mutex.

@@ -172,8 +203,6 @@ func (d *driver) CreateNetwork(ctx context.Context, id string, option map[string
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated, but curious if the TODO above still holds

if err := nInfo.TableEventRegister(ovPeerTable, driverapi.EndpointObject); err != nil {
// XXX Undo writeToStore? No method to so. Why?
return err
}

Looks like there was a n.writeToStore() at the time; b64997e but I don't see that anymore;

if err := n.writeToStore(); err != nil {
return fmt.Errorf("failed to update data store for network %v: %v", n.id, err)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The writeToStore call was removed in 0fa873c. The TODO comment is no longer relevant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@corhere corhere force-pushed the libn/overlay-fix-logical-races branch from d3c2d7b to aa88168 Compare July 7, 2025 21:47
@corhere corhere requested a review from thaJeztah July 8, 2025 12:47
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Left a comment about the walkNodes method; non-blocking, but let me know if they make sense.

@corhere corhere force-pushed the libn/overlay-fix-logical-races branch from aa88168 to f9daf69 Compare July 8, 2025 17:48
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still LGTM, thx!

@robmry @akerouanton ptal

Copy link
Contributor

@robmry robmry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - just nits/questions ...

corhere added 2 commits July 9, 2025 12:53
It is easier to find all references when they are struct fields rather
than embedded structs.

Signed-off-by: Cory Snider <csnider@mirantis.com>
func (*driver) secMapWalk is a curious beast. It is named walk, yet it
also mutates the collection being iterated over. It returns an error,
but that error is always nil. It takes a callback that can break
iteration, yet the only caller makes no use of that affordance. Its
utility is limited and the abstraction hinders readability more than it
helps. Open-code the d.secMap.nodes loop into
func (*driver) updateKeys(), the only caller.

Signed-off-by: Cory Snider <csnider@mirantis.com>
@corhere corhere force-pushed the libn/overlay-fix-logical-races branch from f9daf69 to 89b6227 Compare July 9, 2025 18:10
There is a dedicated mutex for synchronizing access to the encrMap.
Separately, the main driver mutex is used for synchronizing access to
the encryption keys. Their use is sufficient to prevent data races (if
used correctly, which is not the case) but not logical race conditions.
Programming the encryption parameters for a peer can race with
encryption keys being updated, which could lead to inconsistencies
between the parameters programmed into the kernel and the desired state.

Introduce a new mutex for synchronizing encryption operations. Use that
mutex to synchronize access to both encrMap and keys. Handle encryption
key updates in a critical section so they can no longer be interleaved
with kernel programming of encryption parameters.

Signed-off-by: Cory Snider <csnider@mirantis.com>
The concurrency control in the overlay driver is logically unsound.
While the use of mutexes is sufficient to prevent data races --
violations of the Go memory model -- many operations which need to be
atomic are performed with unbounded concurrency.

Overhaul the use of locks in the overlay network driver. Implement sound
locking at the network granularity: operations may proceed concurrently
iff they are being applied to distinct networks. Push the responsibility
of locking up to the code which calls methods or accesses struct fields
to avoid deadlock situations like we had previously with
d.initSandboxPeerDB() and to make the code easier to reason about.

Each overlay network has a distinct peer db. The NetworkDB watch for the
overlay peer table for the network will only start after
(*driver).CreateNetwork returns and will be stopped before libnetwork
calls (*driver).DeleteNetwork, therefore the lifetime of the peer db for
a network is constrained to the lifetime of the network itself. Yet the
peer db for a network is tracked in a dedicated map, separately from the
network objects themselves. This has resulted in a parallel set of
mutexes to manage concurrency of the peer db distinct from the mutexes
for the driver and networks. Move the peer db for a network into a field
of the network struct and guard it from concurrent access using the
per-network lock. Move the methods for manipulating the peer db into the
network struct so that the methods can only be called if the caller has
a reference to the network object.

Network creation and deletion are synchronized using the driver-scope
mutex, but some of the kernel programming is performed outside of the
critical section. It is possible for network deletion to race with
recreating the network, interleaving the kernel programming for the
network creation and deletion, resulting in inconsistent kernel state.
Parallelize network creation and deletion soundly. Use a double-checked
locking scheme to soundly handle the case of concurrent CreateNetwork
and DeleteNetwork for the same network id without blocking operations
on other networks. Synchronize operations on a network so that
operations on the network such as adding a neighbor to the peer db are
performed atomically, not interleaved with deleting the network.

Signed-off-by: Cory Snider <csnider@mirantis.com>
@corhere corhere force-pushed the libn/overlay-fix-logical-races branch from 89b6227 to 89d3419 Compare July 9, 2025 18:13
Copy link
Contributor

@robmry robmry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you - still LGTM.

@akerouanton
Copy link
Member

I have a few pending comments but I need to take another look tomorrow.

Copy link
Member

@akerouanton akerouanton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beside my two comments, LGTM.

if etype == driverapi.Delete {
opname = "delete"
err = n.peerDelete(eid, addr, mac, vtep)
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create and Update are treated the same. Could you add a comment explaining why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the way it was before I started refactoring. 🤷
I only touched this part of the code to refactor the error handling. The control flow is the same as it was before.

if etype == driverapi.Delete {
d.peerDelete(nid, eid, addr, mac, vtep)
return
}
d.peerAdd(nid, eid, addr, mac, vtep)

Come to think of it, it is likely subtly incorrect to handle creates the same as updates. While the driver never performs an update operation on an overlay_peer_table entry (libnetwork doesn't afford a mechanism for drivers to do so) it could still be problematic if it is possible for the same endpoint ID to leave and rejoin a network, as the DELETE and re-CREATE could be coalesced into a single UPDATE if a peer receives the CREATE event first. This might be one of the root causes of #49908! I should fix that.

defer func() {
d.mu.Unlock()
if doPeerFlush {
d.peerFlush(nid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't flush anymore — is it unneeded in the end? Or replaced by something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been replaced by delete(d.networks, nid) as the peerdb for a network is now a field on the network struct, not a key in another map that needs to be cleaned up explicitly.

@corhere corhere merged commit 5de59ca into moby:master Jul 10, 2025
247 of 251 checks passed
@corhere corhere deleted the libn/overlay-fix-logical-races branch July 10, 2025 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/networking/d/overlay Networking area/networking Networking impact/changelog kind/bugfix PR's that fix bugs kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. kind/refactor PR's that refactor, or clean-up code process/cherry-pick/25.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants