libnetwork/networkdb: fix broadcast queue deadlocks#50344
libnetwork/networkdb: fix broadcast queue deadlocks#50344akerouanton merged 1 commit intomoby:masterfrom
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR addresses deadlocks in the NetworkDB broadcast queues by replacing mutex-protected node-count lookups with atomic counters.
- Introduces
estNodesand per-networknetworkNodesatomic counters to avoid lock inversion inNumNodescallbacks. - Updates
JoinNetwork,addNetworkNode, anddeleteNetworkNodeto initialize and maintain these atomic counts. - Replaces inline
NumNodesfunctions inclusterInitwith the newestNumNodescallback.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| libnetwork/networkdb/nodemgmt.go | Store overall node count in estNodes on state changes and add estNumNodes |
| libnetwork/networkdb/networkdb.go | Add atomic counters, wire up NumNodes callbacks, and update counters on join/add/delete |
| libnetwork/networkdb/cluster.go | Use estNumNodes for broadcast queues instead of locking callbacks |
Comments suppressed due to low confidence (1)
libnetwork/networkdb/networkdb.go:61
- [nitpick] New atomic counters for node counts introduce behavior that lacks dedicated tests; consider adding unit tests to verify
estNumNodes()andnetworkNodes.Load()track additions and removals correctly, including under concurrent conditions.
estNodes atomic.Int32
NetworkDB's JoinNetwork function enqueues a message onto a TransmitLimitedQueue while holding the NetworkDB mutex locked for writing. The TransmitLimitedQueue has its own synchronization; it locks its mutex when enqueueing a message. Locking order: 1. (NetworkDB).RWMutex.Lock() 2. (TransmitLimitedQueue).mu.Lock() NetworkDB's gossip periodic task calls GetBroadcasts on the same TransmitLimitedQueue to retrieve the enqueued messages. GetBroadcasts invokes the queue's NumNodes callback while the mutex is locked. The NumNodes callback function that NetworkDB sets locks the NetworkDB mutex for reading to take the length of the nodes map. Locking order: 1. (TransmitLimitedQueue).mu.Lock() 2. (NetworkDB).RWMutex.RLock() If one goroutine calls GetBroadcasts on the queue concurrently with another goroutine calling JoinNetwork on the NetworkDB, the goroutines may deadlock due to the lock inversion. Fix the deadlock by caching the number of nodes in an atomic variable so that the NumNodes callback can load the value without blocking or violating Go's memory model. And fix a similar deadlock situation with the table-event broadcast queues. Signed-off-by: Cory Snider <csnider@mirantis.com>
0d29576 to
08bde5e
Compare
|
Seeing this fail a couple of times in the BuildKit tests; could be something else that failed (output of BuildKit CI is quite verbose), or perhaps just timing/general issue? https://github.com/moby/moby/actions/runs/16126382970/job/45539201296?pr=50344 |
|
@corhere Oh sorry, I didn't pay attention to the 'Maintainer spotlight' status before merging 🙈 Is this something you still want to discuss today? |
|
@akerouanton no worries! I have been adding these to the spotlight project just to get timely reviews |
- What I did
Fixed deadlocks in NetworkDB introduced by recent changes.
(The cherry-pick labels are only so we don't forget to pick this one when we pick #50193)
- How I did it
NetworkDB's JoinNetwork function enqueues a message onto a TransmitLimitedQueue while holding the NetworkDB mutex locked for writing. The TransmitLimitedQueue has its own synchronization; it locks its mutex when enqueueing a message. Locking order:
NetworkDB's gossip periodic task calls GetBroadcasts on the same TransmitLimitedQueue to retrieve the enqueued messages. GetBroadcasts invokes the queue's NumNodes callback while the mutex is locked. The NumNodes callback function that NetworkDB sets locks the NetworkDB mutex for reading to take the length of the nodes map. Locking order:
If one goroutine calls GetBroadcasts on the queue concurrently with another goroutine calling JoinNetwork on the NetworkDB, the goroutines may deadlock due to the lock inversion.
Fix the deadlock by caching the number of nodes in an atomic variable so that the NumNodes callback can load the value without blocking or violating Go's memory model. And fix a similar deadlock situation with the table-event broadcast queues.
- How to verify it
Stress-test NetworkDB using a property-based test (#50345)
- Human readable description for the release notes
- A picture of a cute animal (not mandatory but encouraged)