Fix deadlock on block cache. #6412
Thread dump after node is stuck for future reference:
michaelsproul left a comment
LGTM, nice find.
I would love to get the deadlock detector running again
Looks like
@mergify queue
🛑 The pull request has been removed from the queue
CI is failing because the Windows runner was updated to 1.81 and we need:
@mergify requeue
✅ This pull request will be re-embarked automatically
✅ The pull request has been merged automatically at 46e0d66
* Fix deadlock on block cache.
Issue Addressed
I ran into another deadlock scenario today during testing.
On thread A, a read lock (1) for `block_cache` is acquired here:
lighthouse/beacon_node/beacon_chain/src/eth1_chain.rs, line 478 in 99e53b8
lighthouse/beacon_node/eth1/src/service.rs, lines 477 to 479 in 9b3b730
On thread B, a write lock (2) for `block_cache` is acquired in `prune_blocks`, so it waits for (1) to release the lock:
lighthouse/beacon_node/eth1/src/inner.rs, line 63 in c824142
On thread A, a read lock (3) for `block_cache` is acquired again. This one now waits on (2), but (1) may not have been released before this point, so we're in a deadlock:
lighthouse/beacon_node/beacon_chain/src/eth1_chain.rs, line 516 in 99e53b8
lighthouse/beacon_node/eth1/src/service.rs, lines 499 to 501 in 9b3b730
Strangely, this is in code that hasn't been touched for ages. Maybe it's more easily reproducible when resources are very constrained (I'm running a kurtosis devnet).
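For future readers, here is a minimal sketch of the (1), (2), (3) sequence above. It is not the Lighthouse code: `BlockCache` and `writer_blocked_while_reader_held` are hypothetical names, and it uses `std::sync::RwLock` with the non-blocking `try_write` so the example can observe the blocked writer without actually hanging. Whether a pending writer also blocks new readers depends on the lock implementation, so step (3) is only described in a comment.

```rust
use std::sync::{RwLock, TryLockError};

/// Hypothetical stand-in for the eth1 block cache; not the real Lighthouse type.
struct BlockCache {
    blocks: Vec<u64>,
}

/// Returns `true` if a writer cannot get the lock while read guard (1) is alive.
fn writer_blocked_while_reader_held(cache: &RwLock<BlockCache>) -> bool {
    // (1) "Thread A" takes a read lock and keeps the guard alive.
    let first_read = cache.read().unwrap();

    // (2) "Thread B" wants the write lock. It is simulated here with a
    // non-blocking attempt; a blocking `write()` would wait for (1).
    let blocked = matches!(cache.try_write(), Err(TryLockError::WouldBlock));

    // (3) In the scenario described above, thread A would now call `read()`
    // again. With a lock that queues new readers behind a waiting writer,
    // that call waits on (2), which waits on (1): a deadlock. The sketch
    // deliberately does not make that call.

    // Guard (1) is still in use here, so it is only released on return.
    let _len = first_read.blocks.len();
    blocked
}
```

Calling `writer_blocked_while_reader_held` on a fresh `RwLock<BlockCache>` reports that the writer was blocked, and the lock is free again once the function returns.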
Proposed Changes
Drop the `block_cache` read lock at (1) after use, before it gets acquired again.
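A rough sketch of the shape of this change, under the same assumptions as the description: `BlockCache` and `read_twice_without_overlap` are hypothetical names, and `std::sync::RwLock` stands in for the lock type used in the codebase. The first guard lives in its own scope, so it is released before the second read is requested and a writer can run in between.

```rust
use std::sync::RwLock;

/// Hypothetical stand-in for the eth1 block cache; not the real Lighthouse type.
struct BlockCache {
    blocks: Vec<u64>,
}

/// Reads the cache twice without ever holding two read guards at once.
/// Returns (block count, latest block, whether a writer could run in between).
fn read_twice_without_overlap(cache: &RwLock<BlockCache>) -> (usize, Option<u64>, bool) {
    // (1) Copy out what is needed; the guard is dropped at the closing brace.
    let count = {
        let guard = cache.read().unwrap();
        guard.blocks.len()
    };

    // (2) A writer (for example a pruning task) can now take the lock.
    // The temporary write guard is released at the end of this statement.
    let writer_could_run = cache.try_write().is_ok();

    // (3) Re-acquire the read lock; nothing from (1) is still held.
    let latest = cache.read().unwrap().blocks.last().copied();

    (count, latest, writer_could_run)
}
```

An explicit `drop(guard)` after the last use would have the same effect as the inner scope; the scope just makes the guard's lifetime visible in the code layout.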