Fix race condition in consensus.State #673
Conversation
consensus/reactor.go
Outdated
// We need to lock, as we are not entering consensus state from State's `handleMsg` or `handleTimeout`
conR.conS.mtx.Lock()
// We have no votes, so reconstruct LastCommit from SeenCommit
if state.LastBlockHeight > 0 {
	conR.conS.reconstructLastCommit(state)
}

// NOTE: The line below causes broadcastNewRoundStepRoutine() to broadcast a
// NewRoundStepMessage.
conR.conS.updateToState(state)
conR.conS.mtx.Unlock()
reconstructLastCommit and updateToState can panic, which would leave conS.mtx in a locked state, producing a deadlock for any concurrent lockers.
@08d2, are you deeply familiar with the SwitchToConsensus function and where it's called? If so, please point out where, exactly (which files/lines of code), you see conS.mtx being locked concurrently during this specific SwitchToConsensus call.
While 08d2 has a point (we should not panic while holding a lock), updateToState is also called while holding the same lock, as part of finalizeCommit.
By design, CometBFT never tries to recover from a panic (crash failures always being preferable to general Byzantine behaviour in both crash-tolerant and BFT algorithms), so, strictly speaking, this is a non-issue.
Nevertheless, I've changed the locking I had added to a RAII-style lock.
By design, CometBFT never tries to recover from a panic
cometbft/abci/server/socket_server.go
Lines 167 to 183 in dedfda4
Lines 304 to 320 in dedfda4
cometbft/p2p/conn/connection.go
Lines 424 to 425 in dedfda4
@08d2, are you deeply familiar with the SwitchToConsensus function and where it's called? If so, please point out where, exactly (which files/lines of code), you see conS.mtx being locked concurrently during this specific SwitchToConsensus call.
My mistake for qualifying the error as only for "concurrent callers". This was not accurate. If any caller, concurrent or otherwise, were to panic in the ways I described previously, the conS mutex would deadlock for all subsequent callers. Panics do not reliably terminate the process.
cason left a comment
This is a possible short-term solution for the race condition observed while switching to consensus. While it is straightforward, it is similar to the first solution proposed when this problem was discovered, and it may produce undesired and unforeseen consequences.
But I think we should trust Sergio's investigation and approve this solution, while investigating the whole locking and parallelism within the consensus reactor.
* Repro in e2e tests
* Change something in the code
* Fix race condition in `SwitchToConsensus`
* Revert "Repro in e2e tests" (this reverts commit 4f441f8)
* RAII lock
Closes #487
This PR contains changes in the test infra to repro #487, and the fix needed to avoid the race condition.
Similarly to #539, I have pushed the commits that make up this PR one by one, so that those interested can track how I repro'ed the race condition using e2e tests, and how the fix prevents the race from happening again. To do so, click on the little ❌ or ✅ next to the relevant commits.
validator05.

PR checklist

Changelog entry added in .changelog (we use unclog to manage our changelog)
Updated relevant documentation (docs/ or spec/) and code comments