Skip to content

Fix race condition in consensus.State code#673

Merged
sergio-mena merged 6 commits intomainfrom
sergio/487-repro-fix
Apr 11, 2023
Merged

Fix race condition in consensus.State code#673
sergio-mena merged 6 commits intomainfrom
sergio/487-repro-fix

Conversation

@sergio-mena
Copy link
Copy Markdown
Collaborator

@sergio-mena sergio-mena commented Apr 7, 2023

Closes #487

This PR contains changes in the test infra to repro #487, and the the fix needed to avoid the race condition.
Similarly to #539, I have pushed the commits that make up this PR one by one, so that those interested can track how I repro'ed the race condition using e2e tests, and how the fix prevents the race from happening again. To do so, click on the little ❌ or ✅ next to the relevant commits.

  • Commit#1: Add race option to e2e tests for builtin app. We also activate "halt on error", so e2e tests are able to catch this (otherwise it just shows up in the logs, but the tests pass). e2e check detected that this commit has no changes in the node's code, so it did not build and run the tests; hence everything is still passing in this commit.
  • Commit#2: Added an innocuous change in the concerned code to trigger e2e tests. With this commit, e2e are failing. If the reader examines the output of the e2e tests, they will see the data race in validator05.
  • Commit#3: The fix. e2e test are now passing. It is worth noting that, when I ran the e2e tests locally on Commit#1, without "halt on error", I hit the race condition around 120 times across all nodes in one single run.
  • Commit#4: Revert Commit#1. We're not ready to leave Commit#1 in place because we know there are other race conditions lurking.

PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments

@sergio-mena sergio-mena requested a review from a team as a code owner April 7, 2023 20:51
@sergio-mena sergio-mena added the bug Something isn't working label Apr 7, 2023
@sergio-mena sergio-mena self-assigned this Apr 7, 2023
@sergio-mena sergio-mena changed the title Fix race condition in Consensus code Fix race condition in consensus.State code Apr 7, 2023
Comment on lines +110 to +120
// We need to lock, as we are not entering consensus state from State's `handleMsg` or `handleTimeout`
conR.conS.mtx.Lock()
// We have no votes, so reconstruct LastCommit from SeenCommit
if state.LastBlockHeight > 0 {
conR.conS.reconstructLastCommit(state)
}

// NOTE: The line below causes broadcastNewRoundStepRoutine() to broadcast a
// NewRoundStepMessage.
conR.conS.updateToState(state)
conR.conS.mtx.Unlock()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reconstructLastCommit and updateToState can panic, which will leave conS.mtx in a locked state, producing deadlock for any concurrent lockers.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@08d2, are you deeply familiar with the SwitchToConsensus function and where it's called? If so, please point out where, exactly (which files/lines of code), you see conS.mtx being locked concurrently during this specific SwitchToConsensus call.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While 08d2 has a point (we should not panic while holding a lock), updateToState is also called while holding the same lock, as part of finalizeCommit.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By design, CometBFT never tries to recover from a panic (crash failures always being preferable to general byzantine behaviour in both crash-tolerant and BFT algorithms), so, strictly speaking, this is a non-issue.
Nevertheless, I've changed the locking I had added to a RAII-style

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By design, CometBFT never tries to recover from a panic

defer func() {
// make sure to recover from any app-related panics to allow proper socket cleanup.
// In the case of a panic, we do not notify the client by passing an exception so
// presume that the client is still running and retying to connect
r := recover()
if r != nil {
const size = 64 << 10
buf := make([]byte, size)
buf = buf[:runtime.Stack(buf, false)]
err := fmt.Errorf("recovered from panic: %v\n%s", r, buf)
if !s.isLoggerSet {
fmt.Fprintln(os.Stderr, err)
}
closeConn <- err
s.appMtx.Unlock()
}
}()

cometbft/p2p/transport.go

Lines 304 to 320 in dedfda4

go func(c net.Conn) {
defer func() {
if r := recover(); r != nil {
err := ErrRejected{
conn: c,
err: fmt.Errorf("recovered from panic: %v", r),
isAuthFailure: true,
}
select {
case mt.acceptc <- accept{err: err}:
case <-mt.closec:
// Give up if the transport was closed.
_ = c.Close()
return
}
}
}()

func (c *MConnection) sendRoutine() {
defer c._recover()

Copy link
Copy Markdown

@08d2 08d2 Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@08d2, are you deeply familiar with the SwitchToConsensus function and where it's called? If so, please point out where, exactly (which files/lines of code), you see conS.mtx being locked concurrently during this specific SwitchToConsensus call.

My mistake for qualifying the error as only for "concurrent callers". This was not accurate. If any caller, concurrent or otherwise, were to panic in the ways I described previously, the conS mutex would deadlock for all subsequent callers. Panics do not reliably terminate the process.

Copy link
Copy Markdown

@cason cason left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a short-term possible solution for the race condition observed while switching to consensus. While it is straightforward, it is similar to the first solution proposed when this problem was discovered, it may produce undesired and unforeseen consequences.

But I think we should trust Sergio's investigation and approve this solution, while investigating the whole locking and parallelism within the consensus reactor.

@sergio-mena sergio-mena merged commit 6a96eca into main Apr 11, 2023
@sergio-mena sergio-mena deleted the sergio/487-repro-fix branch April 11, 2023 10:01
@sergio-mena sergio-mena added backport-to-v0.34.x backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x labels Apr 11, 2023
mergify bot pushed a commit that referenced this pull request Apr 11, 2023
* Repro in e2e tests

* Change something in the code

* Fix race condition in `SwitchToConsensus`

* Revert "Repro in e2e tests"

This reverts commit 4f441f8.

* RAII lock

(cherry picked from commit 6a96eca)
mergify bot pushed a commit that referenced this pull request Apr 11, 2023
* Repro in e2e tests

* Change something in the code

* Fix race condition in `SwitchToConsensus`

* Revert "Repro in e2e tests"

This reverts commit 4f441f8.

* RAII lock

(cherry picked from commit 6a96eca)
mergify bot pushed a commit that referenced this pull request Apr 11, 2023
* Repro in e2e tests

* Change something in the code

* Fix race condition in `SwitchToConsensus`

* Revert "Repro in e2e tests"

This reverts commit 4f441f8.

* RAII lock

(cherry picked from commit 6a96eca)
sergio-mena added a commit that referenced this pull request Apr 11, 2023
* Repro in e2e tests

* Change something in the code

* Fix race condition in `SwitchToConsensus`

* Revert "Repro in e2e tests"

This reverts commit 4f441f8.

* RAII lock

(cherry picked from commit 6a96eca)

Co-authored-by: Sergio Mena <sergio@informal.systems>
sergio-mena added a commit that referenced this pull request Apr 11, 2023
* Repro in e2e tests

* Change something in the code

* Fix race condition in `SwitchToConsensus`

* Revert "Repro in e2e tests"

This reverts commit 4f441f8.

* RAII lock

(cherry picked from commit 6a96eca)

Co-authored-by: Sergio Mena <sergio@informal.systems>
sergio-mena added a commit that referenced this pull request Apr 11, 2023
* Repro in e2e tests

* Change something in the code

* Fix race condition in `SwitchToConsensus`

* Revert "Repro in e2e tests"

This reverts commit 4f441f8.

* RAII lock

(cherry picked from commit 6a96eca)

Co-authored-by: Sergio Mena <sergio@informal.systems>
thanethomson pushed a commit that referenced this pull request May 6, 2023
* Repro in e2e tests

* Change something in the code

* Fix race condition in `SwitchToConsensus`

* Revert "Repro in e2e tests"

This reverts commit 4f441f8.

* RAII lock
roy-dydx pushed a commit to dydxprotocol/cometbft that referenced this pull request Jul 11, 2023
* Repro in e2e tests

* Change something in the code

* Fix race condition in `SwitchToConsensus`

* Revert "Repro in e2e tests"

This reverts commit 4f441f8.

* RAII lock

(cherry picked from commit 6a96eca)

Co-authored-by: Sergio Mena <sergio@informal.systems>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x bug Something isn't working

Projects

No open projects
Status: Done

Development

Successfully merging this pull request may close these issues.

Intermittent data race in consensus.State during startup

4 participants