Fix race condition in consensus.State code by sergio-mena · Pull Request #673 · cometbft/cometbft

sergio-mena · 2023-04-07T20:51:22Z

Closes #487

This PR contains changes in the test infra to repro #487, and the fix needed to avoid the race condition.
Similarly to #539, I have pushed the commits that make up this PR one by one, so that those interested can track how I repro'ed the race condition using e2e tests, and how the fix prevents the race from happening again. To do so, click on the little ❌ or ✅ next to the relevant commits.

Commit#1: Add race option to e2e tests for builtin app. We also activate "halt on error", so e2e tests are able to catch this (otherwise it just shows up in the logs, but the tests pass). e2e check detected that this commit has no changes in the node's code, so it did not build and run the tests; hence everything is still passing in this commit.
Commit#2: Added an innocuous change in the concerned code to trigger e2e tests. With this commit, the e2e tests are failing. If the reader examines the output of the e2e tests, they will see the data race in validator05.
Commit#3: The fix. The e2e tests are now passing. It is worth noting that, when I ran the e2e tests locally on Commit#1, without "halt on error", I hit the race condition around 120 times across all nodes in a single run.
Commit#4: Revert Commit#1. We're not ready to leave Commit#1 in place because we know there are other race conditions lurking.
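
For readers unfamiliar with what a race-enabled build reports, here is a minimal sketch of the pattern involved: several goroutines touching one piece of state, serialized by a mutex. The `roundState` type and the `run` helper are hypothetical and are not CometBFT code. If the `Lock`/`Unlock` pair in `advance` were removed, building or running the program with Go's `-race` flag would be expected to report a data race on `height`.

```go
package main

import (
	"fmt"
	"sync"
)

// roundState is a hypothetical stand-in for shared consensus state.
type roundState struct {
	mtx    sync.Mutex
	height int64
}

// advance increments the height while holding the mutex.
func (rs *roundState) advance() {
	rs.mtx.Lock()
	defer rs.mtx.Unlock()
	rs.height++
}

// currentHeight reads the height while holding the mutex.
func (rs *roundState) currentHeight() int64 {
	rs.mtx.Lock()
	defer rs.mtx.Unlock()
	return rs.height
}

// run starts `writers` goroutines that each call advance `steps` times,
// waits for all of them, and returns the final height.
func run(writers, steps int) int64 {
	rs := &roundState{}
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < steps; j++ {
				rs.advance()
			}
		}()
	}
	wg.Wait()
	return rs.currentHeight()
}

func main() {
	// With the mutex in place, no increment is lost.
	fmt.Println(run(8, 1000)) // prints 8000
}
```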

PR checklist

Tests written/updated
Changelog entry added in .changelog (we use unclog to manage our changelog)
Updated relevant documentation (docs/ or spec/) and code comments

This reverts commit 4f441f8.

08d2 · 2023-04-09T07:12:06Z

consensus/reactor.go

+	// We need to lock, as we are not entering consensus state from State's `handleMsg` or `handleTimeout`
+	conR.conS.mtx.Lock()
+	// We have no votes, so reconstruct LastCommit from SeenCommit
 	if state.LastBlockHeight > 0 {
 		conR.conS.reconstructLastCommit(state)
 	}

 	// NOTE: The line below causes broadcastNewRoundStepRoutine() to broadcast a
 	// NewRoundStepMessage.
 	conR.conS.updateToState(state)
+	conR.conS.mtx.Unlock()


reconstructLastCommit and updateToState can panic, which will leave conS.mtx in a locked state, producing deadlock for any concurrent lockers.

@08d2, are you deeply familiar with the SwitchToConsensus function and where it's called? If so, please point out where, exactly (which files/lines of code), you see conS.mtx being locked concurrently during this specific SwitchToConsensus call.

While 08d2 has a point (we should not panic while holding a lock), updateToState is also called while holding the same lock, as part of finalizeCommit.

By design, CometBFT never tries to recover from a panic (crash failures always being preferable to general byzantine behaviour in both crash-tolerant and BFT algorithms), so, strictly speaking, this is a non-issue.
Nevertheless, I've changed the locking I had added to a RAII-style lock.
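
In Go, a RAII-style lock usually means deferring the unlock right after taking the lock. The sketch below illustrates that pattern under stated assumptions: `toyState`, its `height` field, and the `panicked` helper are hypothetical, and the method bodies are simplified stand-ins, not the code merged in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// toyState is a hypothetical stand-in for consensus.State; only the mutex
// and a height field are modelled.
type toyState struct {
	mtx    sync.Mutex
	height int64
}

// updateToState mimics a state transition that may panic on bad input.
func (s *toyState) updateToState(height int64) {
	if height < s.height {
		panic("height went backwards")
	}
	s.height = height
}

// switchToConsensus holds the lock for the whole transition. The deferred
// Unlock runs on normal return and also while a panic unwinds the stack.
func (s *toyState) switchToConsensus(height int64) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.updateToState(height)
}

// panicked reports whether f panicked, recovering if it did.
func panicked(f func()) (p bool) {
	defer func() {
		if recover() != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	s := &toyState{height: 10}
	fmt.Println("panicked:", panicked(func() { s.switchToConsensus(5) }))
	// TryLock (Go 1.18+) succeeds only if the mutex was released.
	free := s.mtx.TryLock()
	fmt.Println("mutex free after panic:", free)
	if free {
		s.mtx.Unlock()
	}
}
```

With the deferred unlock, the mutex is released even when the transition panics, so a later locker is not blocked by this call.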

By design, CometBFT never tries to recover from a panic

cometbft/abci/server/socket_server.go

Lines 167 to 183 in dedfda4

defer func() {
	// make sure to recover from any app-related panics to allow proper socket cleanup.
	// In the case of a panic, we do not notify the client by passing an exception so
	// presume that the client is still running and retying to connect
	r := recover()
	if r != nil {
		const size = 64 << 10
		buf := make([]byte, size)
		buf = buf[:runtime.Stack(buf, false)]
		err := fmt.Errorf("recovered from panic: %v\n%s", r, buf)
		if !s.isLoggerSet {
			fmt.Fprintln(os.Stderr, err)
		}
		closeConn <- err
		s.appMtx.Unlock()
	}
}()

cometbft/p2p/transport.go

Lines 304 to 320 in dedfda4

go func(c net.Conn) {
	defer func() {
		if r := recover(); r != nil {
			err := ErrRejected{
				conn:          c,
				err:           fmt.Errorf("recovered from panic: %v", r),
				isAuthFailure: true,
			}
			select {
			case mt.acceptc <- accept{err: err}:
			case <-mt.closec:
				// Give up if the transport was closed.
				_ = c.Close()
				return
			}
		}
	}()

cometbft/p2p/conn/connection.go

Lines 424 to 425 in dedfda4

func (c *MConnection) sendRoutine() {
	defer c._recover()

@08d2, are you deeply familiar with the SwitchToConsensus function and where it's called? If so, please point out where, exactly (which files/lines of code), you see conS.mtx being locked concurrently during this specific SwitchToConsensus call.

My mistake for qualifying the error as only for "concurrent callers". This was not accurate. If any caller, concurrent or otherwise, were to panic in the ways I described previously, the conS mutex would deadlock for all subsequent callers. Panics do not reliably terminate the process.
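
The failure mode described here can be sketched as follows. The names `guarded`, `setExplicit`, and `callRecovered` are hypothetical and exist only for illustration. A function unlocks by hand, a panic skips the unlock, and a `recover` further up the stack keeps the process alive with the mutex still held.

```go
package main

import (
	"fmt"
	"sync"
)

// guarded is a hypothetical value protected by a mutex.
type guarded struct {
	mtx   sync.Mutex
	value int
}

// setExplicit unlocks by hand, so a panic between Lock and Unlock
// skips the Unlock entirely.
func (g *guarded) setExplicit(v int) {
	g.mtx.Lock()
	if v < 0 {
		panic("negative value")
	}
	g.value = v
	g.mtx.Unlock()
}

// callRecovered plays the role of an upstream handler that swallows
// panics, so the process keeps running after f panics.
func callRecovered(f func()) (recovered any) {
	defer func() { recovered = recover() }()
	f()
	return nil
}

func main() {
	g := &guarded{}
	r := callRecovered(func() { g.setExplicit(-1) })
	fmt.Println("recovered:", r)
	// TryLock (Go 1.18+) fails because nobody released the mutex;
	// a plain Lock here would block forever.
	fmt.Println("mutex still held:", !g.mtx.TryLock())
}
```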

cason

This is a possible short-term solution for the race condition observed while switching to consensus. While it is straightforward and similar to the first solution proposed when this problem was discovered, it may produce undesired and unforeseen consequences.

But I think we should trust Sergio's investigation and approve this solution, while investigating the whole locking and parallelism within the consensus reactor.

* Repro in e2e tests
* Change something in the code
* Fix race condition in `SwitchToConsensus`
* Revert "Repro in e2e tests"
  This reverts commit 4f441f8.
* RAII lock
(cherry picked from commit 6a96eca)

Repro in e2e tests

4f441f8

sergio-mena requested a review from a team as a code owner April 7, 2023 20:51

sergio-mena added the bug label Apr 7, 2023

sergio-mena self-assigned this Apr 7, 2023

sergio-mena added 3 commits April 7, 2023 22:58

Change something in the code

1fd106f

Fix race condition in SwitchToConsensus

1093aa4

Revert "Repro in e2e tests"

d7f108a

This reverts commit 4f441f8.

sergio-mena changed the title ~~Fix race condition in Consensus code~~ Fix race condition in consensus.State code Apr 7, 2023

08d2 reviewed Apr 9, 2023

cason approved these changes Apr 11, 2023

sergio-mena added 2 commits April 11, 2023 10:31

RAII lock

3ddc6d3

Merge branch 'main' into sergio/487-repro-fix

944e010

thanethomson approved these changes Apr 11, 2023

sergio-mena merged commit 6a96eca into main Apr 11, 2023

sergio-mena deleted the sergio/487-repro-fix branch April 11, 2023 10:01

sergio-mena added the backport-to-v0.34.x and backport-to-v0.38.x labels Apr 11, 2023

This was referenced Apr 11, 2023

Fix race condition in consensus.State code (backport #673) #689

Merged

Fix race condition in consensus.State code (backport #673) #690

Merged

Fix race condition in consensus.State code (backport #673) #691

Merged

sergio-mena mentioned this pull request Apr 11, 2023

Fix race condition in gossipVotesRoutine #692

Merged

3 tasks

faddat mentioned this pull request Jan 5, 2024

faddat/reversions for fixed length ComposableFi/cometbft#2

Closed

3 tasks

faddat mentioned this pull request Oct 15, 2024

chore: use latest cometbft-db in v0.38.x #4296

Closed

3 tasks

sergio-mena merged 6 commits into main from sergio/487-repro-fix

sergio-mena commented Apr 7, 2023 (edited)
08d2 Apr 9, 2023
thanethomson Apr 9, 2023
cason Apr 11, 2023
sergio-mena Apr 11, 2023
08d2 Apr 12, 2023
08d2 Apr 12, 2023 (edited)
cason left a comment

4 participants
