mempool: Fix concurrency bug on variable recheckCursor#1116
Conversation
```go
// serial (ie. by abci responses which are called in serial).
recheckCursor *clist.CElement // next expected response
recheckEnd    *clist.CElement // re-checking stops here
recheckMtx    cmtsync.Mutex
```
How do we reason about the interaction of this mutex with updateMtx, as well as the client mutex? Is there perhaps a relatively easy way of visualizing it, given the complexity of the callback system?
Yeah, it's not easy to reason about all the mutexes, and I was not happy to add a new one. At least we know we plan to eliminate the variables recheckCursor and recheckEnd in #895, which will also remove the new recheckMtx mutex.
In the meantime, below is a list of all the places where these mutexes are locked. I still have to find a good way to visualize their interactions. After a first analysis, I can say that updateMtx and recheckMtx can both be held simultaneously for a brief time in the recheckTxs function, which is called at the end of Update. This can make globalCb wait until recheckMtx is released, but it doesn't lead to a deadlock. CheckTx is already prevented from executing during Update, because Update holds updateMtx. And I don't see how localClient's mutex can interfere with the other two.
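To illustrate why the brief double-locking in recheckTxs is deadlock-free, here is a minimal Go sketch. This is not the actual CometBFT code; the names just mirror the discussion. Since recheckMtx is only ever acquired after updateMtx (never the other way around), there is no lock-order cycle:

```go
package main

import (
	"fmt"
	"sync"
)

// Minimal sketch of the locking pattern under discussion, NOT the real
// CListMempool. recheckMtx is only ever taken while updateMtx is already
// held (in recheckTxs) or on its own (in globalCb), so no cycle exists.
type mempoolSketch struct {
	updateMtx     sync.RWMutex
	recheckMtx    sync.Mutex
	recheckCursor int // stands in for the *clist.CElement cursor
}

// update models Update: it holds updateMtx for its whole duration and,
// at the end, calls recheckTxs, which briefly takes recheckMtx too.
func (m *mempoolSketch) update() {
	m.updateMtx.Lock()
	defer m.updateMtx.Unlock()
	m.recheckTxs()
}

func (m *mempoolSketch) recheckTxs() {
	m.recheckMtx.Lock() // both mutexes are held here, briefly
	m.recheckCursor = 1
	m.recheckMtx.Unlock()
}

// globalCb models the ABCI response callback: it takes only recheckMtx,
// so at worst it waits until recheckTxs releases it.
func (m *mempoolSketch) globalCb() int {
	m.recheckMtx.Lock()
	defer m.recheckMtx.Unlock()
	return m.recheckCursor
}

func main() {
	m := &mempoolSketch{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); m.update() }()
		go func() { defer wg.Done(); m.globalCb() }()
	}
	wg.Wait()
	fmt.Println("no deadlock:", m.globalCb())
}
```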
`CListMempool.updateMtx`

`updateMtx` is used by the following `CListMempool` methods:

- `CheckTx`: `updateMtx` is locked during the whole execution
  - calls `CheckTxAsync`, which locks `localClient`'s mutex
  - then calls `ReqRes.SetCallback`, which locks its own mutex
- `Lock` and `Unlock`: these methods are only used to lock `updateMtx` during the whole execution of `state.execution.Commit`, which calls, in order: `mempool.FlushAppConn` and `mempool.Update` (see below)

`updateMtx` is also locked during the whole execution of:

- `ReapMaxBytesMaxGas`
- `ReapMaxTxs`
- `Flush`
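As an illustration of the `Lock`/`Unlock` pattern above, here is a rough Go sketch of the consensus side holding `updateMtx` across the whole of `Commit`. The names (`commit`, `FlushAppConn`, `Update`) are simplified stand-ins for the real code paths, not the actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-in for CListMempool: Lock/Unlock expose updateMtx to
// the consensus side, which holds it across FlushAppConn and Update.
type mempool struct {
	updateMtx sync.RWMutex
	txs       []string
}

func (m *mempool) Lock()   { m.updateMtx.Lock() }
func (m *mempool) Unlock() { m.updateMtx.Unlock() }

func (m *mempool) FlushAppConn() error { return nil }

// Update drops the committed tx; updateMtx is already held by the caller.
func (m *mempool) Update(committed string) {
	kept := m.txs[:0]
	for _, tx := range m.txs {
		if tx != committed {
			kept = append(kept, tx)
		}
	}
	m.txs = kept
}

// commit mirrors how state.execution.Commit interacts with the mempool:
// take the lock once, then run both steps under it.
func commit(m *mempool, tx string) {
	m.Lock()
	defer m.Unlock()
	_ = m.FlushAppConn()
	m.Update(tx)
}

func main() {
	m := &mempool{txs: []string{"a", "b", "c"}}
	commit(m, "b")
	fmt.Println(m.txs)
}
```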
`CListMempool.recheckMtx`

`recheckMtx` is locked during the whole execution of:

- `initRecheckCursors`, called at the beginning of `recheckTxs`, which is called at the end of `Update`
- `resCbRecheck`, when processing `CheckTx` responses of type `Recheck`
- `recheckCursorIsNil`, called:
  - at the beginning of `globalCb`
  - at the beginning of the callback function returned by `reqResCb`, though it's actually not really needed there
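To make the three call sites above concrete, here is an illustrative-only Go sketch of cursor fields guarded by a `recheckMtx`-style mutex. The element type and field handling are simplified stand-ins, not the real clist-based implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// elem is a simplified stand-in for *clist.CElement.
type elem struct{ next *elem }

type recheckState struct {
	mtx           sync.Mutex // stands in for recheckMtx
	recheckCursor *elem
	recheckEnd    *elem
}

// initRecheckCursors: called at the beginning of recheckTxs.
func (r *recheckState) initRecheckCursors(front, back *elem) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	r.recheckCursor, r.recheckEnd = front, back
}

// resCbRecheck: advances the cursor when a Recheck response arrives.
func (r *recheckState) resCbRecheck() {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	if r.recheckCursor == nil {
		return
	}
	if r.recheckCursor == r.recheckEnd {
		r.recheckCursor = nil // rechecking is done
	} else {
		r.recheckCursor = r.recheckCursor.next
	}
}

// recheckCursorIsNil: the guard checked at the beginning of globalCb.
func (r *recheckState) recheckCursorIsNil() bool {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	return r.recheckCursor == nil
}

func main() {
	a, b := &elem{}, &elem{}
	a.next = b
	r := &recheckState{}
	r.initRecheckCursors(a, b)
	fmt.Println(r.recheckCursorIsNil()) // cursor set: false
	r.resCbRecheck()                    // a -> b
	r.resCbRecheck()                    // b == end -> nil
	fmt.Println(r.recheckCursorIsNil()) // done: true
}
```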
`localClient`'s mtx

There are two methods locking this mutex that are relevant for the mempool:

- `SetResponseCallback`, used to set `CListMempool.globalCb` in `CListMempool`'s constructor
- `CheckTxAsync`, which is called by:
  - `mempool.CheckTx`
  - `mempool.recheckTxs` (after calling `initRecheckCursors`)
I am not sure I understand why a new mutex is actually needed. Would it be possible, please, to have an example of a bad interleaving? As far as I know, asynchronous checks are executed one at a time, whatever the client is, so this should be fine.
If additional concurrency control is required, I would opt for using the already existing CListMempool.updateMtx instead of adding a new one. The rationale is that re-checks do change the fields of the mempool; hence, they should be guarded by the same mutex as other mempool methods. As discussed here, this requires making the mutex reentrant.
I have tried with only updateMtx, but it doesn't work. Btw, the recheck logic only changes the mempool via RemoveTxByKey, which is currently not guarded by any mutex. RemoveTxByKey is called by Update, which holds updateMtx, and by resCbRecheck, which doesn't lock anything. We could use updateMtx to guard just RemoveTxByKey only if it is reentrant, as you said.
I added a new test, TestMempoolRecheckRace, that shows the problem more clearly. The test:
- adds a bunch of transactions to the mempool
- updates one transaction to force rechecking the rest
- adds one more transaction → this fails with a data race on the variable `recheckCursor`, which is accessed from both `reqResCb` and `recheckTxs`.
(This new test is actually a simplified version of TestMempoolNoCacheOverflow, so it's a bit redundant and we could merge them.)
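A hypothetical distillation of the race in Go (simplified names; the real test exercises the mempool itself): two goroutines touch a shared cursor, one resetting it (as recheckTxs does) and one advancing it (as the reqResCb callback does). With recheckMtx-style locking around every access, `go test -race` has nothing to report:

```go
package main

import (
	"fmt"
	"sync"
)

// cursor stands in for recheckCursor. Every access goes through the mutex;
// removing the Lock/Unlock pairs reintroduces the data race that
// TestMempoolRecheckRace exposes under the race detector.
type cursor struct {
	mtx sync.Mutex
	pos int
}

func (c *cursor) reset()   { c.mtx.Lock(); c.pos = 0; c.mtx.Unlock() }
func (c *cursor) advance() { c.mtx.Lock(); c.pos++; c.mtx.Unlock() }
func (c *cursor) get() int { c.mtx.Lock(); defer c.mtx.Unlock(); return c.pos }

func main() {
	c := &cursor{}
	var wg sync.WaitGroup
	wg.Add(2)
	// Models the reqResCb side: advances the cursor per response.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			c.advance()
		}
	}()
	// Models the recheckTxs side: resets the cursor.
	go func() {
		defer wg.Done()
		for i := 0; i < 10; i++ {
			c.reset()
		}
	}()
	wg.Wait()
	fmt.Println("final position read safely:", c.get() >= 0)
}
```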
Co-authored-by: Thane Thomson <connect@thanethomson.com>
cason
left a comment
I don't like using a single PR to address two problems.
If there is a concurrency problem, a test should be added to catch the bug; then a fix should be added to solve it. That is a PR by itself.
otrack
left a comment
I have a couple of concerns regarding these changes. First, as discussed in the team meeting, it would be nice to have some feedback from the community before we enforce any invariant between cacheSize and mempoolSize. Second, if we need to handle concurrent re-checks, I suggest using the mempool mutex instead of adding a new one.
I reverted all the changes related to testing to focus the discussion on the concurrency problem with the recheck variables. Fixing the mempool capacity for testing is left for a separate issue: #1144.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Closing as there is no agreement on the proposed solution. The bug is now reported in #1827.
There is a data race on the variable `recheckCursor` in `CListMempool`, as demonstrated by the new test `TestMempoolRecheckRace`. This problem only arises with the socket connection to the application.

The proposed solution to the concurrency issue is to add a new mutex, `recheckMtx`, for accessing the variables `recheckCursor` and `recheckEnd`. (Note that we want to eventually remove these variables in #895, so the mutex will be removed too.)

PR checklist
- Changelog entry added in `.changelog` (we use unclog to manage our changelog)
- Updated relevant documentation (`docs/` or `spec/`) and code comments