mempool: Fix concurrency bug on variable recheckCursor#1116
Conversation
```go
// serial (ie. by abci responses which are called in serial).
recheckCursor *clist.CElement // next expected response
recheckEnd    *clist.CElement // re-checking stops here
recheckMtx    cmtsync.Mutex
```
How do we reason about the interaction of this mutex with updateMtx, as well as the client mutex? Is there perhaps a relatively easy way of visualizing it, given the complexity of the callback system?
Yeah, it's not easy to reason about all the mutexes, and I was not happy to add a new one. At least we know we plan to eliminate the variables recheckCursor and recheckEnd in #895, which will also remove the new recheckMtx mutex.
In the meantime, below is a list of all the places where these mutexes are locked. I still have to find a good way to visualize their interactions. After a first analysis, I can say that updateMtx and recheckMtx can both be held simultaneously for a brief time in the recheckTxs function, which is called at the end of Update. This can make globalCb wait until recheckMtx is released, but it doesn't lead to a deadlock. CheckTx is already prevented from executing during Update, because Update holds updateMtx. And I don't see how localClient's mutex can interfere with the other two.
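To illustrate why the brief double-locking in recheckTxs is deadlock-free, here is a minimal Go sketch. This is not the actual CometBFT code; the names just mirror the discussion. Since recheckMtx is only ever acquired after updateMtx (never the other way around), there is no lock-order cycle:

```go
package main

import (
	"fmt"
	"sync"
)

// Minimal sketch of the locking pattern under discussion, NOT the real
// CListMempool. recheckMtx is only ever taken while updateMtx is already
// held (in recheckTxs) or on its own (in globalCb), so no cycle exists.
type mempoolSketch struct {
	updateMtx     sync.RWMutex
	recheckMtx    sync.Mutex
	recheckCursor int // stands in for the *clist.CElement cursor
}

// update models Update: it holds updateMtx for its whole duration and,
// at the end, calls recheckTxs, which briefly takes recheckMtx too.
func (m *mempoolSketch) update() {
	m.updateMtx.Lock()
	defer m.updateMtx.Unlock()
	m.recheckTxs()
}

func (m *mempoolSketch) recheckTxs() {
	m.recheckMtx.Lock() // both mutexes are held here, briefly
	m.recheckCursor = 1
	m.recheckMtx.Unlock()
}

// globalCb models the ABCI response callback: it takes only recheckMtx,
// so at worst it waits until recheckTxs releases it.
func (m *mempoolSketch) globalCb() int {
	m.recheckMtx.Lock()
	defer m.recheckMtx.Unlock()
	return m.recheckCursor
}

func main() {
	m := &mempoolSketch{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); m.update() }()
		go func() { defer wg.Done(); m.globalCb() }()
	}
	wg.Wait()
	fmt.Println("no deadlock:", m.globalCb())
}
```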
`CListMempool.updateMtx`

`updateMtx` is used by the following `CListMempool` methods:

- `CheckTx`: `updateMtx` is locked during the whole execution
  - calls `CheckTxAsync`, which locks `localClient`'s mutex
  - then calls `ReqRes.SetCallback`, which locks its own mutex
- `Lock` and `Unlock`: these methods are only used to lock `updateMtx` during the whole execution of `state.execution.Commit`, which calls, in order: `mempool.FlushAppConn` and `mempool.Update` (see below)

`updateMtx` is also locked during the whole execution of:

- `ReapMaxBytesMaxGas`
- `ReapMaxTxs`
- `Flush`
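As an illustration of the `Lock`/`Unlock` pattern above, here is a rough Go sketch of the consensus side holding `updateMtx` across the whole of `Commit`. The names (`commit`, `FlushAppConn`, `Update`) are simplified stand-ins for the real code paths, not the actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-in for CListMempool: Lock/Unlock expose updateMtx to
// the consensus side, which holds it across FlushAppConn and Update.
type mempool struct {
	updateMtx sync.RWMutex
	txs       []string
}

func (m *mempool) Lock()   { m.updateMtx.Lock() }
func (m *mempool) Unlock() { m.updateMtx.Unlock() }

func (m *mempool) FlushAppConn() error { return nil }

// Update drops the committed tx; updateMtx is already held by the caller.
func (m *mempool) Update(committed string) {
	kept := m.txs[:0]
	for _, tx := range m.txs {
		if tx != committed {
			kept = append(kept, tx)
		}
	}
	m.txs = kept
}

// commit mirrors how state.execution.Commit interacts with the mempool:
// take the lock once, then run both steps under it.
func commit(m *mempool, tx string) {
	m.Lock()
	defer m.Unlock()
	_ = m.FlushAppConn()
	m.Update(tx)
}

func main() {
	m := &mempool{txs: []string{"a", "b", "c"}}
	commit(m, "b")
	fmt.Println(m.txs)
}
```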
`CListMempool.recheckMtx`

`recheckMtx` is locked during the whole execution of:

- `initRecheckCursors`, called at the beginning of `recheckTxs`, which is called at the end of `Update`
- `resCbRecheck`, when processing `CheckTx` responses of type `Recheck`
- `recheckCursorIsNil`, called:
  - at the beginning of `globalCb`
  - at the beginning of the callback function returned by `reqResCb`, though it's actually not really needed there
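To make the three call sites above concrete, here is an illustrative-only Go sketch of cursor fields guarded by a `recheckMtx`-style mutex. The element type and field handling are simplified stand-ins, not the real clist-based implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// elem is a simplified stand-in for *clist.CElement.
type elem struct{ next *elem }

type recheckState struct {
	mtx           sync.Mutex // stands in for recheckMtx
	recheckCursor *elem
	recheckEnd    *elem
}

// initRecheckCursors: called at the beginning of recheckTxs.
func (r *recheckState) initRecheckCursors(front, back *elem) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	r.recheckCursor, r.recheckEnd = front, back
}

// resCbRecheck: advances the cursor when a Recheck response arrives.
func (r *recheckState) resCbRecheck() {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	if r.recheckCursor == nil {
		return
	}
	if r.recheckCursor == r.recheckEnd {
		r.recheckCursor = nil // rechecking is done
	} else {
		r.recheckCursor = r.recheckCursor.next
	}
}

// recheckCursorIsNil: the guard checked at the beginning of globalCb.
func (r *recheckState) recheckCursorIsNil() bool {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	return r.recheckCursor == nil
}

func main() {
	a, b := &elem{}, &elem{}
	a.next = b
	r := &recheckState{}
	r.initRecheckCursors(a, b)
	fmt.Println(r.recheckCursorIsNil()) // cursor set: false
	r.resCbRecheck()                    // a -> b
	r.resCbRecheck()                    // b == end -> nil
	fmt.Println(r.recheckCursorIsNil()) // done: true
}
```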
`localClient`'s mtx

There are two methods locking this mutex that are relevant for the mempool:

- `SetResponseCallback`, used to set `CListMempool.globalCb` in `CListMempool`'s constructor
- `CheckTxAsync`, which is called by:
  - `mempool.CheckTx`
  - `mempool.recheckTxs` (after calling `initRecheckCursors`)
I am not sure I understand why a new mutex is actually needed. Would it be possible, please, to have an example of a bad interleaving? As far as I know, asynchronous checks are executed one at a time, whatever the client is, so this should be fine.
If additional concurrency control is required, I would opt for using the already existing CListMempool.updateMtx instead of adding a new one. The rationale is that re-checks do change the fields of the mempool; hence, they should be guarded by the same mutex as other mempool methods. As discussed here, this requires making the mutex reentrant.
I have tried with only updateMtx, but it doesn't work. Btw, the recheck logic only changes the mempool via RemoveTxByKey, which is currently not guarded by any mutex. RemoveTxByKey is called by Update, which holds updateMtx, and by resCbRecheck, which doesn't lock anything. We could use updateMtx to guard just RemoveTxByKey only if it is reentrant, as you said.
I added a new test, TestMempoolRecheckRace, that shows the problem more clearly. The test:
- adds a bunch of transactions to the mempool
- updates one transaction to force rechecking the rest
- adds one more transaction → this fails with a data race on the variable `recheckCursor`, which is accessed from both `reqResCb` and `recheckTxs`.
(This new test is actually a simplified version of TestMempoolNoCacheOverflow, so it's a bit redundant and we could merge them.)
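A hypothetical distillation of the race in Go (simplified names; the real test exercises the mempool itself): two goroutines touch a shared cursor, one resetting it (as recheckTxs does) and one advancing it (as the reqResCb callback does). With recheckMtx-style locking around every access, `go test -race` has nothing to report:

```go
package main

import (
	"fmt"
	"sync"
)

// cursor stands in for recheckCursor. Every access goes through the mutex;
// removing the Lock/Unlock pairs reintroduces the data race that
// TestMempoolRecheckRace exposes under the race detector.
type cursor struct {
	mtx sync.Mutex
	pos int
}

func (c *cursor) reset()   { c.mtx.Lock(); c.pos = 0; c.mtx.Unlock() }
func (c *cursor) advance() { c.mtx.Lock(); c.pos++; c.mtx.Unlock() }
func (c *cursor) get() int { c.mtx.Lock(); defer c.mtx.Unlock(); return c.pos }

func main() {
	c := &cursor{}
	var wg sync.WaitGroup
	wg.Add(2)
	// Models the reqResCb side: advances the cursor per response.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			c.advance()
		}
	}()
	// Models the recheckTxs side: resets the cursor.
	go func() {
		defer wg.Done()
		for i := 0; i < 10; i++ {
			c.reset()
		}
	}()
	wg.Wait()
	fmt.Println("final position read safely:", c.get() >= 0)
}
```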
Co-authored-by: Thane Thomson <connect@thanethomson.com>
cason
left a comment
I don't like using a single PR to address two problems.
If there is a concurrency problem, a test should be added to catch the bug; then a fix should be added to solve it. That is a PR by itself.
otrack
left a comment
I have a couple of concerns regarding these changes. First, as discussed in the team meeting, it would be nice to have some feedback from the community before we enforce any invariant between cacheSize and mempoolSize. Second, if we need to handle concurrent re-checks, I suggest using the mempool mutex instead of adding a new one.
I reverted all the changes related to testing to focus the discussion on the concurrency problem with the recheck variables. Fixing the mempool capacity for testing is left for a separate issue: #1144.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Closing as there is no agreement on the proposed solution. The bug is now reported in #1827.
There is a data race on the variable `recheckCursor` in `CListMempool`, as demonstrated by the new test `TestMempoolRecheckRace`. This problem only arises with the socket connection to the application.

The proposed solution to the concurrency issue is to add a new mutex, `recheckMtx`, for accessing the variables `recheckCursor` and `recheckEnd`. (Note that we want to eventually remove these variables in #895, so the mutex will be removed too.)

PR checklist
- Changelog entry added in `.changelog` (we use unclog to manage our changelog)
- Updated relevant documentation (`docs/` or `spec/`) and code comments