
mempool: Fix concurrency bug on variable recheckCursor #1116

Closed

hvanz wants to merge 20 commits into main from hvanz/mempool-test-capacities

Conversation

@hvanz
Collaborator

@hvanz hvanz commented Jul 12, 2023

There is a data race on the variable recheckCursor in CListMempool, as demonstrated by the new test TestMempoolRecheckRace. This problem only arises with the socket connection to the application.

The proposed solution to the concurrency issue is to add a new mutex recheckMtx for accessing the variables recheckCursor and recheckEnd. (Note that we want to eventually remove these variables in #895, so the mutex will be removed too.)


PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments

@hvanz hvanz added the mempool label Jul 12, 2023
@hvanz hvanz self-assigned this Jul 12, 2023
@hvanz hvanz requested a review from otrack July 12, 2023 17:35
@hvanz hvanz added the wip Work in progress label Jul 12, 2023
@hvanz hvanz marked this pull request as ready for review July 13, 2023 13:28
@hvanz hvanz requested a review from a team as a code owner July 13, 2023 13:28
@hvanz hvanz requested a review from a team July 13, 2023 13:28
@hvanz hvanz removed the wip Work in progress label Jul 13, 2023
@hvanz hvanz changed the title mempool: Fix mempool capacity when testing mempool: Fix mempool capacity when testing (and fix concurrency bug) Jul 13, 2023
@hvanz hvanz added the backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x label Jul 13, 2023
// serial (ie. by abci responses which are called in serial).
recheckCursor *clist.CElement // next expected response
recheckEnd *clist.CElement // re-checking stops here
recheckMtx cmtsync.Mutex
Contributor

How do we reason about the interaction of this mutex with updateMtx, as well as the client mutex? Is there perhaps a relatively easy way of visualizing it, given the complexity of the callback system?

Collaborator Author

@hvanz hvanz Jul 14, 2023

Yeah, it's not easy to reason about all the mutexes and I was not happy to add a new one. At least we know we plan to eliminate the variables recheckCursor and recheckEnd in #895, which will also remove the new recheckMtx mutex.

In the meantime, below is a list of all the places where these mutexes are locked. I still have to find a good way to visualize their interactions. After a first analysis, I can say that updateMtx and recheckMtx could both be locked simultaneously for a brief time in the recheckTxs function, called at the end of Update. This can make globalCb wait until recheckMtx is released, but it doesn't lead to a deadlock. CheckTx is already blocked from executing during Update because Update holds updateMtx. And I don't see how localClient's mutex can interfere with the other two.

CListMempool.updateMtx

updateMtx is used by the following CListMempool methods:

  • CheckTx
    • updateMtx is locked during the whole execution
    • calls CheckTxAsync, which locks localClient's mutex
    • then calls ReqRes.SetCallback, which locks its own mutex
  • Lock and Unlock
    • These methods are only used to lock updateMtx during the whole execution of state.execution.Commit, which calls, in order:
      • mempool.FlushAppConn
      • mempool.Update (see below)

updateMtx is also locked during the whole execution of:

  • ReapMaxBytesMaxGas
  • ReapMaxTxs
  • Flush

CListMempool.recheckMtx

recheckMtx is locked during the whole execution of:

  • initRecheckCursors, called at the beginning of recheckTxs, which is called at the end of Update.
  • resCbRecheck, when processing CheckTx responses of type Recheck.
  • recheckCursorIsNil, called:
    • at the beginning of globalCb
    • at the beginning of the callback function returned by reqResCb, though it's actually not really needed here.

localClient's mtx

There are two methods locking the mutex that are relevant for the mempool:

  • SetResponseCallback
    • used to set CListMempool.globalCb in CListMempool's constructor
  • CheckTxAsync, which is called by:
    • mempool.CheckTx
    • mempool.recheckTxs (after calling initRecheckCursors)

Contributor

I am not sure I understand why a new mutex is actually needed. Would it be possible, please, to have an example of a bad interleaving? As far as I know, asynchronous checks are executed one at a time, whatever the client is, so this should be fine.

If additional concurrency control is required, I would opt for using the already existing CListMempool.updateMtx rather than adding a new one. The rationale is that re-checks do change the fields of the mempool; hence, they should be guarded by the same mutex as other mempool methods. As discussed here, this requires making the mutex reentrant.

Collaborator Author

I have tried only with updateMtx but it doesn't work. Btw, the recheck logic only changes the mempool with RemoveTxByKey, which is currently not guarded by any mutex. RemoveTxByKey is called by Update, which holds updateMtx, and by resCbRecheck that doesn't lock. We could just use updateMtx to guard RemoveTxByKey only if it is reentrant, as you said.

I added a new test TestMempoolRecheckRace that shows more clearly the problem. The test is:

  1. add a bunch of transactions to the mempool
  2. update one transaction to force rechecking the rest
  3. add again one transaction → this will fail with a data race for the variable recheckCursor, which is accessed from the functions reqResCb and recheckTxs.

(This new test is actually a simplified version of TestMempoolNoCacheOverflow, so it's a bit redundant and we can merge them.)

hvanz and others added 3 commits July 13, 2023 19:20

@cason cason left a comment


I don't like using a PR to address two problems.

If there is a concurrency problem, I think a test should be added to catch the bug. Then a solution should be added to solve the problem. This is a PR by itself.

Contributor

@otrack otrack left a comment

I have a couple of concerns regarding these changes. First, as discussed in the team meeting, it would be nice to have some feedback from the community before we enforce any invariant between cacheSize and mempoolSize. Second, if we need to handle concurrent re-checks, I suggest using the mempool mutex instead of adding a new one.


@hvanz
Collaborator Author

hvanz commented Jul 19, 2023

I reverted all the changes related to testing to focus the discussion on the concurrency problem on the recheck variables. So fixing the mempool capacity for testing is left for a separate issue #1144.

@hvanz hvanz changed the title mempool: Fix mempool capacity when testing (and fix concurrency bug) mempool: Fix concurrency bug on recheck variables Jul 19, 2023
@hvanz hvanz changed the title mempool: Fix concurrency bug on recheck variables mempool: Fix concurrency bug on recheck variable Jul 19, 2023
@hvanz hvanz changed the title mempool: Fix concurrency bug on recheck variable mempool: Fix concurrency bug on recheckCursor variable Jul 19, 2023
@hvanz hvanz changed the title mempool: Fix concurrency bug on recheckCursor variable mempool: Fix concurrency bug on variable recheckCursor Jul 19, 2023
@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale For use by stalebot label Jul 30, 2023
@thanethomson thanethomson added wip Work in progress and removed stale For use by stalebot labels Jul 30, 2023
@hvanz hvanz changed the title mempool: Fix concurrency bug on variable recheckCursor mempool: Fix concurrency bug on variable recheckCursor [WIP] Aug 9, 2023
@lasarojc lasarojc changed the title mempool: Fix concurrency bug on variable recheckCursor [WIP] mempool: Fix concurrency bug on variable recheckCursor Nov 13, 2023
@hvanz
Collaborator Author

hvanz commented Dec 14, 2023

Closing as there is no agreement on the proposed solution. The bug is now reported in #1827.

@hvanz hvanz closed this Dec 14, 2023
@hvanz hvanz deleted the hvanz/mempool-test-capacities branch July 11, 2024 15:30

Labels

backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x mempool wip Work in progress
