mempool: Handle concurrent requests in recheck callback#895

Closed
hvanz wants to merge 12 commits into main from hernan/mempool-concurrent-rechecks

Conversation

@hvanz
Collaborator

@hvanz hvanz commented May 30, 2023

Addresses tendermint/tendermint#5519

Also, this PR simplifies the implementation of the recheck callback by eliminating the global pointers recheckCursor and recheckEnd, which were used to traverse the list of transactions while processing recheck requests, and also in other parts of the code to decide whether there were still unprocessed recheck requests. This PR replaces them with the existing txsMap, which maps tx keys to tx entries, and with a new counter variable that keeps track of how many requests remain to be processed.
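The map-plus-counter approach can be sketched roughly as follows. This is an illustrative, self-contained Go sketch, not the actual CometBFT code: the names (`recheckState`, `start`, `done`) are hypothetical, and the real PR reuses the existing txsMap rather than a separate pending map.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical sketch of the design described above: instead of
// recheckCursor/recheckEnd pointers into the tx list, keep a set of
// pending entries plus a counter of outstanding recheck requests,
// so responses can be handled in any order.
type txKey [32]byte

type recheckState struct {
	mtx     sync.Mutex
	pending map[txKey]struct{} // txs awaiting a re-CheckTx response
	count   int                // outstanding recheck requests
}

func newRecheckState() *recheckState {
	return &recheckState{pending: map[txKey]struct{}{}}
}

// start registers every tx about to be rechecked.
func (rs *recheckState) start(keys []txKey) {
	rs.mtx.Lock()
	defer rs.mtx.Unlock()
	for _, k := range keys {
		rs.pending[k] = struct{}{}
	}
	rs.count = len(keys)
}

// done handles one re-CheckTx response, arriving in any order, and
// reports whether it was the last outstanding one.
func (rs *recheckState) done(k txKey) bool {
	rs.mtx.Lock()
	defer rs.mtx.Unlock()
	if _, ok := rs.pending[k]; !ok {
		return false // unknown or duplicate response: ignore
	}
	delete(rs.pending, k)
	rs.count--
	return rs.count == 0
}

func main() {
	rs := newRecheckState()
	a, b := txKey{1}, txKey{2}
	rs.start([]txKey{a, b})
	fmt.Println(rs.done(b)) // false: one request still outstanding
	fmt.Println(rs.done(a)) // true: all rechecks processed
}
```

Because completion is tracked per key rather than by position, out-of-order responses (as produced by a gRPC client) no longer confuse the bookkeeping.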

Previously, it was assumed that re-CheckTx requests were processed and handled sequentially, but this is not true for gRPC ABCI clients (see comments in tendermint/tendermint#5519).

Built on top of #894


PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments

@hvanz hvanz added mempool wip Work in progress labels May 30, 2023
@hvanz hvanz self-assigned this May 30, 2023
@otrack
Contributor

otrack commented May 31, 2023

Handling ABCI calls in order at the client is mandatory. Otherwise, a FlushAppConn operation may return earlier than a prior resCbFirstTime or resCbRecheck call. It follows that Update may execute concurrently with such a (re-)check operation. This breaks some base invariants in clist_mempool.go and/or may raise exceptions (e.g., removing the same element twice from a CList).
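The double-remove hazard can be illustrated with a minimal sketch (not CometBFT code; `container/list` stands in for CList, and all names are made up for illustration). Keying removals on a map makes a second, racing removal a harmless no-op instead of list corruption:

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// Sketch of the hazard described above: two code paths (e.g. Update and
// a recheck callback) may both try to remove the same tx. Guarding the
// removal behind a keyed map under a lock makes the operation idempotent.
type store struct {
	mtx  sync.Mutex
	l    *list.List
	byID map[string]*list.Element
}

func newStore() *store {
	return &store{l: list.New(), byID: map[string]*list.Element{}}
}

func (s *store) add(id string) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.byID[id] = s.l.PushBack(id)
}

// remove is idempotent: only the first caller for a given id actually
// unlinks the element; later callers find no map entry and return false.
func (s *store) remove(id string) bool {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	e, ok := s.byID[id]
	if !ok {
		return false
	}
	delete(s.byID, id)
	s.l.Remove(e)
	return true
}

func main() {
	s := newStore()
	s.add("tx1")
	fmt.Println(s.remove("tx1")) // true: removed
	fmt.Println(s.remove("tx1")) // false: already gone, no panic
	fmt.Println(s.l.Len())       // 0
}
```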

In my view, the gRPC ABCI client does provide the desired in-order semantics.

I would not add more concurrency to CListMempool, because there seems to be no need for it performance-wise.
In fact, I would do the opposite, that is, add locking, typically in the callbacks. In its current state, this class works because it relies heavily on (partly missing) assumptions with respect to consensus and the client application.

Nevertheless, the idea of removing the cursor is, I believe, a nice one, because it does simplify the codebase :)

@hvanz hvanz added the backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x label Jun 5, 2023
@hvanz hvanz marked this pull request as ready for review June 5, 2023 17:43
@hvanz hvanz requested a review from a team as a code owner June 5, 2023 17:43
@hvanz hvanz removed the wip Work in progress label Jun 5, 2023
@hvanz
Collaborator Author

hvanz commented Jun 5, 2023

It seems that the ABCI Flush method is not relevant at all (see tendermint/tendermint#6994). While simplifying the client interface for v0.36, it was proposed for removal (see comments in tendermint/tendermint#7607), but that never happened. The client interface for ABCI has a comment saying that this method should be removed as it is not implemented. And even the SDK does not implement a handler for Flush requests.

@sergio-mena
Collaborator

I didn't go thoroughly through the patch, but have some comments:

  • I feel that we shouldn't backport this to v0.38.x: the QA has finished, so this is not covered by those tests (it will be as part of the v0.39.x QA). These changes are far from trivial and change concurrency in the mempool: an extra reason not to backport them.
  • Some info on some of the comments above
    • the "official" reason for having the Flush ABCI call is in this section of the spec. Now, I am aware that the semantics we provide for each of the clients (socket, gRPC, local) is currently different (actually, inconsistent). But I still see Flush makes sense in the context of socket and gRPC. We can discuss this synchronously if you think it'd help
    • The SDK does not implement Flush probably because it is irrelevant for a local client (SDK uses the local client, whereby ABCI calls become function calls, so you don't need to flush). It's a different story for remote clients (gRPC and socket)

@hvanz hvanz removed the backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x label Jun 8, 2023
@hvanz
Collaborator Author

hvanz commented Jun 8, 2023

I feel that we shouldn't backport this to v0.38.x:

You're right. I removed the label.

@hvanz
Collaborator Author

hvanz commented Jun 8, 2023

  • the "official" reason for having the Flush ABCI call is in this section of the spec.

This is what the spec says:

Before invoking Commit, CometBFT locks the mempool and flushes the mempool connection. This ensures that no new messages will be received on the mempool connection during this processing step, providing an opportunity to safely update all four connection states to the latest committed state at the same time.

I would say that only the locking part is needed to ensure that no new messages will be received. Is this correct?

@otrack
Contributor

otrack commented Jun 8, 2023

  • the "official" reason for having the Flush ABCI call is in this section of the spec.

This is what the spec says:

Before invoking Commit, CometBFT locks the mempool and flushes the mempool connection. This ensures that no new messages will be received on the mempool connection during this processing step, providing an opportunity to safely update all four connection states to the latest committed state at the same time.

I would say that only the locking part is needed to ensure that no new messages will be received. Is this correct?

I believe that this is needed, otherwise we might end up with two concurrent remove calls for the same transaction on the txs field, leading to a panic. I will add a test based on your PR which exercises this.

To avoid the above problem, I am advocating taking locks in the callbacks. However, this poses a problem because the local client hangs: the local client is synchronous, and locks are not re-entrant in Go. We should change this pattern because this client is not doing what it is supposed to. However, as the SDK is using it, I wonder how tractable such a change is.

@otrack
Contributor

otrack commented Jun 8, 2023

A bad interleaving that explains why we need flush (at least for the moment) is here.

@otrack
Contributor

otrack commented Jun 9, 2023

Just a few more thoughts about improving the parallelism by allowing concurrent checkTx.

It seems that there is no easy way to do this. The knot of the problem is this strange pattern in the mempool where one passes a callback to another callback.

Maybe the best approach would be to have a re-entrant read/write lock for the CListMempool. Another idea is to ask AppConnMempool::CheckTxAsync to take as a parameter the callback to invoke once the application answers.
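The second idea can be sketched as a small API change. Everything below is hypothetical (the real AppConnMempool interface differs); the point is only that passing the completion callback as an explicit parameter removes the callback-inside-a-callback pattern:

```go
package main

import "fmt"

// Sketch: CheckTxAsync takes the callback to invoke once the
// application answers, instead of routing every response through a
// single global callback that then dispatches to a per-request one.
type Response struct{ Code uint32 }

type appConn struct{}

// CheckTxAsync receives the per-request completion callback directly.
func (a *appConn) CheckTxAsync(tx []byte, cb func(Response)) {
	go func() {
		// simulate the application answering the CheckTx request
		cb(Response{Code: 0})
	}()
}

func main() {
	done := make(chan Response, 1)
	conn := &appConn{}
	conn.CheckTxAsync([]byte("tx1"), func(r Response) { done <- r })
	fmt.Println((<-done).Code) // 0
}
```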

All in all, I am not sure this is a priority, given the speed of the current implementation: around 60us to check a Tx with the remote client.

@sergio-mena
Collaborator

I would say that only the locking part is needed to ensure that no new messages will be received. Is this correct?

This has never been thoroughly discussed. Maybe we can include it in today's meeting?

@hvanz
Collaborator Author

hvanz commented Jun 20, 2023

It seems that there is no easy way to do this. The knot of the problem is this strange pattern in the mempool where one passes a callback to another callback.

@otrack Here's an idea to untangle this knot. The callback resCbFirstTime is implemented this way because it needs to record the sender when the transaction is added to the mempool. Each transaction is stored in a mempoolTx together with its senders. We can take out the senders of mempoolTx (and therefore out of txs) and store them in a new map txSenders from txKey to the list of senders. This map will need to be kept consistent with the transactions in the mempool. Then we could call resCbFirstTime from globalCb, and take the senders from txSenders when we need to add the transaction to the mempool. See #1010 and #1032.
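The txSenders idea can be sketched as follows; this is an illustrative standalone version, not the code in #1010/#1032, and the method names are made up. Senders live in their own map keyed by tx key, outside the tx entries, so they can be recorded before the tx is in the mempool and looked up when it is added:

```go
package main

import (
	"fmt"
	"sync"
)

// Sketch of the txSenders map described above: tx key -> the peers
// that sent the tx, kept separately from the mempool's tx entries.
type txKey [32]byte

type txSenders struct {
	mtx sync.Mutex
	m   map[txKey][]uint16 // tx key -> sender peer IDs
}

func newTxSenders() *txSenders {
	return &txSenders{m: map[txKey][]uint16{}}
}

// addSender records a peer that sent the tx, possibly before the tx
// has been added to the mempool.
func (s *txSenders) addSender(k txKey, peer uint16) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.m[k] = append(s.m[k], peer)
}

// pop returns and clears the senders recorded for a tx, e.g. when the
// tx is finally added to the mempool (or discarded), keeping the map
// consistent with the mempool's contents.
func (s *txSenders) pop(k txKey) []uint16 {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	senders := s.m[k]
	delete(s.m, k)
	return senders
}

func main() {
	ts := newTxSenders()
	k := txKey{1}
	ts.addSender(k, 7)
	ts.addSender(k, 9)
	fmt.Println(ts.pop(k)) // [7 9]
	fmt.Println(ts.pop(k)) // []
}
```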

@hvanz hvanz marked this pull request as draft June 27, 2023 08:01
@hvanz hvanz added the wip Work in progress label Jun 27, 2023
@melekes
Collaborator

melekes commented Jan 31, 2024

@hvanz very interesting 👍 any plans to resume this?

@hvanz
Collaborator Author

hvanz commented Apr 26, 2024

Closing as it is currently not an option to break the FIFO ordering when checking transactions.

@hvanz hvanz closed this Apr 26, 2024
@hvanz hvanz removed the wip Work in progress label Apr 26, 2024
@hvanz hvanz deleted the hernan/mempool-concurrent-rechecks branch July 11, 2024 15:30