mempool: Handle concurrent requests in recheck callback#895

Closed
hvanz wants to merge 12 commits into main from hernan/mempool-concurrent-rechecks

Conversation

@hvanz
Collaborator

@hvanz hvanz commented May 30, 2023

Addresses tendermint/tendermint#5519

Also, this PR simplifies the implementation of the recheck callback by eliminating the global pointers recheckCursor and recheckEnd, which were used to traverse the list of transactions while processing recheck requests, and also in other parts of the code to decide whether there were still unprocessed recheck requests. This PR replaces them with the existing txsMap, which maps tx keys to tx entries, and with a new counter variable that keeps track of how many requests remain to be processed.
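The map-plus-counter approach can be sketched roughly as follows. This is an illustrative, self-contained Go sketch, not the actual CometBFT code: the names (`recheckState`, `start`, `done`) are hypothetical, and the real PR reuses the existing txsMap rather than a separate pending map.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical sketch of the design described above: instead of
// recheckCursor/recheckEnd pointers into the tx list, keep a set of
// pending entries plus a counter of outstanding recheck requests,
// so responses can be handled in any order.
type txKey [32]byte

type recheckState struct {
	mtx     sync.Mutex
	pending map[txKey]struct{} // txs awaiting a re-CheckTx response
	count   int                // outstanding recheck requests
}

func newRecheckState() *recheckState {
	return &recheckState{pending: map[txKey]struct{}{}}
}

// start registers every tx about to be rechecked.
func (rs *recheckState) start(keys []txKey) {
	rs.mtx.Lock()
	defer rs.mtx.Unlock()
	for _, k := range keys {
		rs.pending[k] = struct{}{}
	}
	rs.count = len(keys)
}

// done handles one re-CheckTx response, arriving in any order, and
// reports whether it was the last outstanding one.
func (rs *recheckState) done(k txKey) bool {
	rs.mtx.Lock()
	defer rs.mtx.Unlock()
	if _, ok := rs.pending[k]; !ok {
		return false // unknown or duplicate response: ignore
	}
	delete(rs.pending, k)
	rs.count--
	return rs.count == 0
}

func main() {
	rs := newRecheckState()
	a, b := txKey{1}, txKey{2}
	rs.start([]txKey{a, b})
	fmt.Println(rs.done(b)) // false: one request still outstanding
	fmt.Println(rs.done(a)) // true: all rechecks processed
}
```

Because completion is tracked per key rather than by position, out-of-order responses (as produced by a gRPC client) no longer confuse the bookkeeping.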

Previously, it was assumed that re-CheckTx requests were processed and handled sequentially, but this is not true for gRPC ABCI clients (see comments in tendermint/tendermint#5519).

Built on top of #894


PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments

@hvanz hvanz added mempool wip Work in progress labels May 30, 2023
@hvanz hvanz self-assigned this May 30, 2023
@otrack
Contributor

otrack commented May 31, 2023

Handling ABCI calls in order at the client is mandatory. Otherwise, a FlushAppConn operation may return earlier than a prior resCbFirstTime or resCbRecheck call. It follows that Update may execute concurrently with such a (re-)check operation. This breaks some base invariants in clist_mempool.go and/or may raise exceptions (e.g., removing the same element twice from a CList).
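The double-remove hazard can be illustrated with a minimal sketch (not CometBFT code; `container/list` stands in for CList, and all names are made up for illustration). Keying removals on a map makes a second, racing removal a harmless no-op instead of list corruption:

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// Sketch of the hazard described above: two code paths (e.g. Update and
// a recheck callback) may both try to remove the same tx. Guarding the
// removal behind a keyed map under a lock makes the operation idempotent.
type store struct {
	mtx  sync.Mutex
	l    *list.List
	byID map[string]*list.Element
}

func newStore() *store {
	return &store{l: list.New(), byID: map[string]*list.Element{}}
}

func (s *store) add(id string) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.byID[id] = s.l.PushBack(id)
}

// remove is idempotent: only the first caller for a given id actually
// unlinks the element; later callers find no map entry and return false.
func (s *store) remove(id string) bool {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	e, ok := s.byID[id]
	if !ok {
		return false
	}
	delete(s.byID, id)
	s.l.Remove(e)
	return true
}

func main() {
	s := newStore()
	s.add("tx1")
	fmt.Println(s.remove("tx1")) // true: removed
	fmt.Println(s.remove("tx1")) // false: already gone, no panic
	fmt.Println(s.l.Len())       // 0
}
```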

In my view, the gRPC ABCI client does provide the desired in-order semantics.

I would not add more concurrency to CListMempool, because there seems to be no need for it performance-wise.
In fact, I would do the opposite, that is, add locking, typically in the callbacks. In its current state, this class works because it relies heavily on (partly missing) assumptions with respect to consensus and the client application.

Nevertheless, the idea of removing the cursor is, I believe, a nice one, because it does simplify the codebase :)

@hvanz hvanz added the backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x label Jun 5, 2023
@hvanz hvanz marked this pull request as ready for review June 5, 2023 17:43
@hvanz hvanz requested a review from a team as a code owner June 5, 2023 17:43
@hvanz hvanz removed the wip Work in progress label Jun 5, 2023
@hvanz
Collaborator Author

hvanz commented Jun 5, 2023

It seems that the ABCI Flush method is not relevant at all (see tendermint/tendermint#6994). While simplifying the client interface for v0.36, it was proposed for removal (see comments in tendermint/tendermint#7607), but that never happened. The client interface for ABCI has a comment saying that this method should be removed as it is not implemented. And even the SDK does not implement a handler for Flush requests.

@sergio-mena
Collaborator

I didn't go thoroughly through the patch, but have some comments:

  • I feel that we shouldn't backport this to v0.38.x: the QA has finished, so this is not covered by those tests (it will be as part of the v0.39.x QA). These changes are far from trivial and change concurrency in the mempool: an extra reason not to backport them.
  • Some info on some of the comments above
    • the "official" reason for having the Flush ABCI call is in this section of the spec. Now, I am aware that the semantics we provide for each of the clients (socket, gRPC, local) is currently different (actually, inconsistent). But I still see Flush makes sense in the context of socket and gRPC. We can discuss this synchronously if you think it'd help
    • The SDK does not implement Flush probably because it is irrelevant for a local client (SDK uses the local client, whereby ABCI calls become function calls, so you don't need to flush). It's a different story for remote clients (gRPC and socket)

@hvanz hvanz removed the backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x label Jun 8, 2023
@hvanz
Collaborator Author

hvanz commented Jun 8, 2023

I feel that we shouldn't backport this to v0.38.x:

You're right. I removed the label.

@hvanz
Collaborator Author

hvanz commented Jun 8, 2023

  • the "official" reason for having the Flush ABCI call is in this section of the spec.

This is what the spec says:

Before invoking Commit, CometBFT locks the mempool and flushes the mempool connection. This ensures that no new messages will be received on the mempool connection during this processing step, providing an opportunity to safely update all four connection states to the latest committed state at the same time.

I would say that only the locking part is needed to ensure that no new messages will be received. Is this correct?

@otrack
Contributor

otrack commented Jun 8, 2023

  • the "official" reason for having the Flush ABCI call is in this section of the spec.

This is what the spec says:

Before invoking Commit, CometBFT locks the mempool and flushes the mempool connection. This ensures that no new messages will be received on the mempool connection during this processing step, providing an opportunity to safely update all four connection states to the latest committed state at the same time.

I would say that only the locking part is needed to ensure that no new messages will be received. Is this correct?

I believe that this is needed, otherwise we might end up with two concurrent remove calls for the same transaction on the txs field, leading to a panic. I will add a test based on your PR which exercises this.

To avoid the above problem, I am advocating taking locks in the callbacks. However, this poses a problem because the local client hangs: the local client is synchronous, and locks are not re-entrant in Go. We should change this pattern because this client is not doing what it is supposed to. However, as the SDK is using it, I wonder how tractable such a change is.

@otrack
Contributor

otrack commented Jun 8, 2023

A bad interleaving that explains why we need flush (at least for the moment) is here.

@otrack
Contributor

otrack commented Jun 9, 2023

Just a few more thoughts about improving the parallelism by allowing concurrent checkTx.

It seems that there is no easy way to do this. The knot of the problem is this strange pattern in the mempool where one passes a callback to another callback.

Maybe the best approach would be to have a re-entrant read/write lock for the CListMempool. Another idea is to ask AppConnMempool::CheckTxAsync to take as a parameter the callback to invoke once the application answers.
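The second idea can be sketched as a small API change. Everything below is hypothetical (the real AppConnMempool interface differs); the point is only that passing the completion callback as an explicit parameter removes the callback-inside-a-callback pattern:

```go
package main

import "fmt"

// Sketch: CheckTxAsync takes the callback to invoke once the
// application answers, instead of routing every response through a
// single global callback that then dispatches to a per-request one.
type Response struct{ Code uint32 }

type appConn struct{}

// CheckTxAsync receives the per-request completion callback directly.
func (a *appConn) CheckTxAsync(tx []byte, cb func(Response)) {
	go func() {
		// simulate the application answering the CheckTx request
		cb(Response{Code: 0})
	}()
}

func main() {
	done := make(chan Response, 1)
	conn := &appConn{}
	conn.CheckTxAsync([]byte("tx1"), func(r Response) { done <- r })
	fmt.Println((<-done).Code) // 0
}
```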

All in all, I am not sure this is a priority, given the speed of the current implementation: around 60us to check a Tx with the remote client.

@sergio-mena
Collaborator

I would say that only the locking part is needed to ensure that no new messages will be received. Is this correct?

This has never been thoroughly discussed. Maybe we can include it in today's meeting?

@hvanz
Collaborator Author

hvanz commented Jun 20, 2023

It seems that there is no easy way to do this. The knot of the problem is this strange pattern in the mempool where one passes a callback to another callback.

@otrack Here's an idea to untangle this knot. The callback resCbFirstTime is implemented this way because it needs to record the sender when the transaction is added to the mempool. Each transaction is stored in a mempoolTx together with its senders. We can take out the senders of mempoolTx (and therefore out of txs) and store them in a new map txSenders from txKey to the list of senders. This map will need to be kept consistent with the transactions in the mempool. Then we could call resCbFirstTime from globalCb, and take the senders from txSenders when we need to add the transaction to the mempool. See #1010 and #1032.
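The txSenders idea can be sketched as follows; this is an illustrative standalone version, not the code in #1010/#1032, and the method names are made up. Senders live in their own map keyed by tx key, outside the tx entries, so they can be recorded before the tx is in the mempool and looked up when it is added:

```go
package main

import (
	"fmt"
	"sync"
)

// Sketch of the txSenders map described above: tx key -> the peers
// that sent the tx, kept separately from the mempool's tx entries.
type txKey [32]byte

type txSenders struct {
	mtx sync.Mutex
	m   map[txKey][]uint16 // tx key -> sender peer IDs
}

func newTxSenders() *txSenders {
	return &txSenders{m: map[txKey][]uint16{}}
}

// addSender records a peer that sent the tx, possibly before the tx
// has been added to the mempool.
func (s *txSenders) addSender(k txKey, peer uint16) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.m[k] = append(s.m[k], peer)
}

// pop returns and clears the senders recorded for a tx, e.g. when the
// tx is finally added to the mempool (or discarded), keeping the map
// consistent with the mempool's contents.
func (s *txSenders) pop(k txKey) []uint16 {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	senders := s.m[k]
	delete(s.m, k)
	return senders
}

func main() {
	ts := newTxSenders()
	k := txKey{1}
	ts.addSender(k, 7)
	ts.addSender(k, 9)
	fmt.Println(ts.pop(k)) // [7 9]
	fmt.Println(ts.pop(k)) // []
}
```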

@hvanz hvanz marked this pull request as draft June 27, 2023 08:01
@hvanz hvanz added the wip Work in progress label Jun 27, 2023
@melekes
Collaborator

melekes commented Jan 31, 2024

@hvanz very interesting 👍 any plans to resume this?

@hvanz
Collaborator Author

hvanz commented Apr 26, 2024

Closing as it is currently not an option to break the FIFO ordering when checking transactions.

@hvanz hvanz closed this Apr 26, 2024
@hvanz hvanz removed the wip Work in progress label Apr 26, 2024
@hvanz hvanz deleted the hernan/mempool-concurrent-rechecks branch July 11, 2024 15:30