fix(mempool): Fix data race when rechecking with async ABCI client#2268
Conversation
thank you
cason left a comment
Left some comments, some are just me thinking aloud.
I am not sure that forcing a maximum delay for ReCheckTx to return is the right approach here. Is it possible to come up with a different (asynchronous) solution?
```go
}

// This test used to cause a data race when rechecking (see https://github.com/cometbft/cometbft/issues/1827).
func TestMempoolRecheckRace(t *testing.T) {
```
This test passes with the original code (in main).
I think it should break there, and pass here.
```go
}

// recheckTxs sends all transactions in the mempool to the app for re-validation. When the function
// returns, all recheck responses from the app have been processed.
```
This is not about this PR, but blocking here is terrible. Why do we need to conclude the recheck before moving on? And, if this is indeed the case, why do we use CheckTxAsync, which is expected to be... asynchronous?
It's not blocking; that's why there is a timeout, so that rechecking can finish even when not all the CheckTx responses have arrived.
Remember that recheck happens after the mempool has been updated. And once rechecking finishes then the updateMtx lock is released and adding new transactions via CListMempool.CheckTx is possible again.
Sorry, ignore what I said above about it being non-blocking. Rechecking needs to block waiting for all reCheckTx responses because it needs to finish before new transactions are allowed to be checked again. This preserves the sequential order when rechecking transactions.
Bear in mind that, even if we wait here until all txs have been re-checked, it still makes sense that CheckTx is async: with non-local ABCI clients we will pipeline the per-request work:
- request serialization, request transmission (over TCP or gRPC), request deserialization at the app, etc.

If we make CheckTx synchronous, all of that has to happen for one tx before we can start with the next. This may make no difference with the local client, but it is likely to have a high performance impact over a socket or gRPC.
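The pipelining argument can be made concrete with a toy simulation (hypothetical, not CometBFT code): with N requests and a fixed per-request round-trip latency, a synchronous client pays roughly N×latency, while a pipelined client keeps all requests in flight and pays roughly one latency.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	numReqs = 10
	latency = 5 * time.Millisecond // simulated per-request round trip
)

// sendSync waits for each response before sending the next request.
func sendSync() time.Duration {
	start := time.Now()
	for i := 0; i < numReqs; i++ {
		time.Sleep(latency)
	}
	return time.Since(start)
}

// sendPipelined keeps all requests in flight and waits for all
// responses at the end (analogous to flushing the app connection).
func sendPipelined() time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < numReqs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(latency) // responses arrive concurrently
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	fmt.Printf("sync=%v pipelined=%v\n", sendSync(), sendPipelined())
}
```

On any machine the pipelined variant finishes in roughly one latency (~5ms) versus ~50ms for the sequential one, which is the performance gap being described for socket or gRPC clients.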
```go
mem.recheckCursor = mem.txs.Front()
mem.recheckEnd = mem.txs.Back()
mem.recheck.init(mem.txs.Front(), mem.txs.Back())
```
Sorry, I forgot why we need to keep track of mem.txs.Back(). I imagine it's to avoid re-checking newly received TXs while running recheckTxs().
However, haven't we locked the mempool so that this doesn't happen?
This is part of the original recheck logic: it's used to check whether recheck.cursor has reached the end of the list, that is, when recheck.cursor == recheck.end. You're right that the mempool is locked, so the list doesn't change during one rechecking process, but I think it's better to store the last value in the Recheck struct so that the whole recheck logic is isolated.
Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not finished). This will cause a panic the next time a new transaction arrives and the `CheckTx` response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this.

1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever.
2. After sending all recheck requests, flush the app connection and set a timer to wait for late recheck responses. After the timer expires, finalise rechecking properly. If a `CheckTx` response arrives late, we consider it safe to ignore.

This PR implements option 2, as we cannot risk blocking the node forever waiting for a response. With the proposed changes, when we reach the end of the rechecking process, all requests and responses will have been processed or discarded, and `recheckCursor` will always be `nil`.

This PR also:
- refactors all recheck logic into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic.
- adds 4 new tests.

---

#### PR checklist
- [x] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments
- [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Co-authored-by: Daniel <daniel.cason@informal.systems>

(cherry picked from commit f3775f4)