fix(mempool): Fix data race when rechecking with async ABCI client (backport #2268)#3019
Merged
…2268)

Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, if the `CheckTxAsync` request for the last transaction fails, the `resCbRecheck` callback on the response is never called, and the recheck variables are left in an inconsistent state (`recheckCursor != nil`, meaning that recheck has not finished). This causes a panic the next time a new transaction arrives and its `CheckTx` response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback.

We have two options to fix this.
1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever.
2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a `CheckTx` response arrives late, we consider it safe to ignore.

This PR implements option 2, as we cannot risk blocking the node forever while waiting for a response.

With the proposed changes, when we reach the end of the rechecking process, all requests and responses will have been processed or discarded, and `recheckCursor` will always be `nil`.

This PR also:
- refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic.
- adds 4 new tests.
---

#### PR checklist

- [x] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments
- [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Co-authored-by: Daniel <daniel.cason@informal.systems>
(cherry picked from commit f3775f4)
hvanz
approved these changes
May 7, 2024
This is an automatic backport of pull request #2268 done by [Mergify](https://mergify.com).