Skip to content

fix(mempool): Fix data race when rechecking with async ABCI client (backport #2268)#3019

Merged
hvanz merged 1 commit intov1.xfrom
mergify/bp/v1.x/pr-2268
May 7, 2024
Merged

fix(mempool): Fix data race when rechecking with async ABCI client (backport #2268)#3019
hvanz merged 1 commit intov1.xfrom
mergify/bp/v1.x/pr-2268

Conversation

@mergify
Copy link
Contributor

@mergify mergify bot commented May 7, 2024

Fixes #2225 and #1827

(#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the CheckTxAsync request for the last transaction fails, then the resCbRecheck callback on the response is not called, and the recheck variables end up in a wrong state (recheckCursor != nil, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the CheckTx response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where CheckTx responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this.

  1. When we call CheckTxAsync, block waiting for a response. If the response never arrives, it will block Update forever.
  2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it.

This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response.

With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and recheckCursor will always be nil.

This PR also:

  • refactors all recheck logic to put it into a separate recheck struct. The fix to the bug described above is the only change in the recheck logic.
  • adds 4 new tests.

PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments
  • Title follows the Conventional Commits spec

This is an automatic backport of pull request #2268 done by [Mergify](https://mergify.com).

…2268)

Fixes ~~#2225 and~~ #1827

(#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the `CheckTxAsync` request for the last
transaction fails, then the `resCbRecheck` callback on the response is
not called, and the recheck variables end up in a wrong state
(`recheckCursor != nil`, meaning that recheck has not finished). This
will cause a panic next time a new transaction arrives, and the
`CheckTx` response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where
`CheckTx` responses may arrive late or never, so the response won't be
processed by the callback. We have two options to fix this.
1. When we call `CheckTxAsync`, block waiting for a response. If the
response never arrives, it will block `Update` forever.
2. After sending all recheck requests, we flush the app connection and
set a timer to wait for late recheck responses. After the timer expires,
we finalise rechecking properly. If a CheckTx response arrives late, we
consider that it is safe to ignore it.

This PR implements option 2, as we cannot allow the risk to block the
node forever waiting for a response.

With the proposed changes, now when we reach the end of the rechecking
process, all requests and responses will be processed or discared, and
`recheckCursor` will always be `nil`.

This PR also:
- refactors all recheck logic to put it into a separate `recheck`
struct. The fix to the bug described above is the only change in the
recheck logic.
- adds 4 new tests.

---

#### PR checklist

- [x] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Co-authored-by: Daniel <daniel.cason@informal.systems>
(cherry picked from commit f3775f4)
@mergify mergify bot requested a review from a team as a code owner May 7, 2024 06:41
@mergify mergify bot requested a review from a team May 7, 2024 06:41
@hvanz hvanz merged commit 75cba12 into v1.x May 7, 2024
@hvanz hvanz deleted the mergify/bp/v1.x/pr-2268 branch May 7, 2024 06:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant