Skip to content

fix(mempool): Fix data race when rechecking with async ABCI client (backport #2268)#3020

Merged
hvanz merged 4 commits intov0.38.xfrom
mergify/bp/v0.38.x/pr-2268
May 7, 2024
Merged

fix(mempool): Fix data race when rechecking with async ABCI client (backport #2268)#3020
hvanz merged 4 commits intov0.38.xfrom
mergify/bp/v0.38.x/pr-2268

Conversation

@mergify
Copy link
Contributor

@mergify mergify bot commented May 7, 2024

Fixes #2225 and #1827

(#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the CheckTxAsync request for the last transaction fails, then the resCbRecheck callback on the response is not called, and the recheck variables end up in a wrong state (recheckCursor != nil, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the CheckTx response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where CheckTx responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this.

  1. When we call CheckTxAsync, block waiting for a response. If the response never arrives, it will block Update forever.
  2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it.

This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response.

With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and recheckCursor will always be nil.

This PR also:

  • refactors all recheck logic to put it into a separate recheck struct. The fix to the bug described above is the only change in the recheck logic.
  • adds 4 new tests.

PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments
  • Title follows the Conventional Commits spec

This is an automatic backport of pull request #2268 done by [Mergify](https://mergify.com).

…2268)

Fixes ~~#2225 and~~ #1827

(#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the `CheckTxAsync` request for the last
transaction fails, then the `resCbRecheck` callback on the response is
not called, and the recheck variables end up in a wrong state
(`recheckCursor != nil`, meaning that recheck has not finished). This
will cause a panic next time a new transaction arrives, and the
`CheckTx` response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where
`CheckTx` responses may arrive late or never, so the response won't be
processed by the callback. We have two options to fix this.
1. When we call `CheckTxAsync`, block waiting for a response. If the
response never arrives, it will block `Update` forever.
2. After sending all recheck requests, we flush the app connection and
set a timer to wait for late recheck responses. After the timer expires,
we finalise rechecking properly. If a CheckTx response arrives late, we
consider that it is safe to ignore it.

This PR implements option 2, as we cannot allow the risk to block the
node forever waiting for a response.

With the proposed changes, now when we reach the end of the rechecking
process, all requests and responses will be processed or discared, and
`recheckCursor` will always be `nil`.

This PR also:
- refactors all recheck logic to put it into a separate `recheck`
struct. The fix to the bug described above is the only change in the
recheck logic.
- adds 4 new tests.

---

#### PR checklist

- [x] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Co-authored-by: Daniel <daniel.cason@informal.systems>
(cherry picked from commit f3775f4)

# Conflicts:
#	.changelog/v0.38.3/bug-fixes/1827-fix-recheck-async.md
#	config/toml.go
#	docs/references/config/config.toml.md
#	mempool/clist_mempool.go
#	mempool/clist_mempool_test.go
@mergify
Copy link
Contributor Author

mergify bot commented May 7, 2024

Cherry-pick of f3775f4 has failed:

On branch mergify/bp/v0.38.x/pr-2268
Your branch is up to date with 'origin/v0.38.x'.

You are currently cherry-picking commit f3775f4bb.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   .changelog/unreleased/improvements/1827-config-mempool-recheck-timeout.md
	modified:   config/config.go

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	added by them:   .changelog/v0.38.3/bug-fixes/1827-fix-recheck-async.md
	both modified:   config/toml.go
	deleted by us:   docs/references/config/config.toml.md
	both modified:   mempool/clist_mempool.go
	both modified:   mempool/clist_mempool_test.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@mergify mergify bot requested a review from a team as a code owner May 7, 2024 06:41
@mergify mergify bot added the conflicts label May 7, 2024
@hvanz hvanz merged commit 9ccdb9b into v0.38.x May 7, 2024
@hvanz hvanz deleted the mergify/bp/v0.38.x/pr-2268 branch May 7, 2024 12:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants