perf: TxSearch pagination (backport #2855)#2910

Merged
melekes merged 2 commits into v1.x from mergify/bp/v1.x/pr-2855
Apr 27, 2024
Conversation

mergify bot (Contributor) commented Apr 27, 2024

Since moving to faster blocks, Osmosis public RPC nodes have noticed massive RAM spikes, resulting in nodes constantly crashing:

![Screenshot 2024-04-20 at 11 25 36 AM](https://github.com/osmosis-labs/cometbft/assets/40078083/18d0513e-25fc-4510-b4bd-b48472a9df69)

Heap profiling made clear that the issue was coming from TxSearch, which was unmarshaling a huge amount of data.

![Screenshot 2024-04-20 at 11 28 29 AM](https://github.com/osmosis-labs/cometbft/assets/40078083/5d88a66a-c72d-4752-8770-a2c00e6d7669)

After looking into the method, the issue is that txSearch retrieves all hashes matching the query condition, but then calls Get on (and therefore unmarshals) every matched transaction from the transaction index store, regardless of whether the transaction falls within the pagination request. Therefore, calling txSearch on an event that occurs in almost every transaction causes the node to unmarshal essentially every transaction.

The key already contains all the data we need to sort the transaction hashes without unmarshaling the transactions at all! This PR filters and sorts the hashes, paginates them, and then only retrieves the transactions that fall within the requested page.
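The paginate-before-fetch idea can be sketched as follows. This is a minimal illustration, not CometBFT's actual code: the `txKey` type and `paginate` helper are hypothetical, standing in for the ordering data the index key already carries (height and index within the block).

```go
package main

import (
	"fmt"
	"sort"
)

// txKey carries the ordering data already present in the index key,
// so no transaction needs to be unmarshaled in order to sort.
// (Hypothetical type, for illustration only.)
type txKey struct {
	Height int64
	Index  uint32
	Hash   string
}

// paginate sorts the matched keys by (height, index) and returns only
// the slice of keys that falls on the requested 1-based page.
func paginate(keys []txKey, page, perPage int) []txKey {
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].Height != keys[j].Height {
			return keys[i].Height < keys[j].Height
		}
		return keys[i].Index < keys[j].Index
	})
	start := (page - 1) * perPage
	if start >= len(keys) {
		return nil
	}
	end := start + perPage
	if end > len(keys) {
		end = len(keys)
	}
	return keys[start:end]
}

func main() {
	// Suppose the query matched five transactions; only the two on
	// page 2 would then be fetched from the store and unmarshaled.
	matched := []txKey{
		{3, 0, "c"}, {1, 0, "a"}, {2, 1, "b2"}, {2, 0, "b1"}, {4, 0, "d"},
	}
	for _, k := range paginate(matched, 2, 2) {
		// Only now would we Get(k.Hash) and unmarshal the transaction.
		fmt.Println(k.Hash)
	}
}
```

The fix is O(n log n) in the number of matched keys for the sort, but only `perPage` transactions are ever read and unmarshaled, instead of all n.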

We have run this patch on two of our RPC nodes, and have seen zero spikes on the patched ones thus far!

![Screenshot 2024-04-20 at 11 33 11 AM](https://github.com/osmosis-labs/cometbft/assets/40078083/fd815f81-5756-45bd-b1c0-818e6774ea53)

#### PR checklist

- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code comments
- [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

This is an automatic backport of pull request #2855 done by [Mergify](https://mergify.com).


(cherry picked from commit b420f07)

# Conflicts:
#	rpc/core/tx.go
@mergify mergify bot requested a review from a team as a code owner April 27, 2024 07:25
@mergify mergify bot requested a review from a team April 27, 2024 07:25
@mergify mergify bot added the conflicts label Apr 27, 2024
mergify bot (Contributor, Author) commented Apr 27, 2024

Cherry-pick of b420f07 has failed:

On branch mergify/bp/v1.x/pr-2855
Your branch is up to date with 'origin/v1.x'.

You are currently cherry-picking commit b420f0765.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   .changelog/unreleased/improvements/2855-fix-txsearch-performance.md
	modified:   internal/inspect/inspect_test.go
	modified:   rpc/core/blocks.go
	modified:   state/indexer/sink/psql/backport.go
	modified:   state/pruner_test.go
	modified:   state/txindex/indexer.go
	modified:   state/txindex/kv/kv.go
	modified:   state/txindex/kv/kv_bench_test.go
	modified:   state/txindex/kv/kv_test.go
	modified:   state/txindex/mocks/tx_indexer.go
	modified:   state/txindex/null/null.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   rpc/core/tx.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@melekes melekes merged commit b2db3ed into v1.x Apr 27, 2024
@melekes melekes deleted the mergify/bp/v1.x/pr-2855 branch April 27, 2024 07:58
czarcas7ic added a commit to osmosis-labs/cometbft that referenced this pull request May 10, 2024
sergio-mena pushed a commit that referenced this pull request Jul 25, 2024
sergio-mena added a commit that referenced this pull request Jul 25, 2024
See #2855 or #2910 for a detailed description

---

#### PR checklist

- ~[ ] Tests written/updated~
- ~[ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)~
- ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Adam Tucker <adam@osmosis.team>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
PaddyMc pushed a commit to osmosis-labs/cometbft that referenced this pull request Aug 19, 2024