
perf: TxSearch pagination#2855

Merged
melekes merged 11 commits into cometbft:main from osmosis-labs:adam/upstream-txs-pagination
Apr 27, 2024

Conversation

@czarcas7ic
Contributor

@czarcas7ic czarcas7ic commented Apr 20, 2024

Since moving to faster blocks, Osmosis public RPC nodes have noticed massive RAM spikes, resulting in nodes constantly crashing:

Screenshot 2024-04-20 at 11 25 36 AM

After heap profiling, the issue was clearly coming from TxSearch, which was unmarshaling a huge amount of data.

Screenshot 2024-04-20 at 11 28 29 AM

After looking into the method, the issue is that txSearch retrieves all hashes matching the query condition, but then calls Get on (and therefore unmarshals) every filtered transaction from the transaction index store, regardless of whether those transactions fall within the pagination request. So if one calls txSearch on an event that occurs on almost every transaction, the node ends up unmarshaling essentially every transaction.

The key already contains all the data we need to sort the transaction hashes without unmarshaling the transactions at all! This PR filters and sorts the hashes, paginates them, and only then retrieves the transactions that fall within the requested page.
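The filter-sort-paginate-then-fetch approach can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not CometBFT's actual implementation: `txKey` and `paginateKeys` are hypothetical names, and the real index key encodes more fields than height and transaction index.

```go
package main

import (
	"fmt"
	"sort"
)

// txKey is a hypothetical stand-in for the data the index key already
// carries for each matching transaction: its height, its index within
// the block, and its hash. None of this requires unmarshaling the tx.
type txKey struct {
	Height  int64
	TxIndex uint32
	Hash    string
}

// paginateKeys sorts the filtered keys by (height, txIndex) and returns
// only the slice for the requested page. Callers then Get/unmarshal just
// those transactions instead of every match.
func paginateKeys(keys []txKey, page, perPage int) []txKey {
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].Height == keys[j].Height {
			return keys[i].TxIndex < keys[j].TxIndex
		}
		return keys[i].Height < keys[j].Height
	})
	start := (page - 1) * perPage
	if start < 0 || start >= len(keys) {
		return nil // page is out of range
	}
	end := start + perPage
	if end > len(keys) {
		end = len(keys)
	}
	return keys[start:end]
}

func main() {
	keys := []txKey{
		{Height: 3, TxIndex: 0, Hash: "c"},
		{Height: 1, TxIndex: 1, Hash: "b"},
		{Height: 1, TxIndex: 0, Hash: "a"},
	}
	page := paginateKeys(keys, 1, 2)
	fmt.Println(page[0].Hash, page[1].Hash) // prints: a b
}
```

The point of the design is that sorting and slicing operate on small fixed-size key data, so memory use is bounded by the page size rather than by the number of query matches.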

We have run this patch on two of our RPC nodes, and have seen zero spikes on the patched ones thus far!

Screenshot 2024-04-20 at 11 33 11 AM

PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments
  • Title follows the Conventional Commits spec

@czarcas7ic czarcas7ic changed the title txs pagination performance improvements perf: TxSearch pagination Apr 20, 2024
@czarcas7ic czarcas7ic marked this pull request as ready for review April 20, 2024 23:26
@czarcas7ic czarcas7ic requested a review from a team as a code owner April 20, 2024 23:26
@czarcas7ic czarcas7ic requested a review from a team April 20, 2024 23:26
@czarcas7ic
Contributor Author

I don't have perms to add labels, but I would imagine this would be a backport to v1, v0.38, and v0.37

@melekes melekes added backport-to-v0.37.x backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x labels Apr 21, 2024
Collaborator

@melekes melekes left a comment


Thanks @czarcas7ic ❤️

@melekes
Collaborator

melekes commented Apr 21, 2024

I am not sure if we can backport this to v0.37 and v0.38 since it's API-breaking.

@czarcas7ic
Contributor Author

czarcas7ic commented Apr 21, 2024

@melekes no worries, we have it in our fork so it won't affect us.

We should just be cautious and educate chains serving public infra on non-forked versions, as this is a (low-threat) DoS vector. We have pretty heavy rate limiting in place, but since this can be triggered via a single query, the rate limiting doesn't help. It's also worth noting that our infra runs on some pretty beefy machines.

@melekes melekes removed backport-to-v0.37.x backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x labels Apr 22, 2024
@adizere adizere added this to the 2024-Q2 milestone Apr 23, 2024

@cason cason left a comment


Legit.

I didn't test whether the resulting responses work as expected, or how the pagination behaves, but this part reads more like a refactoring.

@melekes melekes enabled auto-merge April 26, 2024 05:57
@melekes
Collaborator

melekes commented Apr 26, 2024

rpc/core/blocks.go:224:7: string `desc` has 3 occurrences, make it a constant (goconst)
	case "desc", "":
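The goconst lint above flags a string literal repeated three times; the usual fix is to hoist it into a named constant. A minimal sketch of that fix, with hypothetical names (the actual identifiers in rpc/core/blocks.go may differ):

```go
package main

import "fmt"

// Hoisting the repeated "desc" literal into a constant satisfies goconst
// and gives the magic string a single definition site.
const (
	orderDesc = "desc"
	orderAsc  = "asc"
)

// normalizeOrder mirrors the kind of switch the linter flagged: an empty
// order string defaults to descending.
func normalizeOrder(order string) string {
	switch order {
	case orderDesc, "": // was: case "desc", "":
		return orderDesc
	case orderAsc:
		return orderAsc
	default:
		return orderDesc
	}
}

func main() {
	fmt.Println(normalizeOrder("")) // prints: desc
}
```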

auto-merge was automatically disabled April 26, 2024 15:18

Head branch was pushed to by a user without write access

@melekes melekes enabled auto-merge April 27, 2024 07:18
@melekes melekes added this pull request to the merge queue Apr 27, 2024
Merged via the queue into cometbft:main with commit b420f07 Apr 27, 2024
mergify bot pushed a commit that referenced this pull request Apr 27, 2024
(cherry picked from commit b420f07)

# Conflicts:
#	rpc/core/tx.go
melekes added a commit that referenced this pull request Apr 27, 2024
This is an automatic backport of pull request #2855 done by [Mergify](https://mergify.com).

---------

Co-authored-by: Adam Tucker <adam@osmosis.team>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
czarcas7ic added a commit to osmosis-labs/cometbft that referenced this pull request May 10, 2024
This is an automatic backport of pull request cometbft#2855 done by [Mergify](https://mergify.com).

---------

Co-authored-by: Adam Tucker <adam@osmosis.team>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
sergio-mena pushed a commit that referenced this pull request Jul 25, 2024
This is an automatic backport of pull request #2855 done by [Mergify](https://mergify.com).

---------

Co-authored-by: Adam Tucker <adam@osmosis.team>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
sergio-mena added a commit that referenced this pull request Jul 25, 2024
See #2855 or #2910 for a detailed description

---

#### PR checklist

- ~[ ] Tests written/updated~
- ~[ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)~
- ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Adam Tucker <adam@osmosis.team>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
inon-man pushed a commit to classic-terra/cometbft that referenced this pull request Jul 31, 2024
PaddyMc pushed a commit to osmosis-labs/cometbft that referenced this pull request Aug 19, 2024
This is an automatic backport of pull request cometbft#2855 done by [Mergify](https://mergify.com).

---------

Co-authored-by: Adam Tucker <adam@osmosis.team>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>

4 participants