perf: TxSearch pagination (backport #2855)#2910
Merged
Conversation
<!-- Please add a reference to the issue that this PR addresses and indicate which files are most critical to review. If it fully addresses a particular issue, please include "Closes #XXX" (where "XXX" is the issue number). If this PR is non-trivial/large/complex, please ensure that you have either created an issue that the team's had a chance to respond to, or had some discussion with the team prior to submitting substantial pull requests. The team can be reached via GitHub Discussions or the Cosmos Network Discord server in the #cometbft channel. GitHub Discussions is preferred over Discord as it allows us to keep track of conversations topically. https://github.com/cometbft/cometbft/discussions If the work in this PR is not aligned with the team's current priorities, please be advised that it may take some time before it is merged - especially if it has not yet been discussed with the team. See the project board for the team's current priorities: https://github.com/orgs/cometbft/projects/1 --> Since moving to faster blocks, Osmosis public RPC nodes have noticed massive RAM spikes, resulting in nodes constantly crashing:  After heap profiling, the issue was clearly coming from TxSearch, showing that it was unmarshaling a huge amount of data.  After looking into the method, the issue is that txSearch retrieves all hashes (filtered by the query condition), but we call Get (and therefore unmarshal) every filtered transaction from the transaction index store, regaurdless whether or not the transactions are within the pagination request. Therefore, if one were to call txSearch on an event that happens on almost every transaction, this causes the node to unmarshal essentially every transaction. We have all the data we need in the key though to sort the transaction hashes without unmarshaling the transactions at all! This PR filters and sorts the hashes, paginates them, and then only retrieves the transactions that fall in the page being requested. We have run this patch on two of our RPC nodes, and have seen zero spikes on the patched ones thus far!  #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec (cherry picked from commit b420f07) # Conflicts: # rpc/core/tx.go
Contributor
Author
|
Cherry-pick of b420f07 has failed: To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally |
melekes
approved these changes
Apr 27, 2024
czarcas7ic
added a commit
to osmosis-labs/cometbft
that referenced
this pull request
May 10, 2024
Since moving to faster blocks, Osmosis public RPC nodes have noticed massive RAM spikes, resulting in nodes constantly crashing:  After heap profiling, the issue was clearly coming from TxSearch, showing that it was unmarshaling a huge amount of data.  After looking into the method, the issue is that txSearch retrieves all hashes (filtered by the query condition), but we call Get (and therefore unmarshal) every filtered transaction from the transaction index store, regaurdless whether or not the transactions are within the pagination request. Therefore, if one were to call txSearch on an event that happens on almost every transaction, this causes the node to unmarshal essentially every transaction. We have all the data we need in the key though to sort the transaction hashes without unmarshaling the transactions at all! This PR filters and sorts the hashes, paginates them, and then only retrieves the transactions that fall in the page being requested. We have run this patch on two of our RPC nodes, and have seen zero spikes on the patched ones thus far!  - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request cometbft#2855 done by [Mergify](https://mergify.com). --------- Co-authored-by: Adam Tucker <adam@osmosis.team> Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
sergio-mena
pushed a commit
that referenced
this pull request
Jul 25, 2024
Since moving to faster blocks, Osmosis public RPC nodes have noticed massive RAM spikes, resulting in nodes constantly crashing:  After heap profiling, the issue was clearly coming from TxSearch, showing that it was unmarshaling a huge amount of data.  After looking into the method, the issue is that txSearch retrieves all hashes (filtered by the query condition), but we call Get (and therefore unmarshal) every filtered transaction from the transaction index store, regaurdless whether or not the transactions are within the pagination request. Therefore, if one were to call txSearch on an event that happens on almost every transaction, this causes the node to unmarshal essentially every transaction. We have all the data we need in the key though to sort the transaction hashes without unmarshaling the transactions at all! This PR filters and sorts the hashes, paginates them, and then only retrieves the transactions that fall in the page being requested. We have run this patch on two of our RPC nodes, and have seen zero spikes on the patched ones thus far!  - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #2855 done by [Mergify](https://mergify.com). --------- Co-authored-by: Adam Tucker <adam@osmosis.team> Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
1 task
sergio-mena
added a commit
that referenced
this pull request
Jul 25, 2024
See #2855 or #2910 for a detailed description --- #### PR checklist - ~[ ] Tests written/updated~ - ~[ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)~ - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Adam Tucker <adam@osmosis.team> Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
PaddyMc
pushed a commit
to osmosis-labs/cometbft
that referenced
this pull request
Aug 19, 2024
Since moving to faster blocks, Osmosis public RPC nodes have noticed massive RAM spikes, resulting in nodes constantly crashing:  After heap profiling, the issue was clearly coming from TxSearch, showing that it was unmarshaling a huge amount of data.  After looking into the method, the issue is that txSearch retrieves all hashes (filtered by the query condition), but we call Get (and therefore unmarshal) every filtered transaction from the transaction index store, regaurdless whether or not the transactions are within the pagination request. Therefore, if one were to call txSearch on an event that happens on almost every transaction, this causes the node to unmarshal essentially every transaction. We have all the data we need in the key though to sort the transaction hashes without unmarshaling the transactions at all! This PR filters and sorts the hashes, paginates them, and then only retrieves the transactions that fall in the page being requested. We have run this patch on two of our RPC nodes, and have seen zero spikes on the patched ones thus far!  - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request cometbft#2855 done by [Mergify](https://mergify.com). --------- Co-authored-by: Adam Tucker <adam@osmosis.team> Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Since moving to faster blocks, Osmosis public RPC nodes have noticed massive RAM spikes, resulting in nodes constantly crashing:
After heap profiling, the issue was clearly coming from TxSearch, showing that it was unmarshaling a huge amount of data.
After looking into the method, the issue is that txSearch retrieves all hashes (filtered by the query condition), but we call Get (and therefore unmarshal) every filtered transaction from the transaction index store, regaurdless whether or not the transactions are within the pagination request. Therefore, if one were to call txSearch on an event that happens on almost every transaction, this causes the node to unmarshal essentially every transaction.
We have all the data we need in the key though to sort the transaction hashes without unmarshaling the transactions at all! This PR filters and sorts the hashes, paginates them, and then only retrieves the transactions that fall in the page being requested.
We have run this patch on two of our RPC nodes, and have seen zero spikes on the patched ones thus far!
PR checklist
.changelog(we use unclog to manage our changelog)docs/orspec/) and code commentsThis is an automatic backport of pull request #2855 done by [Mergify](https://mergify.com).