Conversation
I don't have perms to add labels, but I would imagine this would be a backport to v1, v0.38, and v0.37.
I am not sure if we can backport this to v0.37 and v0.38 since it's API-breaking.
@melekes no worries, we have it in our fork so it won't affect us. We should just be cautious and educate chains running non-forked builds of these versions to serve public infra, since this is a (low-threat) DoS vector. We have pretty heavy rate limiting in place, but since this can be triggered by a single query, the rate limiting doesn't help. Also important to note that our infra runs on some pretty beefy machines.
cason left a comment:
Legit.
I didn't test whether the resulting responses work as expected or how the pagination works, but this part looks more like a refactoring.
Head branch was pushed to by a user without write access
See #2855 or #2910 for a detailed description.

#### PR checklist

- ~[ ] Tests written/updated~
- ~[ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)~
- ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~
- [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Adam Tucker <adam@osmosis.team>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
Since moving to faster blocks, Osmosis public RPC nodes have seen massive RAM spikes, resulting in nodes constantly crashing.
Heap profiling showed that the issue was clearly coming from TxSearch, which was unmarshaling a huge amount of data.
After looking into the method, the issue is that txSearch retrieves all hashes matching the query condition, but then calls Get on (and therefore unmarshals) every matched transaction from the transaction index store, regardless of whether those transactions fall within the pagination request. Therefore, calling txSearch on an event that occurs on almost every transaction causes the node to unmarshal essentially every transaction.
The index key already contains all the data we need to sort the transaction hashes without unmarshaling the transactions at all! This PR filters and sorts the hashes, paginates them, and then only retrieves the transactions that fall within the requested page.
We have run this patch on two of our RPC nodes, and have seen zero spikes on the patched ones thus far!
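To make the approach concrete, below is a minimal Go sketch of the idea, not the actual CometBFT implementation: the matched index keys (which already carry height and transaction index) are sorted and paginated first, and the expensive store Get/unmarshal only happens for the hashes on the requested page. All names here (`txKey`, `getTx`, `pageOfTxs`) and the in-memory map used as a stand-in store are illustrative assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// txKey stands in for the information the transaction index key already
// encodes: block height, tx index within the block, and the tx hash.
type txKey struct {
	height int64
	index  uint32
	hash   string
}

// getTx stands in for the expensive store Get plus unmarshal of a full
// transaction result.
func getTx(store map[string]string, hash string) string {
	return store[hash]
}

// pageOfTxs sorts the matched keys by (height, index), slices out the
// requested page, and only then fetches the transactions on that page.
func pageOfTxs(store map[string]string, matches []txKey, page, perPage int) []string {
	sort.Slice(matches, func(i, j int) bool {
		if matches[i].height != matches[j].height {
			return matches[i].height < matches[j].height
		}
		return matches[i].index < matches[j].index
	})

	start := (page - 1) * perPage
	if start < 0 || start >= len(matches) {
		return nil
	}
	end := start + perPage
	if end > len(matches) {
		end = len(matches)
	}

	out := make([]string, 0, end-start)
	for _, k := range matches[start:end] {
		// Only perPage Gets/unmarshals, not one per match.
		out = append(out, getTx(store, k.hash))
	}
	return out
}

func main() {
	store := map[string]string{"a": "tx-a", "b": "tx-b", "c": "tx-c"}
	matches := []txKey{
		{height: 2, index: 0, hash: "b"},
		{height: 1, index: 0, hash: "a"},
		{height: 3, index: 0, hash: "c"},
	}
	fmt.Println(pageOfTxs(store, matches, 1, 2)) // [tx-a tx-b]
}
```

The key point is that the number of Get/unmarshal calls now scales with the page size rather than with the total number of matched transactions.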
#### PR checklist

- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code comments
- [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec