
perf: TxSearch pagination#2855

Merged
melekes merged 11 commits into cometbft:main from osmosis-labs:adam/upstream-txs-pagination
Apr 27, 2024

Conversation

@czarcas7ic
Contributor

@czarcas7ic czarcas7ic commented Apr 20, 2024

Since moving to faster blocks, Osmosis public RPC nodes have noticed massive RAM spikes, resulting in nodes constantly crashing:

Screenshot 2024-04-20 at 11 25 36 AM

After heap profiling, the issue was clearly coming from TxSearch, which was unmarshaling a huge amount of data.

Screenshot 2024-04-20 at 11 28 29 AM

After looking into the method, the issue is that txSearch retrieves all hashes matching the query condition, but then calls Get on (and therefore unmarshals) every filtered transaction from the transaction index store, regardless of whether those transactions fall within the pagination request. So if one calls txSearch on an event that occurs on almost every transaction, the node ends up unmarshaling essentially every transaction.

The key already contains all the data we need to sort the transaction hashes without unmarshaling the transactions at all! This PR filters and sorts the hashes, paginates them, and only then retrieves the transactions that fall within the requested page.
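The filter-sort-paginate-then-fetch approach can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not CometBFT's actual implementation: `txKey` and `paginateKeys` are hypothetical names, and the real index key encodes more fields than height and transaction index.

```go
package main

import (
	"fmt"
	"sort"
)

// txKey is a hypothetical stand-in for the data the index key already
// carries for each matching transaction: its height, its index within
// the block, and its hash. None of this requires unmarshaling the tx.
type txKey struct {
	Height  int64
	TxIndex uint32
	Hash    string
}

// paginateKeys sorts the filtered keys by (height, txIndex) and returns
// only the slice for the requested page. Callers then Get/unmarshal just
// those transactions instead of every match.
func paginateKeys(keys []txKey, page, perPage int) []txKey {
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].Height == keys[j].Height {
			return keys[i].TxIndex < keys[j].TxIndex
		}
		return keys[i].Height < keys[j].Height
	})
	start := (page - 1) * perPage
	if start < 0 || start >= len(keys) {
		return nil // page is out of range
	}
	end := start + perPage
	if end > len(keys) {
		end = len(keys)
	}
	return keys[start:end]
}

func main() {
	keys := []txKey{
		{Height: 3, TxIndex: 0, Hash: "c"},
		{Height: 1, TxIndex: 1, Hash: "b"},
		{Height: 1, TxIndex: 0, Hash: "a"},
	}
	page := paginateKeys(keys, 1, 2)
	fmt.Println(page[0].Hash, page[1].Hash) // prints: a b
}
```

The point of the design is that sorting and slicing operate on small fixed-size key data, so memory use is bounded by the page size rather than by the number of query matches.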

We have run this patch on two of our RPC nodes, and have seen zero spikes on the patched ones thus far!

Screenshot 2024-04-20 at 11 33 11 AM

PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments
  • Title follows the Conventional Commits spec

@czarcas7ic czarcas7ic changed the title txs pagination performance improvements perf: TxSearch pagination Apr 20, 2024
@czarcas7ic czarcas7ic marked this pull request as ready for review April 20, 2024 23:26
@czarcas7ic czarcas7ic requested a review from a team as a code owner April 20, 2024 23:26
@czarcas7ic czarcas7ic requested a review from a team April 20, 2024 23:26
@czarcas7ic
Contributor Author

I don't have perms to add labels, but I would imagine this would be a backport to v1, v0.38, and v0.37

@melekes melekes added backport-to-v0.37.x backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x labels Apr 21, 2024
Collaborator

@melekes melekes left a comment


Thanks @czarcas7ic ❤️

@melekes
Collaborator

melekes commented Apr 21, 2024

I am not sure if we can backport this to v0.37 and v0.38 since it's API-breaking.

@czarcas7ic
Contributor Author

czarcas7ic commented Apr 21, 2024

@melekes no worries, we have it in our fork so it won't affect us.

We should just be cautious and educate chains serving public infra on non-forked versions, as this is a (low-threat) DoS vector. We have pretty heavy rate limiting in place, but since this can be triggered via a single query, the rate limiting doesn't help. It's also worth noting that our infra runs on some pretty beefy machines.

@melekes melekes removed backport-to-v0.37.x backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x labels Apr 22, 2024
@adizere adizere added this to the 2024-Q2 milestone Apr 23, 2024

@cason cason left a comment


Legit.

I didn't test whether the resulting responses work as expected, or how the pagination behaves, but this part reads more like a refactoring.

@melekes melekes enabled auto-merge April 26, 2024 05:57
@melekes
Collaborator

melekes commented Apr 26, 2024

rpc/core/blocks.go:224:7: string `desc` has 3 occurrences, make it a constant (goconst)
	case "desc", "":
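The goconst lint above flags a string literal repeated three times; the usual fix is to hoist it into a named constant. A minimal sketch of that fix, with hypothetical names (the actual identifiers in rpc/core/blocks.go may differ):

```go
package main

import "fmt"

// Hoisting the repeated "desc" literal into a constant satisfies goconst
// and gives the magic string a single definition site.
const (
	orderDesc = "desc"
	orderAsc  = "asc"
)

// normalizeOrder mirrors the kind of switch the linter flagged: an empty
// order string defaults to descending.
func normalizeOrder(order string) string {
	switch order {
	case orderDesc, "": // was: case "desc", "":
		return orderDesc
	case orderAsc:
		return orderAsc
	default:
		return orderDesc
	}
}

func main() {
	fmt.Println(normalizeOrder("")) // prints: desc
}
```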

auto-merge was automatically disabled April 26, 2024 15:18

Head branch was pushed to by a user without write access

@melekes melekes enabled auto-merge April 27, 2024 07:18
@melekes melekes added this pull request to the merge queue Apr 27, 2024
Merged via the queue into cometbft:main with commit b420f07 Apr 27, 2024
mergify bot pushed a commit that referenced this pull request Apr 27, 2024
(cherry picked from commit b420f07)

# Conflicts:
#	rpc/core/tx.go
melekes added a commit that referenced this pull request Apr 27, 2024
This is an automatic backport of pull request #2855 done by [Mergify](https://mergify.com).

---------

Co-authored-by: Adam Tucker <adam@osmosis.team>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
czarcas7ic added a commit to osmosis-labs/cometbft that referenced this pull request May 10, 2024
This is an automatic backport of pull request cometbft#2855 done by [Mergify](https://mergify.com).

---------

Co-authored-by: Adam Tucker <adam@osmosis.team>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
sergio-mena pushed a commit that referenced this pull request Jul 25, 2024
This is an automatic backport of pull request #2855 done by [Mergify](https://mergify.com).

---------

Co-authored-by: Adam Tucker <adam@osmosis.team>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
sergio-mena added a commit that referenced this pull request Jul 25, 2024
See #2855 or #2910 for a detailed description

---

#### PR checklist

- ~[ ] Tests written/updated~
- ~[ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)~
- ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Adam Tucker <adam@osmosis.team>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
inon-man pushed a commit to classic-terra/cometbft that referenced this pull request Jul 31, 2024
PaddyMc pushed a commit to osmosis-labs/cometbft that referenced this pull request Aug 19, 2024
This is an automatic backport of pull request cometbft#2855 done by [Mergify](https://mergify.com).

---------

Co-authored-by: Adam Tucker <adam@osmosis.team>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>

4 participants