perf(p2p): Reduce the p2p metrics overhead. #3411

Merged
melekes merged 11 commits into main from dev/reduce_p2p_metric_overhead
Jul 26, 2024

Conversation

@ValarDragon
Contributor

@ValarDragon ValarDragon commented Jul 3, 2024

Closes #2840

We do this by batching the calls to Prometheus (which make many allocations), so that they no longer block the send and recv routines.

This is a 25% speedup to recvRoutine, and a 60% speedup to peer.Send.

By napkin estimate, this also saves 8% of newObject call time, which hopefully also means over 8% of GC overhead.
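The batching pattern described above can be sketched roughly as follows. This is a minimal illustration only, with hypothetical type and method names (`pendingMetrics`, `addSend`, `flush`), not the actual CometBFT implementation:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pendingMetrics accumulates per-connection byte counts locally, so the hot
// send/recv paths only pay for an atomic add. The allocation-heavy
// Prometheus calls happen later, in a periodic flush.
// (Illustrative sketch; not the actual CometBFT types.)
type pendingMetrics struct {
	sendBytes int64 // updated atomically from the send routine
	recvBytes int64 // updated atomically from the recv routine
}

func (p *pendingMetrics) addSend(n int64) { atomic.AddInt64(&p.sendBytes, n) }
func (p *pendingMetrics) addRecv(n int64) { atomic.AddInt64(&p.recvBytes, n) }

// flush atomically drains the pending counts. A real implementation would
// hand the drained values to a Prometheus Counter's Add method here.
func (p *pendingMetrics) flush() (send, recv int64) {
	return atomic.SwapInt64(&p.sendBytes, 0), atomic.SwapInt64(&p.recvBytes, 0)
}

func main() {
	var pm pendingMetrics
	pm.addSend(100)
	pm.addSend(50)
	pm.addRecv(30)
	s, r := pm.flush()
	fmt.Println(s, r) // 150 30
	s, r = pm.flush()
	fmt.Println(s, r) // 0 0 (counts were drained)
}
```

Using `atomic.SwapInt64` in `flush` means no increments are lost even if the send/recv routines add concurrently during a flush.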


PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments
  • Title follows the Conventional Commits spec

@ValarDragon ValarDragon requested a review from a team as a code owner July 3, 2024 12:42
@ValarDragon ValarDragon requested a review from a team July 3, 2024 12:42
@ValarDragon
Contributor Author

ValarDragon commented Jul 3, 2024

This was a 60% reduction to GC time 👀 in one benchmark, so this actually sped up the net CPU profile time by 30% for Osmosis, based on two 1-hour profiles.

(Note that Osmosis GC has many other terms eliminated, e.g. debug logs, and my other open PRs. On main Comet branches, GC will be dominated by Comet debug logs.)

EDIT: I am worried there is noise in the GC reduction (e.g. something else in gossip may also have gotten cheaper, since this seems too good on closer inspection). It seems my baseline for the old GC time may have been much slower than normal blocks.

@cason

cason commented Jul 5, 2024

How did you get these performance-improvement numbers? They are impressive.

I probably need a way to test connections in isolation to experimentally check the impact of these changes.

@cason

cason commented Jul 5, 2024

A second point to be made, which has worried me for a while, is that the metric caches are private types. This is not exactly a problem, as they are internal to this package, but they are passed to some constructors (e.g., peerConfig), making it impossible to use the public types of the p2p package because one of their parameters is private.

In summary, is there any way to make the metrics cache types part of the public p2p.Metrics type?


@cason cason left a comment

Some general comments regarding the implementation.

For me the only blocker is that we are updating these stats every 10 seconds, instead of every send/receive call. Should we reconsider this granularity?

@melekes
Collaborator

melekes commented Jul 5, 2024

A more general comment: how confident are we that we need these metrics, given they are costing us so much CPU?

ValarDragon and others added 2 commits July 5, 2024 22:49
Co-authored-by: Daniel <daniel.cason@informal.systems>
Co-authored-by: Daniel <daniel.cason@informal.systems>
@ValarDragon
Contributor Author

I actually found them really helpful for debugging bandwidth. The metric is still a cost after this PR, but I need to re-profile after cason's point that I'm copying the struct here.

@cason cason added the wip Work in progress label Jul 8, 2024
@melekes melekes added p2p and removed wip Work in progress labels Jul 9, 2024
@melekes
Collaborator

melekes commented Jul 9, 2024

@ValarDragon I've removed metricsLabelCache because I've realized it's now unused. I.e., metric labels are not shared between peer caches because you've removed the ValueToMetricLabel call from peer. Could you get the profile with the latest changes and confirm that the overhead is gone? Thanks 🙏
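For context on what was removed: a label cache like metricsLabelCache exists to memoize the allocating conversion from values to Prometheus label strings. A hedged sketch with hypothetical names and illustrative sanitization logic (the real ValueToMetricLabel conversion differs):

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// labelCache memoizes the conversion of values into Prometheus label
// strings, since that conversion allocates on every call.
// (Illustrative sketch, not the actual CometBFT metricsLabelCache.)
type labelCache struct {
	mtx   sync.RWMutex
	cache map[string]string
}

func newLabelCache() *labelCache {
	return &labelCache{cache: make(map[string]string)}
}

// ValueToLabel returns a sanitized metric label for v, computing it at
// most once per distinct input.
func (c *labelCache) ValueToLabel(v string) string {
	c.mtx.RLock()
	label, ok := c.cache[v]
	c.mtx.RUnlock()
	if ok {
		return label
	}
	// Hypothetical sanitization: lowercase, spaces to underscores.
	label = strings.ReplaceAll(strings.ToLower(v), " ", "_")
	c.mtx.Lock()
	c.cache[v] = label
	c.mtx.Unlock()
	return label
}

func main() {
	c := newLabelCache()
	fmt.Println(c.ValueToLabel("Block Part")) // computed: block_part
	fmt.Println(c.ValueToLabel("Block Part")) // cached: block_part
}
```

Once the per-peer caches stopped sharing labels, this memoization no longer bought anything, which is why it could be deleted.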

@melekes melekes added the needs-information Waiting for additional information or feedback label Jul 9, 2024
@melekes melekes requested a review from cason July 9, 2024 08:33

@cason cason left a comment

Great! I loved that this removes the private mlc parameter from Peer and other (public) components.

The only part I don't agree with is reporting metrics only every 10s:

const metricsTickerDuration = 10 * time.Second

whereas before we reported them on every send/receive call.

Should we reduce it to, say, 1s or 0.5s?

@melekes

This comment was marked as resolved.

@cason

cason commented Jul 10, 2024

The only part I don't agree with is reporting metrics only every 10s:

why? Prometheus default scrape interval is 1m https://prometheus.io/docs/prometheus/latest/configuration/configuration/

If you say so, great. Let's merge it.

Then, can we somehow compare the Prometheus output with and without this PR?

@ValarDragon
Contributor Author

I can try to see if I notice any difference in our prod RPCs' Grafana metrics. But as @melekes noted, it's not that fine-grained, so I'm not sure I'll notice anything.

@cason

cason commented Jul 22, 2024

I'll wait for @melekes to be back, but I am good with the current state of this PR.

@cason cason added metrics and removed needs-information Waiting for additional information or feedback labels Jul 22, 2024
@melekes
Collaborator

melekes commented Jul 26, 2024

The only part I don't agree with is reporting metrics only every 10s:

why? Prometheus default scrape interval is 1m https://prometheus.io/docs/prometheus/latest/configuration/configuration/

If you say so, great. Let's merge it.

Then, can we somehow compare the Prometheus output with and without this PR?

Please ignore my earlier comment. We lose granularity if our reporting interval (10s) is >= the Prometheus scrape interval. Let's lower it to 1s. We are spawning a goroutine now, so it shouldn't be a burden.

melekes added 3 commits July 26, 2024 16:39
We lose granularity if our reporting interval (10s) is >= the Prometheus
scrape interval. Let's lower it to 1s. We are spawning a goroutine now,
so it shouldn't be a burden.
@melekes melekes enabled auto-merge July 26, 2024 12:41

We have to remove line 44 added by another PR.

@melekes melekes added this pull request to the merge queue Jul 26, 2024
Merged via the queue into main with commit 94d42a9 Jul 26, 2024
@melekes melekes deleted the dev/reduce_p2p_metric_overhead branch July 26, 2024 13:33
mergify bot pushed a commit that referenced this pull request Jul 26, 2024
Co-authored-by: Daniel <daniel.cason@informal.systems>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
(cherry picked from commit 94d42a9)
melekes pushed a commit that referenced this pull request Jul 27, 2024
This is an automatic backport of pull request #3411 done by
[Mergify](https://mergify.com).

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
ValarDragon added a commit to osmosis-labs/cometbft that referenced this pull request Aug 19, 2024
ValarDragon added a commit to osmosis-labs/cometbft that referenced this pull request Aug 19, 2024

Successfully merging this pull request may close these issues.

Lower allocation overhead of metrics in peer.Send
