perf(p2p): Reduce the p2p metrics overhead. #3411

Merged
melekes merged 11 commits into main from dev/reduce_p2p_metric_overhead
Jul 26, 2024

Conversation

@ValarDragon
Contributor

@ValarDragon ValarDragon commented Jul 3, 2024

Closes #2840

We do this by batching the calls to Prometheus (which make many allocations), so that they no longer block the send and recv routines.

This is a 25% speedup to recvRoutine, and a 60% speedup to peer.Send.

By napkin estimate, this also saves 8% of newObject call time, which hopefully also means over 8% of GC overhead.
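The batching pattern described above can be sketched roughly as follows. This is a minimal illustration only, with hypothetical type and method names (`pendingMetrics`, `addSend`, `flush`), not the actual CometBFT implementation:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pendingMetrics accumulates per-connection byte counts locally, so the hot
// send/recv paths only pay for an atomic add. The allocation-heavy
// Prometheus calls happen later, in a periodic flush.
// (Illustrative sketch; not the actual CometBFT types.)
type pendingMetrics struct {
	sendBytes int64 // updated atomically from the send routine
	recvBytes int64 // updated atomically from the recv routine
}

func (p *pendingMetrics) addSend(n int64) { atomic.AddInt64(&p.sendBytes, n) }
func (p *pendingMetrics) addRecv(n int64) { atomic.AddInt64(&p.recvBytes, n) }

// flush atomically drains the pending counts. A real implementation would
// hand the drained values to a Prometheus Counter's Add method here.
func (p *pendingMetrics) flush() (send, recv int64) {
	return atomic.SwapInt64(&p.sendBytes, 0), atomic.SwapInt64(&p.recvBytes, 0)
}

func main() {
	var pm pendingMetrics
	pm.addSend(100)
	pm.addSend(50)
	pm.addRecv(30)
	s, r := pm.flush()
	fmt.Println(s, r) // 150 30
	s, r = pm.flush()
	fmt.Println(s, r) // 0 0 (counts were drained)
}
```

Using `atomic.SwapInt64` in `flush` means no increments are lost even if the send/recv routines add concurrently during a flush.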


PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments
  • Title follows the Conventional Commits spec

@ValarDragon ValarDragon requested a review from a team as a code owner July 3, 2024 12:42
@ValarDragon ValarDragon requested a review from a team July 3, 2024 12:42
@ValarDragon
Contributor Author

ValarDragon commented Jul 3, 2024

This was a 60% reduction to GC time 👀 in one benchmark, so this actually sped up the net CPU profile time by 30% for Osmosis, based on two 1-hour profiles.

(Note that Osmosis GC has many other terms eliminated, e.g. debug logs, and my other open PRs. On main Comet branches, GC will be dominated by Comet debug logs.)

EDIT: I am worried there is noise in the GC reduction (e.g. something else in gossip may also have gotten cheaper, since this seems too good on closer inspection). It seems my baseline for the old GC time may have been much slower than normal blocks.

@cason

cason commented Jul 5, 2024

How did you get these performance-improvement numbers? They are impressive.

I probably need a way to test connections in isolation to experimentally check the impact of these changes.

@cason

cason commented Jul 5, 2024

A second point to be made, which has worried me for a while, is that the metric caches are private types. This is not exactly a problem, as they are internal to this package, but they are passed to some constructors (e.g., peerConfig), making it impossible to use the public types of the p2p package because one of their parameters is private.

In summary, is there any way to make the metrics cache types part of the public p2p.Metrics type?


@cason cason left a comment

Some general comments regarding the implementation.

For me the only blocker is that we are updating these stats every 10 seconds, instead of every send/receive call. Should we reconsider this granularity?

@melekes
Collaborator

melekes commented Jul 5, 2024

A more general comment: how confident are we that we need these metrics, given they are costing us so much CPU?

ValarDragon and others added 2 commits July 5, 2024 22:49
Co-authored-by: Daniel <daniel.cason@informal.systems>
Co-authored-by: Daniel <daniel.cason@informal.systems>
@ValarDragon
Contributor Author

I actually found them really helpful for debugging bandwidth. The metric is still a cost after this PR, but I need to re-profile after cason's point that I'm copying the struct here.

@cason cason added the wip Work in progress label Jul 8, 2024
@melekes melekes added p2p and removed wip Work in progress labels Jul 9, 2024
@melekes
Collaborator

melekes commented Jul 9, 2024

@ValarDragon I've removed metricsLabelCache because I've realized it's now unused. I.e., metric labels are not shared between peer caches because you've removed the ValueToMetricLabel call from peer. Could you get the profile with the latest changes and confirm that the overhead is gone? Thanks 🙏
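For context on what was removed: a label cache like metricsLabelCache exists to memoize the allocating conversion from values to Prometheus label strings. A hedged sketch with hypothetical names and illustrative sanitization logic (the real ValueToMetricLabel conversion differs):

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// labelCache memoizes the conversion of values into Prometheus label
// strings, since that conversion allocates on every call.
// (Illustrative sketch, not the actual CometBFT metricsLabelCache.)
type labelCache struct {
	mtx   sync.RWMutex
	cache map[string]string
}

func newLabelCache() *labelCache {
	return &labelCache{cache: make(map[string]string)}
}

// ValueToLabel returns a sanitized metric label for v, computing it at
// most once per distinct input.
func (c *labelCache) ValueToLabel(v string) string {
	c.mtx.RLock()
	label, ok := c.cache[v]
	c.mtx.RUnlock()
	if ok {
		return label
	}
	// Hypothetical sanitization: lowercase, spaces to underscores.
	label = strings.ReplaceAll(strings.ToLower(v), " ", "_")
	c.mtx.Lock()
	c.cache[v] = label
	c.mtx.Unlock()
	return label
}

func main() {
	c := newLabelCache()
	fmt.Println(c.ValueToLabel("Block Part")) // computed: block_part
	fmt.Println(c.ValueToLabel("Block Part")) // cached: block_part
}
```

Once the per-peer caches stopped sharing labels, this memoization no longer bought anything, which is why it could be deleted.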

@melekes melekes added the needs-information Waiting for additional information or feedback label Jul 9, 2024
@melekes melekes requested a review from cason July 9, 2024 08:33

@cason cason left a comment

Great! I loved that this removes the private mlc parameter from Peer and other (public) components.

The only part I don't agree with is reporting metrics only every 10s:

const metricsTickerDuration = 10 * time.Second

whereas before we reported them on every send/receive call.

Should we reduce it to, say, 1s or 0.5s?

@melekes

This comment was marked as resolved.

@cason

cason commented Jul 10, 2024

The only part I don't agree with is reporting metrics only every 10s:

why? Prometheus default scrape interval is 1m https://prometheus.io/docs/prometheus/latest/configuration/configuration/

If you say so, great. Let's merge it.

Then, can we somehow compare the Prometheus output with and without this PR?

@ValarDragon
Contributor Author

I can try to see if I notice any difference in our prod RPCs' Grafana metrics. But as @melekes noted, it's not that fine-grained, so I'm not sure I'll notice anything.

@cason

cason commented Jul 22, 2024

I'll wait for @melekes to be back, but I am good with the current state of this PR.

@cason cason added metrics and removed needs-information Waiting for additional information or feedback labels Jul 22, 2024
@melekes
Collaborator

melekes commented Jul 26, 2024

The only part I don't agree with is reporting metrics only every 10s:

why? Prometheus default scrape interval is 1m https://prometheus.io/docs/prometheus/latest/configuration/configuration/

If you say so, great. Let's merge it.

Then, can we somehow compare the Prometheus output with and without this PR?

Please ignore my earlier comment. We lose granularity if our reporting interval (10s) is >= the Prometheus scrape interval. Let's lower it to 1s. We are spawning a goroutine now, so it shouldn't be a burden.

melekes added 3 commits July 26, 2024 16:39
We lose granularity if our reporting interval (10s) is >= the Prometheus
scrape interval. Let's lower it to 1s. We are spawning a goroutine now,
so it shouldn't be a burden.
@melekes melekes enabled auto-merge July 26, 2024 12:41

We have to remove line 44 added by another PR.

@melekes melekes added this pull request to the merge queue Jul 26, 2024
Merged via the queue into main with commit 94d42a9 Jul 26, 2024
@melekes melekes deleted the dev/reduce_p2p_metric_overhead branch July 26, 2024 13:33
mergify bot pushed a commit that referenced this pull request Jul 26, 2024
Co-authored-by: Daniel <daniel.cason@informal.systems>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
(cherry picked from commit 94d42a9)
melekes pushed a commit that referenced this pull request Jul 27, 2024
This is an automatic backport of pull request #3411 done by
[Mergify](https://mergify.com).

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
ValarDragon added a commit to osmosis-labs/cometbft that referenced this pull request Aug 19, 2024
ValarDragon added a commit to osmosis-labs/cometbft that referenced this pull request Aug 19, 2024

Successfully merging this pull request may close these issues.

Lower allocation overhead of metrics in peer.Send
