perf(p2p): Reduce the p2p metrics overhead #3411
Conversation
We do this by batching the calls to Prometheus (which have many allocations), and, importantly, by not blocking the send routine and recv routine. This is a 25% speedup to recvRoutine, and a 60% speedup to peer.Send.
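The batching idea above can be sketched as follows. This is a minimal illustration, not CometBFT's actual implementation: `metricsCache`, `Observe`, and `flushLoop` are hypothetical names, and `promCounter` stands in for a real `prometheus.Counter`. The hot path only does an atomic add; a background goroutine periodically drains the accumulated value into the (allocation-heavy) metrics backend.

```go
package main

import (
	"sync/atomic"
	"time"
)

// promCounter stands in for a Prometheus counter; the real
// prometheus.Counter.Add performs allocations we want off the hot path.
type promCounter struct{ total float64 }

func (c *promCounter) Add(v float64) { c.total += v }

// metricsCache batches hot-path updates so the send/recv routines
// only pay for an atomic add, not a full Prometheus call.
type metricsCache struct {
	pendingBytes atomic.Int64
	counter      *promCounter
}

// Observe is what the send/recv hot path calls: one atomic add, no blocking.
func (m *metricsCache) Observe(n int) { m.pendingBytes.Add(int64(n)) }

// flushLoop runs in its own goroutine and periodically drains the
// cache into the underlying counter.
func (m *metricsCache) flushLoop(interval time.Duration, done <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			if n := m.pendingBytes.Swap(0); n > 0 {
				m.counter.Add(float64(n))
			}
		case <-done:
			return
		}
	}
}
```

The `Swap(0)` drain means each observed byte is counted exactly once even while the hot path keeps adding concurrently.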
This was a 60% reduction in GC time 👀 in one benchmark, so this actually sped up the net CPU profile time by 30% for Osmosis, based on two 1-hour profiles. (Note that Osmosis has already eliminated many other GC contributors, e.g. debug logs, plus my other open PRs. On main Comet branches, GC will be dominated by comet debug logs.) EDIT: I am worried there is noise in the GC reduction (e.g. something else in gossip also got lower, since this seems too good on closer inspection). It seems like my baseline for the old GC time may have come from blocks that were much slower than normal.
How did you get these performance improvement numbers? Because they are impressive. I probably need a way to test connections in isolation to experimentally check the impact of these changes.
A second point to make, which has worried me for a while, is the fact that the metric caches are private types. This is not exactly a problem, as they are internal to this package, but they are provided in some constructors (e.g., …). In summary, is there any way to make the metrics cache types part of the public API?
cason
left a comment
Some general comments regarding the implementation.
For me the only blocker is that we are now updating these stats every 10 seconds, instead of on every send/receive call. Should we reconsider this granularity?
A more general comment: how confident are we that we need these metrics, if they are costing us so much CPU?
I actually found them really helpful for debugging bandwidth. The metric is still a cost after this PR, but I need to reprofile after cason's point that I'm copying the struct here.
remove ValueToMetricLabel
@ValarDragon I've removed `ValueToMetricLabel`.
cason
left a comment
Great! I loved removing the `mlc` private parameter from `Peer` and other (public) components.
The only part I don't agree with is reporting metrics only every 10s (Line 20 in f503f25), whereas before we reported them on every send/receive call.
Should we reduce it to, say, 1s or 0.5s?
If you say so, great. Let's merge it. Then, can we somehow compare the Prometheus output with and without this PR?
I can try to see if I notice any difference in our prod RPCs' Grafana metrics. But as @melekes noted, it's not that fine-grained, so I'm not sure I'll notice anything.
I'll wait for @melekes to be back, but I am good with the current state of this PR.
Please ignore my comment. We lose granularity if our update interval (10s) >= the Prometheus scrape interval. Let's lower it to 1s. We are spawning the goroutine now, so it shouldn't be a burden.
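The granularity concern can be made concrete with a bit of arithmetic: if the cache flushes every 10s while Prometheus scrapes every 1s, only about 1 in 10 scrapes observes fresh data. A toy sketch (hypothetical interval values, not taken from the PR):

```go
package main

import "fmt"

// scrapesPerFlush returns how many Prometheus scrapes elapse between
// consecutive cache flushes; only one of them can see newly flushed data.
// Intervals are in seconds.
func scrapesPerFlush(flushInterval, scrapeInterval int) int {
	return flushInterval / scrapeInterval
}

func main() {
	// With a 10s flush and a 1s scrape, ~9 out of 10 scrapes see stale values.
	fmt.Printf("only ~1 in %d scrapes sees new data\n", scrapesPerFlush(10, 1))
	// Lowering the flush interval to 1s restores per-scrape freshness.
	fmt.Printf("only ~1 in %d scrapes sees new data\n", scrapesPerFlush(1, 1))
}
```

This is why lowering the flush interval to match (or beat) the scrape interval recovers the original resolution while keeping the batching win on the hot path.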
We have to remove line 44 added by another PR.
Closes #2840

We do this by batching the calls to Prometheus (which have many allocations), and, importantly, by not blocking the send routine and recv routine. This is a 25% speedup to recvRoutine, and a 60% speedup to peer.Send. By napkin estimate, this also saves 8% of newObject call time, which hopefully also means over 8% of the GC overhead.

---

#### PR checklist

- [ ] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments
- [ ] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Daniel <daniel.cason@informal.systems>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
(cherry picked from commit 94d42a9)
This is an automatic backport of pull request #3411 done by [Mergify](https://mergify.com).

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>