Skip to content

feat(p2p): pre-allocate static labels used to produce metrics#2851

Merged
cason merged 10 commits intomainfrom
cason/2840-chID-labels
Apr 23, 2024
Merged

feat(p2p): pre-allocate static labels used to produce metrics#2851
cason merged 10 commits intomainfrom
cason/2840-chID-labels

Conversation

@cason
Copy link

@cason cason commented Apr 18, 2024

Addresses #2840.

The same approach taken for the send() method is also adopted for the onReceive() method.


PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments
  • Title follows the Conventional Commits spec

@cason cason linked an issue Apr 18, 2024 that may be closed by this pull request
@cason cason marked this pull request as ready for review April 18, 2024 16:02
@cason cason requested a review from a team as a code owner April 18, 2024 16:02
@cason cason requested a review from a team April 18, 2024 16:02
@cason cason self-assigned this Apr 18, 2024
@cason cason added backport-to-v0.34.x backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x labels Apr 18, 2024
@cason cason changed the title feat(p2p): pre-allocate channel ID labels for metrics feat(p2p): pre-allocate static labels used to produce metrics Apr 18, 2024
@cason cason marked this pull request as draft April 18, 2024 16:07
@cason cason marked this pull request as ready for review April 18, 2024 16:37
} else if !p.hasChannel(chID) {
return false
}
metricLabelValue := p.mlc.ValueToMetricLabel(msg)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether there was a reason for calling this here, as msg can be changed in line 276...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it breaks everything. Restored.

@@ -43,6 +43,19 @@ type Metrics struct {
type metricsLabelCache struct {
mtx *sync.RWMutex
messageLabelNames map[reflect.Type]string
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably also pre-allocate this guy, which would allow us for getting rid of the lock.

Copy link
Collaborator

@andynog andynog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a comment around metric allocation, I don't understand the metric package well but just looking at the dependency I think there might still some problems with allocations.

}
p.metrics.PeerReceiveBytesTotal.With(labels...).Add(float64(len(msgBytes)))
p.metrics.MessageReceiveBytesTotal.With("message_type", p.mlc.ValueToMetricLabel(msg)).Add(float64(len(msgBytes)))
p.metrics.PeerReceiveBytesTotal.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the go-kit/metrics dependency, using the With factory I believe might allocate a Counter (at least a pointer) for every combination of peer ID and channel ID on a received message ? If so maybe the logic to cache should be not on labels but on the Counter (pair or peer ID and chID) and then you just call the Add in the receive msg routine ?
https://github.com/go-kit/kit/blob/78fbbceece7bbcf073bee814a7772f4397ea756c/metrics/prometheus/prometheus.go#L36

Copy link
Contributor

@ValarDragon ValarDragon Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my understanding of the bottleneck here as well. (That Allocations come from the With call, and thats what we want to cache)

Copy link
Contributor

@ValarDragon ValarDragon Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just got a recent CPU profile for a synced full node for 1 hour, the With call is definitely what needs to be cached:
image

image

(Ideally we can remove the Add call's makeLabel work subsequently as well, but tbh I don't understand that)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the insights @ValarDragon

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need to cache the metrics.Counter instances. The problem is the cardinality.

We have 22 message types, if I am not wrong, and 10 channels. With standard configuration, we can have from 10 to 50 peers. Ok, at the end is not that impossible.

@cason
Copy link
Author

cason commented Apr 19, 2024

I don't understand the metric package well but just looking at the dependency I think there might still some problems with allocations.

The problem is that this is very hard to evaluate without running a bunch of nodes...

@ValarDragon
Copy link
Contributor

Can help with getting more mainnet profiles!

Copy link
Collaborator

@melekes melekes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@cason
Copy link
Author

cason commented Apr 19, 2024

Maybe we can merge this, while it is pretty clear that it does not solve the problem...

@ValarDragon
Copy link
Contributor

That sgtm! These sprintf calls were avoiding do show up in cpu profiles so absolutely helpful to get this eliminated!

@ValarDragon
Copy link
Contributor

ValarDragon commented Apr 20, 2024

CPU benchmark for the curious:
image

This is on a caught up Osmosis full node for 1 hr

(6.25s were the sprintf call here thats getting eliminated, 1s in the string(peer.ID()) call)

@cason
Copy link
Author

cason commented Apr 22, 2024

Not sure about backporting this code though.

Copy link
Collaborator

@andynog andynog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, I believe if it doesn't fully resolve the problems related to With and Add it does change the logic to pre-allocate of the Sprintf. Let's merge this @cason and ask Dev to test. Then we can decide to backport it.

@cason cason removed backport-to-v0.37.x backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x labels Apr 23, 2024
@cason cason added this pull request to the merge queue Apr 23, 2024
Merged via the queue into main with commit 7673d74 Apr 23, 2024
@cason cason deleted the cason/2840-chID-labels branch April 23, 2024 15:43
mergify bot pushed a commit that referenced this pull request Apr 23, 2024
Addresses #2840.

The same approach taken for the `send()` method is also adopted for the
`onReceive()` method.

---

#### PR checklist

- [ ] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

(cherry picked from commit 7673d74)
melekes added a commit that referenced this pull request Apr 24, 2024
#2851) (#2877)

Addresses #2840.

The same approach taken for the `send()` method is also adopted for the
`onReceive()` method.

---

#### PR checklist

- [ ] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #2851 done by
[Mergify](https://mergify.com).

Co-authored-by: Daniel <daniel.cason@informal.systems>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
github-merge-queue bot pushed a commit that referenced this pull request May 2, 2024
…ng (#2984)

In the metric `MessageReceiveBytesTotal`, the only value produced in the
`message_type` label is `v1_Message`. This is because we are reading the
`msg` variable before the message is unwrapped. This bug was recently
introduced in #2851.

---

#### PR checklist

- [ ] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [ ] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
mergify bot pushed a commit that referenced this pull request May 2, 2024
…ng (#2984)

In the metric `MessageReceiveBytesTotal`, the only value produced in the
`message_type` label is `v1_Message`. This is because we are reading the
`msg` variable before the message is unwrapped. This bug was recently
introduced in #2851.

---

#### PR checklist

- [ ] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [ ] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

(cherry picked from commit 33729e6)
hvanz added a commit that referenced this pull request May 3, 2024
…ng (backport #2984) (#2985)

In the metric `MessageReceiveBytesTotal`, the only value produced in the
`message_type` label is `v1_Message`. This is because we are reading the
`msg` variable before the message is unwrapped. This bug was recently
introduced in #2851.

---

#### PR checklist

- [ ] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [ ] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #2984 done by
[Mergify](https://mergify.com).

Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Lower allocation overhead of metrics in peer.Send

5 participants