Conversation
eff1c4b to
acbc048
Compare
|
Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3748/docs/iroh/ Last updated: 2026-01-08T15:51:41Z |
ca85a9f to
f6c5e49
Compare
0a4ba70 to
fed03ef
Compare
|
Help, I find this very hard to think about. What do we really want to measure here? I don't have a good answer to this so find commenting on the PR hard. @Arqu maybe has input on what we really need to have? Things that are easy to measure:
Things that are hard(er) to measure:
My inclination is to concentrate on the easy ones, but do you learn anything useful from those? You can't do any maths to tell you how good your holepunching is I think, so what's the point even? This goes back to, what are we trying to measure? |
iroh/src/magicsock/metrics.rs
Outdated
| /// The number of NAT traversal attempts initiated. | ||
| pub nat_traversal: Counter, | ||
| /// The number of remote endpoints for which of NAT traversal was initiated. | ||
| pub remote_holepunch_attempts: Counter, |
There was a problem hiding this comment.
The "remote_" prefix makes me think this was initiated by the remote.
Essentially the metrics you added are:
- remotes connected to ("remotes_connection_established"?)
- remotes holepunched ("remotes_connection_direct"?)
IIUC. This could be useful I guess. Not sure, it's also a bit weird (see separate comment).
| && hp.remote_candidates.contains(ip_addr) | ||
| { | ||
| self.has_holepunched = true; | ||
| self.metrics.remote_holepunch_success.inc(); |
There was a problem hiding this comment.
One issue with this impl is that this metric would only be incremented on the client, and not the server. But that's a bit counter-intuitive from the user point of view.
|
So I think the original intention was that you could have two metrics like:
And now you can do maths on them because you can compute the number of connections that did not holepunch and have a holepunching rate. But these metrics fall under the "hard" metrics 😞 I'm not sure how to do them. |
|
I dropped the ball a bit here. Think @flub is on the right path here. That said I think it's fine that this is client centric and whether my endpoint managed to holepunch. Still might be slightly counter intuitive as it basically tells you more about the state of the other side network config. Ideally we would be aware of a hole punching success towards ourselves too. The 2nd bit and more crucially, I don't think this should work to establish a network wide holepunch rate. You can fumble some maths and see some grouped averages or distributions between clients and their holepunch rates but nothing definitive. I intend to do a follow up PR that does a similar thing but from the Relay PoV which should help make more informed guesstimates on the network level. Still pretty hard though as we are operating with very little info. For this PR I'd focus on getting the "easy" ones @flub listed as they give a lot of flexibility to do math and explore the state. However none of these counters I would limit to a 0-1 state but just let them increment naturally on what we deem the relevant event. It's easy enough to do the subtractions to get the actual state from it. |
And count a few other fun things for fun.
|
LGTM, let's merge this! (Can't GitHub approve because it's my PR) We can iterate further on this until 1.0 but this gives us a baseline to try things out, so let's get this in. |
Description
This adds holepunching metrics, a few are because they are easy and give some insight. The main metrics are:
num_conns_opened: an existing metric counting the number of connections opened.num_conns_direct: a new metric counting the number of connections which have ever been direct.Together those do not give a direct count of the number of successful holepunches. But does give an accurate count of number of failed holepunches.
The
holepunch_attemptsmetric is much more of an internal value that does not give direct information.There are two more metrics:
paths_direct: number of paths that are directpaths_relay: number of paths that are relayedIf
paths_direct_total < paths_relay_totalthen you have some failed holepunching. The number will not be as accurate as usingnum_conns_opened_totalandnum_conns_direct_totalhowever and is also more indicative of the total number of paths involved.Fixes #3695
Breaking Changes
Some metrics are removed, but these have not been released.
Notes & open questions
One could also argue that you could count the "total" and "direct" remotes. This gives you a less-inflated number on how successful hoplepunching is. Though we really give an indication of the number of failed holepunches instead, so the over-inflation doesn't happen really.
The main reason I consider per-remote counting worse however is that we forget all state about a remote when a
RemoteStateActoris stopped. This can then later come back and again result in wrong numbers.Change checklist