Skip to content

feat: basic holepunch metrics#3748

Merged
ramfox merged 11 commits intomainfrom
Frando/mp-metrics-hp
Jan 8, 2026
Merged

feat: basic holepunch metrics#3748
ramfox merged 11 commits intomainfrom
Frando/mp-metrics-hp

Conversation

@Frando
Copy link
Copy Markdown
Member

@Frando Frando commented Dec 8, 2025

Description

This adds holepunching metrics, a few are because they are easy and give some insight. The main metrics are:

  • num_conns_opened: an existing metric counting the number of connections opened.
  • num_conns_direct: a new metric counting the number of connections which have ever been direct.

Together those do not give a direct count of the number of successful holepunches. But does give an accurate count of number of failed holepunches.

The holepunch_attempts metric is much more of an internal value that does not give direct information.

There are two more metrics:

  • paths_direct: number of paths that are direct
  • paths_relay: number of paths that are relayed

If paths_direct_total < paths_relay_total then you have some failed holepunching. The number will not be as accurate as using num_conns_opened_total and num_conns_direct_total however and is also more indicative of the total number of paths involved.

Fixes #3695

Breaking Changes

Some metrics are removed, but these have not been released.

Notes & open questions

One could also argue that you could count the "total" and "direct" remotes. This gives you a less-inflated number on how successful hoplepunching is. Though we really give an indication of the number of failed holepunches instead, so the over-inflation doesn't happen really.

The main reason I consider per-remote counting worse however is that we forget all state about a remote when a RemoteStateActor is stopped. This can then later come back and again result in wrong numbers.

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.
    • List all breaking changes in the above "Breaking Changes" section.

@Frando Frando force-pushed the Frando/mp-metrics-hp branch from eff1c4b to acbc048 Compare December 8, 2025 13:35
@github-actions
Copy link
Copy Markdown

github-actions bot commented Dec 8, 2025

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3748/docs/iroh/

Last updated: 2026-01-08T15:51:41Z

@n0bot n0bot bot added this to iroh Dec 8, 2025
@github-project-automation github-project-automation bot moved this to 🏗 In progress in iroh Dec 8, 2025
@Frando Frando force-pushed the Frando/mp-metrics-hp branch from ca85a9f to f6c5e49 Compare December 9, 2025 11:33
@Frando Frando marked this pull request as ready for review December 9, 2025 13:57
@Frando Frando changed the title draft: holepunch metrics feat: holepunch metrics Dec 9, 2025
@Frando Frando changed the title feat: holepunch metrics feat: basic holepunch metrics Dec 9, 2025
Base automatically changed from Frando/mp-metrics-basics to feat-multipath December 9, 2025 21:38
@Frando Frando force-pushed the Frando/mp-metrics-hp branch from 0a4ba70 to fed03ef Compare December 10, 2025 12:18
@Frando Frando requested a review from flub December 10, 2025 12:23
@flub
Copy link
Copy Markdown
Contributor

flub commented Dec 10, 2025

Help, I find this very hard to think about. What do we really want to measure here? I don't have a good answer to this so find commenting on the PR hard. @Arqu maybe has input on what we really need to have?

Things that are easy to measure:

  • number of times we start holepuching
  • number of times we explicitly open a path without holepunching (e.g. because it was holepunched in another connection).
  • number of times a path was opened for any reason (e.g. on the server side or because of holepunching or because of explicit open_path).
  • number of times we abandon a path.
  • number of times a path is abandoned.

Things that are hard(er) to measure:

  • number of times a path is opened from nat traversal (essentially shown in this PR currently though you also would have to be sure it wasn't from an explict open_path call which is not easy right now)
  • number of successful holepunches
  • number of failed holepunches

My inclination is to concentrate on the easy ones, but do you learn anything useful from those? You can't do any maths to tell you how good your holepunching is I think, so what's the point even? This goes back to, what are we trying to measure?

/// The number of NAT traversal attempts initiated.
pub nat_traversal: Counter,
/// The number of remote endpoints for which of NAT traversal was initiated.
pub remote_holepunch_attempts: Counter,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "remote_" prefix makes me think this was initiated by the remote.

Essentially the metrics you added are:

  • remotes connected to ("remotes_connection_established"?)
  • remotes holepunched ("remotes_connection_direct"?)

IIUC. This could be useful I guess. Not sure, it's also a bit weird (see separate comment).

&& hp.remote_candidates.contains(ip_addr)
{
self.has_holepunched = true;
self.metrics.remote_holepunch_success.inc();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue with this impl is that this metric would only be incremented on the client, and not the server. But that's a bit counter-intuitive from the user point of view.

@flub
Copy link
Copy Markdown
Contributor

flub commented Dec 10, 2025

So I think the original intention was that you could have two metrics like:

  • no of connections that were on relay and had holepunch attempts
  • no of connections that moved from relay to direct after holepunching

And now you can do maths on them because you can compute the number of connections that did not holepunch and have a holepunching rate. But these metrics fall under the "hard" metrics 😞 I'm not sure how to do them.

@Arqu
Copy link
Copy Markdown
Collaborator

Arqu commented Dec 16, 2025

I dropped the ball a bit here. Think @flub is on the right path here.
I guess a number of things are here in play. Main question is what and for whom we're measuring things here. Iroh side metrics should generally serve the user and should be concerned with the performance of "their" node respectively.

That said I think it's fine that this is client centric and whether my endpoint managed to holepunch. Still might be slightly counter intuitive as it basically tells you more about the state of the other side network config. Ideally we would be aware of a hole punching success towards ourselves too.

The 2nd bit and more crucially, I don't think this should work to establish a network wide holepunch rate. You can fumble some maths and see some grouped averages or distributions between clients and their holepunch rates but nothing definitive. I intend to do a follow up PR that does a similar thing but from the Relay PoV which should help make more informed guesstimates on the network level. Still pretty hard though as we are operating with very little info.

For this PR I'd focus on getting the "easy" ones @flub listed as they give a lot of flexibility to do math and explore the state. However none of these counters I would limit to a 0-1 state but just let them increment naturally on what we deem the relevant event. It's easy enough to do the subtractions to get the actual state from it.

@flub flub added this to the iroh: v1.0.0-rc.0 milestone Dec 22, 2025
@flub flub modified the milestones: iroh: v1.0.0-rc.0, iroh: v0.96 Jan 7, 2026
@flub flub self-assigned this Jan 7, 2026
@flub flub changed the base branch from feat-multipath to main January 7, 2026 18:42
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 7, 2026

Netsim report & logs for this PR have been generated and is available at: LOGS
This report will remain available for 3 days.

Last updated for commit: ce45168

@flub flub requested a review from Arqu January 7, 2026 19:08
@Frando
Copy link
Copy Markdown
Member Author

Frando commented Jan 8, 2026

LGTM, let's merge this! (Can't GitHub approve because it's my PR)

We can iterate further on this until 1.0 but this gives us a baseline to try things out, so let's get this in.

@flub flub moved this from 🏗 In progress to 👀 In review in iroh Jan 8, 2026
Copy link
Copy Markdown
Contributor

@flub flub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(channelling @Frando 's LGTM)

@ramfox ramfox added this pull request to the merge queue Jan 8, 2026
Merged via the queue into main with commit 37bda14 Jan 8, 2026
28 of 29 checks passed
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in iroh Jan 8, 2026
@flub flub deleted the Frando/mp-metrics-hp branch January 8, 2026 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

Add holepunching metrics

4 participants