Skip to content

refactor(iroh, iroh-relay)!: Relay should not kill old connections when same endpoint id connects#3921

Merged
Frando merged 25 commits intomainfrom
Frando/refactor-relay-close
Feb 20, 2026
Merged

refactor(iroh, iroh-relay)!: Relay should not kill old connections when same endpoint id connects#3921
Frando merged 25 commits intomainfrom
Frando/refactor-relay-close

Conversation

@Frando
Copy link
Copy Markdown
Member

@Frando Frando commented Feb 9, 2026

Description

When the iroh relay receives a connection from an endpoint to which it already has a connection, it currently drops the old one immediately. The old endpoint did not receive any indication as to why its connection was dropped, so it would reconnect again right away. The connection would succeed: Now it was its turn to be accepted, and the other endpoint would be dropped from the relay. This back-and-forth loop would continue infinitely.

This PR changes this in the following way:

  • When a relay server receives a connection from an endpoint it for which it already has a connection, the new connection becomes the target for all future messages.
  • The old connection however is not dropped. Instead, it is continued until the client stops the connection. The client can still send messages, but it won't receive any further messages.
  • When a connection is moved into this "inactive" state, the server sends a final Health message to the client, with a text string informing the client that it is now inactive because another endpoint connected. In iroh, we warn! log this message.
  • When an active client disconnects, and there is an inactive client, the lastest inactive client will be promoted to active again

Breaking Changes

  • iroh_relay::server::client::RunError variants changed

Notes & open questions

  • I'd prefer to add a typed frame for the Health issue. However, adding new frames to the relay protocol currently will break all older clients. Therefore, this uses the stringly health frame for now.
  • We might want to discuss further changes, i.e. if the "old" endpoint should maybe do something more heavy than just logging a warn message. It will be in a weird state onwards: It still can send messages to other endpoints over the relay, but will never receive responses. This means that all connection attempts that go over the relay will time out.

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.
    • List all breaking changes in the above "Breaking Changes" section.
    • Open an issue or PR on any number0 repos that are affected by this breaking change. Give guidance on how the updates should be handled or do the actual updates themselves. The major ones are:

Fixes #3813

@Frando Frando force-pushed the Frando/refactor-relay-close branch from 4d33c2d to e15cd96 Compare February 9, 2026 13:57
@Frando Frando force-pushed the Frando/refactor-relay-close branch from 333e2ac to 6d80b55 Compare February 9, 2026 13:59
@github-actions
Copy link
Copy Markdown

github-actions bot commented Feb 9, 2026

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3921/docs/iroh/

Last updated: 2026-02-20T12:16:58Z

@n0bot n0bot bot added this to iroh Feb 9, 2026
@github-project-automation github-project-automation bot moved this to 🚑 Needs Triage in iroh Feb 9, 2026
@dignifiedquire dignifiedquire moved this from 🚑 Needs Triage to 👀 In review in iroh Feb 9, 2026
@Frando Frando requested a review from ramfox February 11, 2026 12:20
@Frando Frando changed the title refactor(iroh, iroh-relay): Add explicit close frame and better deal with disconnects due to same endpoint id refactor(iroh, iroh-relay): Relay should not kill old connections when same endpoint id connects Feb 11, 2026
@Frando Frando changed the title refactor(iroh, iroh-relay): Relay should not kill old connections when same endpoint id connects refactor(iroh, iroh-relay)!: Relay should not kill old connections when same endpoint id connects Feb 12, 2026

// Wait a bit. `online` does not mean that the connection to the home relay was *established*,
// only that the home relay was *chosen* based on the net report probes.
tokio::time::sleep(Duration::from_millis(500)).await;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this screams flakiness, can we detect this differently?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, no. Maybe online should only resolve once we're actually connected to the home relay?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, potentially that would make sense

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a change to the test that uses a log assertion instead of a timeout, so this can't flake anymore.

I'd like to leave adding a proper API for this to a followup though as it really is unrelated to the change in this PR.

// We assert that we get the warn log once for endpoint 1, and not at all for endpoint 2.
logs_assert(|logs| {
let expected_line = |line: &str| {
line.contains("WARN") && line.contains("Another endpoint connected with the same endpoint id. No more messages will be received")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we check this differently than log assertions? :/

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, no. There's nothing on the endpoint to check relay status. Maybe there should be.

@ramfox ramfox added this to the iroh: v0.97 milestone Feb 17, 2026
@Frando Frando force-pushed the Frando/refactor-relay-close branch from b420ff2 to 7cf0ed9 Compare February 18, 2026 09:48
Copy link
Copy Markdown
Member

@ramfox ramfox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! especially with the proposed follow ups.

@Frando Frando enabled auto-merge February 20, 2026 12:16
@github-actions
Copy link
Copy Markdown

Netsim report & logs for this PR have been generated and is available at: LOGS
This report will remain available for 3 days.

Last updated for commit: f8406c1

@Frando Frando added this pull request to the merge queue Feb 20, 2026
Merged via the queue into main with commit 1b4ee2a Feb 20, 2026
51 of 54 checks passed
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in iroh Feb 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

Detect shared secret key usage

3 participants