feat(relay): expose circuit-v2 relay-service resource knobs#1031

Merged
mudler merged 2 commits into master from feat/relay-service-tuning
May 31, 2026
@mudler mudler commented May 31, 2026

Owner

EdgeVPN previously never enabled libp2p's circuit-v2 relay service (only the relay client via DefaultEnableRelay). That meant publicly reachable cluster peers refused to carry relayed traffic for NAT-traversed peers that failed to DCUtR hole-punch (QEMU slirp, CGNAT, double-NAT).

This change unconditionally enables libp2p.EnableRelayService and exposes the relayv2.Resources tunables as CLI flags / env vars so operators can widen the small libp2p defaults (128KB/2min/16 circuits/2048 buffer) when they need cluster peers to relay bulk transfers (e.g. model files for distributed inference):

  --relay-service-max-data         EDGEVPN_RELAY_MAX_DATA          1 GiB
  --relay-service-max-duration     EDGEVPN_RELAY_MAX_DURATION      30m
  --relay-service-max-circuits     EDGEVPN_RELAY_MAX_CIRCUITS      64
  --relay-service-reservation-ttl  EDGEVPN_RELAY_RESERVATION_TTL   1h
  --relay-service-buffer-size      EDGEVPN_RELAY_BUFFER_SIZE       64 KiB

mudler added 2 commits May 31, 2026 08:28
…fering

When operators don't want a node to act as a circuit-v2 relay for
others (resource-constrained edge nodes, untrusted environments,
deployments where only a few designated nodes should relay), set
--relay-service=false / EDGEVPN_RELAY_SERVICE=false / programmatic
Connection.RelayService.Disabled=true. The node still runs as a relay
client (can reserve slots on OTHER relays via AutoRelay) — only the
incoming-reservation service is skipped.

The struct field is named Disabled (not Enabled) so the Go zero value
preserves the prior "always offer relay service" behaviour for
programmatic callers constructing &config.Config{} directly.

Adds TestRelayServiceDisabledSkipsLibp2pOption which asserts that
ToOpts produces strictly fewer node options with Disabled=true (the
libp2p.EnableRelayService wrapper disappears) and that both variants
still produce a constructible Node.
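
The zero-value argument can be sketched without libp2p. Everything below is illustrative: `Config`, `RelayService` and `Options` are simplified stand-ins for the real types, and option names are plain strings rather than libp2p.Option values.

```go
package main

// RelayService holds the opt-out switch. The field is Disabled rather than
// Enabled so that the Go zero value (false) keeps the relay service on.
type RelayService struct {
	Disabled bool
}

// Config is a hypothetical, trimmed-down stand-in for the node config.
type Config struct {
	RelayService RelayService
}

// Options lists the relay-related options this config would emit.
// Strings stand in for libp2p.Option values in this sketch.
func (c Config) Options() []string {
	// The relay client is always enabled; only the service is optional.
	opts := []string{"EnableRelay"}
	if !c.RelayService.Disabled {
		opts = append(opts, "EnableRelayService")
	}
	return opts
}
```

A caller that builds `Config{}` without touching the new field gets both options, while setting `Disabled: true` yields strictly fewer, which is the property the test above asserts against the real option list.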

A follow-up will add a NetworkOnly mode (relay-service ACL gated on
ledger membership) so cluster relays don't service random internet
peers that found us via DHT.

Assisted-by: Claude:claude-opus-4-7
@mudler mudler force-pushed the feat/relay-service-tuning branch from ae06b87 to 7cc5c3a Compare May 31, 2026 09:30
@mudler mudler merged commit c9bdd0f into master May 31, 2026
3 checks passed
mudler added a commit that referenced this pull request May 31, 2026
Follow-up to #1031. With the relay service offered by every edgevpn
node, anyone on the public DHT who finds us can reserve a slot and
get us to carry their traffic. That's fine for a permissive overlay,
but in many deployments — especially private clusters — only network
members should be allowed to use us as a relay.

Adds a relayv2.ACLFilter (NetworkOnlyACL) that consults the local
ledger's alive bucket. The cadence is driven by a NetworkService that
periodically snapshots the bucket into the ACL via an atomic.Pointer
swap; AllowReserve checks reduce to a constant-time map lookup.

Bootstrap window:
  A peer joining for the first time is not yet in our alive bucket
  (it needs to join gossipsub to write to it). If we strict-gated
  from t=0 a new peer could deadlock trying to reserve its way in.
  The ACL therefore allows ALL reservations until the first
  successful alive-bucket snapshot, then switches to strict mode.
  In practice the window is at most one refresh tick (default 30s),
  and the alive service starts gossiping its own host ID immediately
  on startup so the first refresh almost always finds at least the
  local node.

AllowConnect is left permissive (return true): we gate the
reservation step, not in-flight relayed sessions, so a peer's
existing tunnel doesn't get yanked if the alive bucket briefly
flickers.
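
A minimal sketch of the state machine described above, assuming a simplified interface: the real relayv2.ACLFilter methods also take multiaddrs and use peer.ID, whereas this version uses a hypothetical string `PeerID` so it runs with the standard library alone. It also assumes that any call to `Members`, even with an empty set, ends the bootstrap window.

```go
package main

import "sync/atomic"

// PeerID is a stand-in for libp2p's peer.ID in this sketch.
type PeerID string

// NetworkOnlyACL admits every reservation until the first membership
// snapshot arrives, then admits only peers present in the latest snapshot.
type NetworkOnlyACL struct {
	// members is nil during the bootstrap window.
	members atomic.Pointer[map[PeerID]struct{}]
}

// Members publishes a new snapshot. The map is copied so that later
// mutation by the caller cannot race with concurrent readers.
func (a *NetworkOnlyACL) Members(set map[PeerID]struct{}) {
	cp := make(map[PeerID]struct{}, len(set))
	for id := range set {
		cp[id] = struct{}{}
	}
	a.members.Store(&cp)
}

// AllowReserve gates the reservation step with a single map lookup.
func (a *NetworkOnlyACL) AllowReserve(p PeerID) bool {
	m := a.members.Load()
	if m == nil {
		return true // bootstrap window: no snapshot yet
	}
	_, ok := (*m)[p]
	return ok
}

// AllowConnect stays permissive so existing relayed sessions survive a
// brief membership flicker.
func (a *NetworkOnlyACL) AllowConnect(src, dest PeerID) bool {
	return true
}
```

Because readers only ever load a pointer to an immutable map, no lock is needed on the reservation path.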

Knobs:
  --relay-service-network-only / EDGEVPN_RELAY_SERVICE_NETWORK_ONLY
    (bool, default false — opt-in for now until operators have a
    chance to verify the bootstrap behaviour against their topology).
  --relay-service-acl-refresh / EDGEVPN_RELAY_SERVICE_ACL_REFRESH
    (duration, default 30s — should be ≤ the alive-service announce
    interval).

Tests:
- TestNetworkOnlyACLBootstrapWindowAllowsAll — open until first Members
- TestNetworkOnlyACLMembersGates — strict mode rejects strangers
- TestNetworkOnlyACLAllowConnect — Connect always permitted
- TestNetworkOnlyACLMembersIsDefensivelyCopied — caller-map mutation
  after Members() does not race the ACL's strict-mode reads

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
mudler added a commit that referenced this pull request May 31, 2026
Follow-up to #1031. With the relay service offered by every edgevpn
node, anyone on the public DHT who finds us can reserve a slot and
get us to carry their traffic. In a private cluster only network
members should be able to use us as a relay.

Adds a relayv2.ACLFilter (NetworkOnlyACL) that consults the local
ledger's alive bucket. A NetworkService periodically snapshots the
bucket into the ACL via an atomic.Pointer swap; AllowReserve checks
reduce to a constant-time map lookup.

Bootstrap window:
  A peer joining for the first time is not yet in our alive bucket
  (it needs to join gossipsub to write to it). If we strict-gated
  from t=0 a new peer could deadlock trying to reserve its way in.
  The ACL therefore allows ALL reservations until the first
  successful alive-bucket snapshot, then switches to strict mode.

AllowConnect is left permissive (return true): we gate the
reservation step, not in-flight relayed sessions, so a peer's
existing tunnel doesn't get yanked if the alive bucket briefly
flickers.

Knobs:
  --relay-service-network-only / EDGEVPN_RELAY_SERVICE_NETWORK_ONLY
    bool, default TRUE — secure by default. Pass =false to open
    the relay to all peers.
  --relay-service-acl-refresh / EDGEVPN_RELAY_SERVICE_ACL_REFRESH
    duration, default 30s — should be <= the alive-service announce
    interval so churn is reflected within a couple of ticks.

Tests (Ginkgo, package config_test):
- bootstrap window admits any peer until the first Members call
- strict mode admits members listed in the set
- strict mode rejects non-members
- AllowConnect stays permissive regardless of membership
- Members defensively copies the caller's map (caller mutation
  after handover must not race readers)

The ACL lives in pkg/config alongside the existing relay-service
plumbing so pkg/node doesn't grow a relayv2 dependency. Wired via
the existing node.WithNetworkService pattern — same shape as the
alive service itself.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
mudler added a commit that referenced this pull request May 31, 2026
#1032)

* feat(relay): NetworkOnly ACL — gate reservations on cluster membership

Follow-up to #1031. With the relay service offered by every edgevpn
node, anyone on the public DHT who finds us can reserve a slot and
get us to carry their traffic. In a private cluster only network
members should be able to use us as a relay.

Adds a relayv2.ACLFilter (NetworkOnlyACL) that consults the local
ledger's alive bucket. A NetworkService periodically snapshots the
bucket into the ACL via an atomic.Pointer swap; AllowReserve checks
reduce to a constant-time map lookup.

Bootstrap window:
  A peer joining for the first time is not yet in our alive bucket
  (it needs to join gossipsub to write to it). If we strict-gated
  from t=0 a new peer could deadlock trying to reserve its way in.
  The ACL therefore allows ALL reservations until the first
  successful alive-bucket snapshot, then switches to strict mode.

AllowConnect is left permissive (return true): we gate the
reservation step, not in-flight relayed sessions, so a peer's
existing tunnel doesn't get yanked if the alive bucket briefly
flickers.

Knobs:
  --relay-service-network-only / EDGEVPN_RELAY_SERVICE_NETWORK_ONLY
    bool, default TRUE — secure by default. Pass =false to open
    the relay to all peers.
  --relay-service-acl-refresh / EDGEVPN_RELAY_SERVICE_ACL_REFRESH
    duration, default 30s — should be <= the alive-service announce
    interval so churn is reflected within a couple of ticks.

Tests (Ginkgo, package config_test):
- bootstrap window admits any peer until the first Members call
- strict mode admits members listed in the set
- strict mode rejects non-members
- AllowConnect stays permissive regardless of membership
- Members defensively copies the caller's map

The ACL lives in pkg/config alongside the existing relay-service
plumbing so pkg/node doesn't grow a relayv2 dependency. Wired via
the existing node.WithNetworkService pattern.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(relay-acl): e2e reservation handshake against a real libp2p relay

The state-machine specs cover the ACL in isolation. These put it
inside a real circuit-v2 relay and exercise the actual reservation
handshake — the contract the NetworkService relies on in production.

Four scenarios:
- bootstrap window: fresh ACL accepts any peer (client.Reserve
  returns a valid voucher)
- strict + member: voucher binds to the right peer ID
- strict + stranger: libp2p surfaces the relay's refusal as
  "reservation error: status: PERMISSION_DENIED reason:
  reservation failed" — proves the ACL ran on the real handshake
- membership flip: pre-membership denied, then Members(set)
  including the joiner is called, then the very next reservation
  attempt succeeds (mimics the alive-bucket watcher's behaviour)

Uses libp2p.ForceReachabilityPublic() in the relay host so the
relay service actually advertises itself — without it AutoNAT may
refuse to register /hop and the e2e test can't reach the ACL code
path.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>