
feat(relay): NetworkOnly ACL — gate reservations on cluster membership#1032

Merged: mudler merged 2 commits into master from feat/relay-service-network-only on May 31, 2026


@mudler mudler commented May 31, 2026


Follow-up to #1031. With the relay service offered by every edgevpn node, anyone on the public DHT who finds us can reserve a slot and get us to carry their traffic. That's fine for a permissive overlay, but in many deployments — especially private clusters — only network members should be allowed to use us as a relay.

Adds a relayv2.ACLFilter (NetworkOnlyACL) that consults the local ledger's alive bucket. The cadence is driven by a NetworkService that periodically snapshots the bucket into the ACL via an atomic.Pointer swap; AllowReserve checks reduce to a constant-time map lookup.

Bootstrap window:
A peer joining for the first time is not yet in our alive bucket
(it needs to join gossipsub to write to it). If we strict-gated
from t=0 a new peer could deadlock trying to reserve its way in.
The ACL therefore allows ALL reservations until the first
successful alive-bucket snapshot, then switches to strict mode.
In practice the window is at most one refresh tick (default 30s),
and the alive service starts gossiping its own host ID immediately
on startup so the first refresh almost always finds at least the
local node.

AllowConnect is left permissive (return true): we gate the reservation step, not in-flight relayed sessions, so a peer's existing tunnel doesn't get yanked if the alive bucket briefly flickers.
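
A minimal sketch of the state machine described above, for readers who want to see the snapshot swap and the bootstrap window side by side. It is not the merged code: `PeerID` is a hypothetical string stand-in for libp2p's `peer.ID` so the example needs only the standard library, the method signatures are simplified (the real `relayv2.ACLFilter` methods also take multiaddrs), and the field layout is an assumption.

```go
package main

import "sync/atomic"

// PeerID is a hypothetical stand-in for libp2p's peer.ID (assumption).
type PeerID string

// memberSet is one immutable snapshot of the alive bucket.
type memberSet map[PeerID]struct{}

// NetworkOnlyACL is an illustrative reimplementation, not the merged type.
type NetworkOnlyACL struct {
	// members stays nil until the first snapshot arrives; nil means
	// "bootstrap window", non-nil means "strict mode".
	members atomic.Pointer[memberSet]
}

// Members installs a new snapshot. The caller's map is copied so that later
// mutations by the caller cannot race with AllowReserve.
func (a *NetworkOnlyACL) Members(in map[PeerID]struct{}) {
	snap := make(memberSet, len(in))
	for id := range in {
		snap[id] = struct{}{}
	}
	a.members.Store(&snap)
}

// AllowReserve admits every peer during the bootstrap window and only
// listed members once a snapshot has been installed.
func (a *NetworkOnlyACL) AllowReserve(p PeerID) bool {
	snap := a.members.Load()
	if snap == nil {
		return true // no snapshot yet: bootstrap window
	}
	_, ok := (*snap)[p]
	return ok
}

// AllowConnect is always permissive: only the reservation step is gated,
// so established relayed sessions survive a flickering alive bucket.
func (a *NetworkOnlyACL) AllowConnect(src, dst PeerID) bool {
	return true
}
```

Because readers only ever load a pointer to a map that is never mutated after `Store`, the lookup path takes no lock; the cost is one map allocation per refresh.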

Knobs:
--relay-service-network-only / EDGEVPN_RELAY_SERVICE_NETWORK_ONLY
(bool, default false — opt-in for now until operators have a
chance to verify the bootstrap behaviour against their topology).
--relay-service-acl-refresh / EDGEVPN_RELAY_SERVICE_ACL_REFRESH
(duration, default 30s — should be ≤ the alive-service announce
interval).

Tests:

  • TestNetworkOnlyACLBootstrapWindowAllowsAll — open until first Members
  • TestNetworkOnlyACLMembersGates — strict mode rejects strangers
  • TestNetworkOnlyACLAllowConnect — Connect always permitted
  • TestNetworkOnlyACLMembersIsDefensivelyCopied — caller-map mutation after Members() does not race the ACL's strict-mode reads

Assisted-by: Claude:claude-opus-4-7

@mudler mudler force-pushed the feat/relay-service-network-only branch 2 times, most recently from cbe8b26 to 78b8fba Compare May 31, 2026 10:02
mudler added 2 commits May 31, 2026 10:03
Follow-up to #1031. With the relay service offered by every edgevpn
node, anyone on the public DHT who finds us can reserve a slot and
get us to carry their traffic. In a private cluster only network
members should be able to use us as a relay.

Adds a relayv2.ACLFilter (NetworkOnlyACL) that consults the local
ledger's alive bucket. A NetworkService periodically snapshots the
bucket into the ACL via an atomic.Pointer swap; AllowReserve checks
reduce to a constant-time map lookup.

Bootstrap window:
  A peer joining for the first time is not yet in our alive bucket
  (it needs to join gossipsub to write to it). If we strict-gated
  from t=0 a new peer could deadlock trying to reserve its way in.
  The ACL therefore allows ALL reservations until the first
  successful alive-bucket snapshot, then switches to strict mode.

AllowConnect is left permissive (return true): we gate the
reservation step, not in-flight relayed sessions, so a peer's
existing tunnel doesn't get yanked if the alive bucket briefly
flickers.

Knobs:
  --relay-service-network-only / EDGEVPN_RELAY_SERVICE_NETWORK_ONLY
    bool, default TRUE — secure by default. Pass =false to open
    the relay to all peers.
  --relay-service-acl-refresh / EDGEVPN_RELAY_SERVICE_ACL_REFRESH
    duration, default 30s — should be <= the alive-service announce
    interval so churn is reflected within a couple of ticks.

Tests (Ginkgo, package config_test):
- bootstrap window admits any peer until the first Members call
- strict mode admits members listed in the set
- strict mode rejects non-members
- AllowConnect stays permissive regardless of membership
- Members defensively copies the caller's map

The ACL lives in pkg/config alongside the existing relay-service
plumbing so pkg/node doesn't grow a relayv2 dependency. Wired via
the existing node.WithNetworkService pattern.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The state-machine specs cover the ACL in isolation. These put it
inside a real circuit-v2 relay and exercise the actual reservation
handshake — the contract the NetworkService relies on in production.

Four scenarios:
- bootstrap window: fresh ACL accepts any peer (client.Reserve
  returns a valid voucher)
- strict + member: voucher binds to the right peer ID
- strict + stranger: libp2p surfaces the relay's refusal as
  "reservation error: status: PERMISSION_DENIED reason:
  reservation failed" — proves the ACL ran on the real handshake
- membership flip: pre-membership denied, then Members(set)
  including the joiner is called, then the very next reservation
  attempt succeeds (mimics the alive-bucket watcher's behaviour)

Uses libp2p.ForceReachabilityPublic() in the relay host so the
relay service actually advertises itself — without it AutoNAT may
refuse to register /hop and the e2e test can't reach the ACL code
path.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler mudler force-pushed the feat/relay-service-network-only branch from 78b8fba to cef854c Compare May 31, 2026 10:03
@mudler mudler merged commit 3447758 into master May 31, 2026
13 of 17 checks passed
