[RFC] Separate internal service identity from user auth in OpenClaw gateway #69066

@jetd1

[RFC] Separate internal service identity from user auth in the OpenClaw gateway

Labels suggestion: rfc, area:gateway, area:auth, discussion, trusted-proxy
Keywords (for search): gateway auth, trusted-proxy, service identity, service account, loopback, localDirect, internal RPC, subagent auth, pairing, trusted_proxy_loopback_source


Summary

gateway.auth.mode is doing three jobs at once:

  1. User auth — who is the human on the other end (none / token / trusted-proxy / Tailscale).
  2. Internal service auth — who is the gateway's own child process (subagents, browser tool, cron, exec-approvals, CLI on the same host).
  3. Local operator auth — who is the person at 127.0.0.1 running openclaw status, openclaw cron list, etc.

All three flow through the same branch in authorizeGatewayConnect, gated by isLocalDirectRequest, the loopback guard in authorizeTrustedProxy, sharedAuthOk, controlUi.allowInsecureAuth, and pairing/device-identity checks — layered over time as exceptions. Every pass that tightens one of these knobs has ended up loosening or breaking a different one. The March–April 2026 trusted-proxy regressions (#59167, #43300, #26007, #59045, #59702, #60265, #62767, #63381, #63548, #63344, #67703, #67524, #67799) are all the same underlying coupling surfacing in different deployment shapes.

This RFC proposes making internal service identity a first-class, orthogonal axis while leaving existing user-auth modes untouched. It also outlines a phased path that ships value in the first week and does not require a single large PR.

Happy to drive the implementation — looking for agreement on shape and phasing first so the five in-flight PRs can be sequenced instead of racing each other.


The problem, in one picture

flowchart TD
  REQ[Incoming WS connection] --> MODE{auth.mode?}

  MODE -->|none| N[accept]
  MODE -->|token| T{token valid?}
  MODE -->|trusted-proxy| TP{trusted source?<br/>loopback guard?<br/>required headers?<br/>allowed user?}
  MODE -->|tailscale| TS[tailscale flow]

  T -->|no| TFX{localDirect?<br/>sharedAuthOk?<br/>allowInsecureAuth?<br/>device paired?}
  TP -->|fail| TPFX{localDirect?<br/>non-proxy failure?<br/>password set?<br/>loopback allowed?<br/>loopbackUser set?}

  subgraph PATCH[same decision tree reached by 3 different concerns]
    direction LR
    U[external user<br/>browser / remote CLI]
    S[internal service<br/>subagent / browser tool / cron]
    L[local operator<br/>same-host CLI]
  end

  U -.-> REQ
  S -.-> REQ
  L -.-> REQ

  style PATCH fill:#fff3cd,stroke:#856404
  style TFX fill:#f8d7da,stroke:#721c24
  style TPFX fill:#f8d7da,stroke:#721c24

Every red box is a patch added in response to one concern breaking, and every patch has at least once caused a different concern to break. The sequence #45264 → #54536 → #59167 is the cleanest example: each PR was correct in isolation for the case it targeted, and the composition shipped a real outage for trusted-proxy users.

Why the current hotfix wave isn't enough on its own

Five PRs are in flight, each extending the same decision tree with a different fallback path:

| PR | Approach | Case it handles |
| --- | --- | --- |
| #51070 | Allow loopback to bypass trusted-proxy auth | Any local loopback call |
| #54426 | On localDirect trusted-proxy failure, fall through with method: "none" | ACPX/CLI child processes |
| #59190 | Accept loopback-proxy requests when X-Forwarded-For resolves to a non-loopback client | Same-host reverse proxy fronting external traffic |
| #63379 | Config-driven trustedProxy.allowLoopback + loopbackUser + skip requiredHeaders on loopback | K8s / Docker sidecar + reverse proxy |
| #64122 | Password fallback for local-direct clients when trusted-proxy identity is absent (non-proxy failures only) | Existing deployments with a password already configured |

Each is internally coherent. Merged together they interfere — different PRs put the fallback in different places with different trust assumptions guarded by different invariants. Merging any one of them makes the others harder to review, because the trust model has to be re-derived from scratch each time.

The short-term user pain is real and needs Phase 0 relief. The medium-term cost of continuing this pattern — more patches, more exceptions, more combinations nobody has tested — is what this RFC is trying to stop.

Proposal

Three axes, evaluated in order

authorizeGatewayConnect(req):
  1. svc = tryServiceAccount(req)           # NEW, orthogonal
     → if ok: return { method: "service", role: "backend" }

  2. usr = authorizeUser(req, auth.mode)    # existing modes, unchanged
     → if ok: return { method, role: "operator", user }

  3. reject

User-facing modes keep their current semantics. The new check runs first, is configured independently, and is the only path the gateway's own children use. Tightening either axis stops leaking into the other.
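
The ordering above can be sketched in TypeScript roughly as follows. This is a minimal illustration under stated assumptions, not the gateway's actual code: the type names (ConnectRequest, GatewayAuthConfig, AuthResult), the field names, and the failure reasons are hypothetical, and authorizeUser models only the none and token modes to keep the example short.

```typescript
// Hypothetical types for illustration; not taken from the OpenClaw codebase.
type AuthMode = "none" | "token" | "trusted-proxy" | "tailscale";

interface ConnectRequest {
  serviceToken?: string; // presented by the gateway's own children
  userToken?: string;    // presented by users when auth.mode is "token"
}

interface GatewayAuthConfig {
  mode: AuthMode;
  userToken?: string;
  serviceToken: string;
}

type AuthResult =
  | { ok: true; method: "service"; role: "backend" }
  | { ok: true; method: AuthMode; role: "operator"; user?: string }
  | { ok: false; reason: string };

// Axis 1: service identity, evaluated without looking at auth.mode.
// A real implementation would likely use a constant-time comparison
// (for example crypto.timingSafeEqual in Node) instead of ===.
function tryServiceAccount(req: ConnectRequest, cfg: GatewayAuthConfig): boolean {
  return (
    cfg.serviceToken.length > 0 &&
    req.serviceToken !== undefined &&
    req.serviceToken === cfg.serviceToken
  );
}

// Axis 2: existing user-auth modes. Only "none" and "token" are modelled;
// trusted-proxy and tailscale are left out of this sketch on purpose.
function authorizeUser(req: ConnectRequest, cfg: GatewayAuthConfig): AuthResult {
  switch (cfg.mode) {
    case "none":
      return { ok: true, method: "none", role: "operator" };
    case "token":
      return cfg.userToken !== undefined && req.userToken === cfg.userToken
        ? { ok: true, method: "token", role: "operator" }
        : { ok: false, reason: "token_invalid" };
    default:
      return { ok: false, reason: "mode_not_modelled_in_sketch" };
  }
}

// Service check first, then user auth, then reject.
function authorizeGatewayConnect(req: ConnectRequest, cfg: GatewayAuthConfig): AuthResult {
  if (tryServiceAccount(req, cfg)) {
    return { ok: true, method: "service", role: "backend" };
  }
  return authorizeUser(req, cfg);
}
```

The point of the shape is that tryServiceAccount never reads cfg.mode, so a child presenting the service token gets the backend role even when the user-facing mode is trusted-proxy, and a failed service check simply falls through to the unchanged user path.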

Service identity, concretely

  • Token material. Gateway generates gateway.service.token at first start, stored alongside the existing auto-generated password with owner-only permissions. Rotatable via openclaw gateway rotate-service-token. No operator action required.
  • Injection. The gateway already spawns its own children and already injects OPENCLAW_GATEWAY_PORT. Extend that with OPENCLAW_SERVICE_TOKEN. GatewayClient picks it up from env.
  • Client class. Children authenticate with the service token regardless of upstream auth.mode. External CLI callers still follow the existing resolution chain.
  • Scopes. backend role, with scope sets per caller class (operator.read/write for subagents and CLI, narrower for browser tool). No unbounded admin.
  • Surface. Loopback accepted unconditionally. Off-host accepted only with explicit opt-in (gateway.service.allowRemote: true) for multi-node deployments. Off-host default deny.
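
A rough TypeScript sketch of the bullets above, covering env injection, the loopback/remote surface rule, and per-caller scopes. The names OPENCLAW_GATEWAY_PORT, OPENCLAW_SERVICE_TOKEN, gateway.service.allowRemote, and operator.read/write come from this RFC; everything else (ServiceConfig, CallerClass, buildChildEnv, authorizeService, the reason strings, and how the caller class is determined) is a hypothetical placeholder. It also assumes that "accepted unconditionally" on loopback still means a valid service token is required, and the browser-tool scope set is a guess, since the RFC only says "narrower".

```typescript
// Hypothetical names throughout, except where noted in the lead-in.
type CallerClass = "subagent" | "cli" | "browser-tool";

interface ServiceConfig {
  token: string;        // gateway.service.token
  allowRemote: boolean; // gateway.service.allowRemote (off-host opt-in)
}

// Scope sets per caller class. The browser-tool entry is an assumption.
const SCOPES: Record<CallerClass, readonly string[]> = {
  subagent: ["operator.read", "operator.write"],
  cli: ["operator.read", "operator.write"],
  "browser-tool": ["operator.read"],
};

// Recognizes IPv4 loopback, IPv6 loopback, and IPv4-mapped IPv6 loopback.
function isLoopback(addr: string): boolean {
  return addr === "::1" || addr.startsWith("127.") || addr.startsWith("::ffff:127.");
}

// Environment handed to a child the gateway spawns itself.
function buildChildEnv(
  base: Record<string, string>,
  port: number,
  svc: ServiceConfig,
): Record<string, string> {
  return {
    ...base,
    OPENCLAW_GATEWAY_PORT: String(port),
    OPENCLAW_SERVICE_TOKEN: svc.token,
  };
}

type ServiceDecision =
  | { ok: true; role: "backend"; scopes: readonly string[] }
  | { ok: false; reason: string };

function authorizeService(
  presented: string | undefined,
  remoteAddress: string,
  caller: CallerClass,
  svc: ServiceConfig,
): ServiceDecision {
  // A missing or wrong token never authenticates, on any surface.
  if (!presented || presented !== svc.token) {
    return { ok: false, reason: "service_token_invalid" };
  }
  // Off-host callers are denied unless the operator opted in.
  if (!isLoopback(remoteAddress) && !svc.allowRemote) {
    return { ok: false, reason: "service_remote_denied" };
  }
  return { ok: true, role: "backend", scopes: SCOPES[caller] };
}
```

A child spawned with buildChildEnv would read OPENCLAW_SERVICE_TOKEN from its environment and present it on connect. With allowRemote left at false, the same token presented from a non-loopback address is refused, which matches the off-host default deny.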

What this removes

With service identity in place, the localDirect special-casing inside trusted-proxy, the sharedAuthOk shortcut, and the password-fallback path all stop being necessary for internal callers. User auth stays focused on users. Service auth stays focused on the gateway's own children. The loopback guard in trusted-proxy can stay strict without breaking anything, because internal callers no longer depend on it.

What this does not change

  • gateway.auth.mode values, semantics, or on-the-wire behavior for user traffic.
  • Trusted-proxy header handling for actual reverse-proxied user traffic.
  • Device-pairing flow for operator browser sessions.
  • Tailscale header auth.

Phased plan

Ship value early. Keep each phase independently reviewable. Avoid a single large PR in this area — that pattern has a bad track record here.

Phase 0 — Hotfix triage (now)

Goal: unblock the people reporting breakage today without waiting for Phase 1.

Of the five in-flight PRs, two cover the two most-reported shapes with the clearest trust stories:

Recommendation: merge both as complementary opt-ins. #59190 can slot in alongside for same-host reverse proxies that forward real external traffic. #51070 and #54426 are superseded by Phase 1 and can close with a pointer to this RFC.

Phase 0 is explicitly tactical. Nothing in it constrains the Phase 1 design — all of it is removed cleanly in Phase 2.

Phase 1 — Internal service identity (first release after Phase 0)

Outcome: the whole class of "subagent/internal RPC fails with pairing required / trusted_proxy_loopback_source" reports disappears regardless of auth.mode.

Phase 2 — Retire legacy fallbacks

Gated on Phase 1 having been the default for at least one release. Deprecation notes in changelog, openclaw doctor flagging affected configs.

Phase 3 (optional, future)

Listed so Phase 1's design doesn't foreclose this, not committed to in this RFC.

Compatibility and migration

Non-goals

Open questions

References

Reports attributable to the current coupling
#26007 · #43300 · #59167 · #59045 · #59702 · #60265 · #62767 · #63381 · #63548 · #63344 · #67703 · #67524 · #67799 · #55218 · #46897 · #48847 · #49201 · #52647 · #57434 · #59882

Complementary concerns (same axis, different direction)
#57087 (external-side guardrails) · #63344 (local backend client class) · #43786 (auth.mode=none still required by some deployments) · #50751 (CLI host resolution) · #56982 (doctor output for trusted-proxy)

PRs this RFC would supersede or subsume (Phase 2 onward)
#51070 · #54426 · #59190 · #63379 · #64122

Context
#45264 · #54536 (regression boundary) · #44044 · #49107 · #33819 · #54718 · #9271

Ask

  1. Agreement in principle on the three-axis framing.
  2. Phase 0 triage decision: which of the in-flight PRs get merged now, which close as superseded.
  3. A maintainer willing to sponsor Phase 1 review. Author will drive the implementation.

cc @vincentkoc · @nickytonline · @mrosmarin
