[RFC] Separate internal service identity from user auth in OpenClaw gateway

# [RFC] Separate internal service identity from user auth in the OpenClaw gateway

**Labels suggestion:** `rfc`, `area:gateway`, `area:auth`, `discussion`, `trusted-proxy`
**Keywords (for search):** gateway auth, trusted-proxy, service identity, service account, loopback, localDirect, internal RPC, subagent auth, pairing, `trusted_proxy_loopback_source`

---

## Summary

`gateway.auth.mode` is doing three jobs at once:

1. **User auth** — who is the human on the other end (`none` / `token` / `trusted-proxy` / Tailscale).
2. **Internal service auth** — who is the gateway's own child process (subagents, browser tool, cron, exec-approvals, CLI on the same host).
3. **Local operator auth** — who is the person at `127.0.0.1` running `openclaw status`, `openclaw cron list`, etc.

All three flow through the same branch in `authorizeGatewayConnect`. That branch is gated by `isLocalDirectRequest`, the loopback guard in `authorizeTrustedProxy`, `sharedAuthOk`, `controlUi.allowInsecureAuth`, and pairing/device-identity checks, each layered on over time as an exception. Every pass that tightens one of these knobs has ended up loosening or breaking a different one. The March–April 2026 trusted-proxy regressions (#59167, #43300, #26007, #59045, #59702, #60265, #62767, #63381, #63548, #63344, #67703, #67524, #67799) are all the same underlying coupling surfacing in different deployment shapes.

This RFC proposes making internal service identity a first-class, orthogonal axis while leaving existing user-auth modes untouched. It outlines a phased path that ships value in the first week and does not require a single large PR.

Happy to drive the implementation — looking for agreement on shape and phasing first so the five in-flight PRs can be sequenced instead of racing each other.

---

## The problem, in one picture

```mermaid
flowchart TD
 REQ[Incoming WS connection] --> MODE{auth.mode?}

 MODE -->|none| N[accept]
 MODE -->|token| T{token valid?}
 MODE -->|trusted-proxy| TP{trusted source? loopback guard? required headers? allowed user?}
 MODE -->|tailscale| TS[tailscale flow]

 T -->|no| TFX{localDirect? sharedAuthOk? allowInsecureAuth? device paired?}
 TP -->|fail| TPFX{localDirect? non-proxy failure? password set? loopback allowed? loopbackUser set?}

 subgraph PATCH[same decision tree reached by 3 different concerns]
 direction LR
 U["external user: browser / remote CLI"]
 S["internal service: subagent / browser tool / cron"]
 L["local operator: same-host CLI"]
 end

 U -.-> REQ
 S -.-> REQ
 L -.-> REQ

 style PATCH fill:#fff3cd,stroke:#856404
 style TFX fill:#f8d7da,stroke:#721c24
 style TPFX fill:#f8d7da,stroke:#721c24
```

Every red box is a patch added in response to one concern breaking, and every patch has at least once caused a different concern to break. #45264 → #54536 → #59167 is the cleanest example: each PR was correct in isolation for the case it targeted, and the composition shipped a real outage for trusted-proxy users.

## Why the current hotfix wave isn't enough on its own

Five PRs are in flight, each extending the same decision tree with a different fallback path:

| PR | Approach | Case it handles |
|---|---|---|
| #51070 | Allow loopback to bypass trusted-proxy auth | Any local loopback call |
| #54426 | On `localDirect` trusted-proxy failure, fall through with `method: "none"` | ACPX/CLI child processes |
| #59190 | Accept loopback-proxy requests when `X-Forwarded-For` resolves to a non-loopback client | Same-host reverse proxy fronting external traffic |
| #63379 | Config-driven `trustedProxy.allowLoopback` + `loopbackUser` + skip `requiredHeaders` on loopback | K8s / Docker sidecar + reverse proxy |
| #64122 | Password fallback for local-direct clients when trusted-proxy identity is absent (non-proxy failures only) | Existing deployments with a password already configured |

Each is internally coherent. Merged together, they interfere: each PR puts its fallback at a different point in the decision tree, under different trust assumptions, guarded by different invariants. Merging any one of them makes the others harder to review, because the trust model has to be re-derived from scratch each time.

The short-term user pain is real and needs Phase 0 relief. The medium-term cost of continuing this pattern — more patches, more exceptions, more combinations nobody has tested — is what this RFC is trying to stop.

## Proposal

### Three axes, evaluated in order

```
authorizeGatewayConnect(req):
 1. svc = tryServiceAccount(req) # NEW, orthogonal
 → if ok: return { method: "service", role: "backend" }

 2. usr = authorizeUser(req, auth.mode) # existing modes, unchanged
 → if ok: return { method, role: "operator", user }

 3. reject
```

User-facing modes keep their current semantics. The new check runs first, is configured independently, and is the only path the gateway's own children use. Tightening either axis stops leaking into the other.
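
As a rough TypeScript sketch of the ordering above: the request and result shapes, the `UserAuthorizer` callback, and the plain string comparison are all illustrative assumptions, not the gateway's actual types or code. The point is only that step 1 never reads `auth.mode`, and step 2 never sees service tokens.

```typescript
// Hypothetical shapes for illustration; the real gateway types may differ.
type AuthMode = "none" | "token" | "trusted-proxy" | "tailscale";

interface ConnectRequest {
  serviceToken?: string; // presented only by gateway-spawned children
  userCredential?: string; // whatever the configured user mode expects
}

type AuthResult =
  | { ok: true; method: "service"; role: "backend" }
  | { ok: true; method: AuthMode; role: "operator"; user: string }
  | { ok: false; reason: string };

// Stand-in for the existing per-mode user auth, which stays unchanged.
// Returns the authenticated user, or null when user auth fails.
type UserAuthorizer = (req: ConnectRequest, mode: AuthMode) => string | null;

// Stand-in for the new service check. A real implementation would use a
// constant-time comparison; plain equality keeps the sketch short.
function tryServiceAccount(req: ConnectRequest, expected: string): boolean {
  return expected.length > 0 && req.serviceToken === expected;
}

function authorizeGatewayConnect(
  req: ConnectRequest,
  mode: AuthMode,
  serviceToken: string,
  authorizeUser: UserAuthorizer,
): AuthResult {
  // 1. Service identity runs first and never consults auth.mode.
  if (tryServiceAccount(req, serviceToken)) {
    return { ok: true, method: "service", role: "backend" };
  }
  // 2. Existing user modes, with their current semantics.
  const user = authorizeUser(req, mode);
  if (user !== null) {
    return { ok: true, method: mode, role: "operator", user };
  }
  // 3. No fallback chain: everything else is rejected.
  return { ok: false, reason: "unauthorized" };
}
```

In this shape, a strict trusted-proxy `authorizeUser` that rejects every loopback request still lets a child holding the service token connect, which is the decoupling the RFC is after.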

### Service identity, concretely

- **Token material.** Gateway generates `gateway.service.token` at first start, stored alongside the existing auto-generated password with owner-only permissions. Rotatable via `openclaw gateway rotate-service-token`. No operator action required.
- **Injection.** The gateway already spawns its own children and already injects `OPENCLAW_GATEWAY_PORT`. Extend that with `OPENCLAW_SERVICE_TOKEN`. `GatewayClient` picks it up from env.
- **Client class.** Children authenticate with the service token regardless of upstream `auth.mode`. External CLI callers still follow the existing resolution chain.
- **Scopes.** `backend` role, with scope sets per caller class (operator.read/write for subagents and CLI, narrower for browser tool). No unbounded admin.
- **Surface.** Loopback accepted unconditionally. Off-host accepted only with explicit opt-in (`gateway.service.allowRemote: true`) for multi-node deployments. Off-host default deny.
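
The token and surface rules in the list above could be sketched roughly as follows. Only `gateway.service.allowRemote` comes from this RFC; `ServicePolicy`, `isLoopback`, `tokensMatch`, and `acceptServiceConnection` are hypothetical names, and the sketch assumes the address checked is the socket peer address rather than anything taken from forwarded headers.

```typescript
// Hypothetical policy shape; field names are assumptions for this sketch.
interface ServicePolicy {
  token: string; // value of gateway.service.token
  allowRemote: boolean; // gateway.service.allowRemote, default false
}

// True for IPv4 loopback, IPv6 loopback, and IPv4-mapped IPv6 loopback.
function isLoopback(remoteAddress: string): boolean {
  const addr = remoteAddress.startsWith("::ffff:")
    ? remoteAddress.slice(7)
    : remoteAddress;
  return addr === "::1" || /^127\.\d{1,3}\.\d{1,3}\.\d{1,3}$/.test(addr);
}

// Compares every character even after a mismatch, so the running time does
// not reveal the position of the first differing character.
function tokensMatch(presented: string, expected: string): boolean {
  if (presented.length !== expected.length) return false;
  let diff = 0;
  for (let i = 0; i < expected.length; i++) {
    diff |= presented.charCodeAt(i) ^ expected.charCodeAt(i);
  }
  return diff === 0;
}

function acceptServiceConnection(
  remoteAddress: string,
  presentedToken: string | undefined,
  policy: ServicePolicy,
): boolean {
  // A missing token, or an unconfigured one, never authenticates.
  if (presentedToken === undefined || policy.token.length === 0) return false;
  if (!tokensMatch(presentedToken, policy.token)) return false;
  // Loopback is always accepted; off-host needs the explicit opt-in.
  return isLoopback(remoteAddress) || policy.allowRemote;
}
```

In a Node.js implementation, `crypto.timingSafeEqual` over equal-length buffers would be the more conventional choice than the hand-rolled loop; the loop is used here only to keep the sketch free of runtime-specific imports.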

### What this removes

With service identity in place, the `localDirect` special-casing inside trusted-proxy, the `sharedAuthOk` shortcut, and the password-fallback path all stop being necessary for internal callers. User auth stays focused on users. Service auth stays focused on the gateway's own children. The loopback guard in trusted-proxy can stay strict without breaking anything, because internal callers no longer depend on it.

### What this does not change

- `gateway.auth.mode` values, semantics, or on-the-wire behavior for user traffic.
- Trusted-proxy header handling for actual reverse-proxied user traffic.
- Device-pairing flow for operator browser sessions.
- Tailscale header auth.

## Phased plan

Ship value early. Keep each phase independently reviewable. Avoid a single large PR in this area — that pattern has a bad track record here.

### Phase 0 — Hotfix triage (now)

Goal: unblock the people reporting breakage today without waiting for Phase 1.

Of the five in-flight PRs, two cover the two most-reported shapes with the clearest trust stories:

- **#64122** — password fallback with `isNonProxyFailure && localDirect` gating plus rate-limiting. Zero config change required for the existing broken shape (deployment already has a password). Smallest diff.
- **#63379** — `allowLoopback` + `loopbackUser`, explicit opt-in, closes the original feature request (#26007, #43300). Right shape for K8s / Docker sidecar deployments where no password exists.

Recommendation: merge both as complementary opt-ins. #59190 can slot in alongside for same-host reverse proxies that forward real external traffic. #51070 and #54426 are superseded by Phase 1 and can close with a pointer to this RFC.

Phase 0 is explicitly tactical. Nothing in it constrains the Phase 1 design — all of it is removed cleanly in Phase 2.

### Phase 1 — Internal service identity (first release after Phase 0)

- Auto-generate and persist `gateway.service.token` at first start.
- `GatewayClient` picks up the token from env when spawned by the gateway.
- `authorizeGatewayConnect` gains a service-identity check that runs before user auth. Loopback-only in this phase.
- Migrate known internal callers: subagent, browser tool, cron, exec-approvals, Telegram native approvals.
- Feature-flagged (`gateway.service.enabled`, default `false` → `true` after one release) to catch platform surprises before defaulting on. Windows in particular has live issues around token persistence (#53742, #66038, #61340, #67595) that we should not inherit.

Outcome: the whole class of "subagent/internal RPC fails with `pairing required` / `trusted_proxy_loopback_source`" reports disappears regardless of `auth.mode`.
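
A minimal sketch of the Phase 1 injection and pickup path, under stated assumptions: `OPENCLAW_GATEWAY_PORT`, `OPENCLAW_SERVICE_TOKEN`, and `gateway.service.enabled` are the names used in this RFC, while `buildChildEnv`, `resolveClientCredential`, the `Credential` type, and the idea of modelling the existing resolution chain as a callback are hypothetical. The environment is passed in as a plain record so the logic can be exercised without spawning anything.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical view of the two Phase 1 settings.
interface ServiceConfig {
  enabled: boolean; // gateway.service.enabled
  token: string; // gateway.service.token
}

type Credential =
  | { kind: "service"; token: string }
  | { kind: "user"; source: string };

// Gateway side: build the environment handed to a spawned child.
function buildChildEnv(parentEnv: Env, port: number, service: ServiceConfig): Env {
  const env: Env = { ...parentEnv, OPENCLAW_GATEWAY_PORT: String(port) };
  if (service.enabled) {
    env["OPENCLAW_SERVICE_TOKEN"] = service.token;
  } else {
    // Flag off: do not forward a stale token inherited from the parent,
    // so behavior matches a gateway without service identity.
    delete env["OPENCLAW_SERVICE_TOKEN"];
  }
  return env;
}

// Client side: prefer the injected service token, otherwise defer to the
// existing credential resolution chain (represented here by a callback).
function resolveClientCredential(env: Env, existingChain: () => Credential): Credential {
  const token = env["OPENCLAW_SERVICE_TOKEN"];
  if (token !== undefined && token.length > 0) {
    return { kind: "service", token };
  }
  return existingChain();
}
```

External CLI callers never receive the variable, so in this sketch they always land on `existingChain()`, matching the "existing resolution chain" behavior described for them.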

### Phase 2 — Retire legacy fallbacks

- Remove `localDirect` fallback inside the trusted-proxy branch.
- Retire the Phase 0 hotfix paths.
- Harden the trusted-proxy external story (#57087): explicit `insecureAllowNoUpstreamAuth: true` opt-in plus runtime warnings, now possible without breaking internal callers.

Gated on Phase 1 having been the default for at least one release. Deprecation notes in changelog, `openclaw doctor` flagging affected configs.
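
One possible shape for the `openclaw doctor` check, as a sketch only. The `allowLoopback` and `loopbackUser` keys are taken from #63379 as described in the PR table; their exact config path, the `GatewayConfigSketch` type, the function name, and the warning text are assumptions, and the real doctor command may be structured quite differently.

```typescript
// Hypothetical subset of the gateway config relevant to Phase 2.
interface GatewayConfigSketch {
  auth: {
    mode: "none" | "token" | "trusted-proxy" | "tailscale";
    trustedProxy?: { allowLoopback?: boolean; loopbackUser?: string };
  };
  service?: { enabled?: boolean };
}

// Returns one human-readable warning per setting that Phase 2 would affect.
function findLegacyFallbackWarnings(cfg: GatewayConfigSketch): string[] {
  const warnings: string[] = [];
  const tp = cfg.auth.trustedProxy;
  const isTrustedProxy = cfg.auth.mode === "trusted-proxy";

  if (isTrustedProxy && tp?.allowLoopback === true) {
    warnings.push("trustedProxy.allowLoopback is a Phase 0 fallback scheduled for removal");
  }
  if (isTrustedProxy && tp?.loopbackUser !== undefined) {
    warnings.push("trustedProxy.loopbackUser is a Phase 0 fallback scheduled for removal");
  }
  // Assumption: explicitly disabling service identity means internal
  // callers still depend on the fallbacks that Phase 2 removes.
  if (cfg.service?.enabled === false) {
    warnings.push("gateway.service.enabled is false; internal callers still rely on legacy fallbacks");
  }
  return warnings;
}
```

A config that uses none of these settings produces an empty list, so the check stays silent for deployments that Phase 2 does not touch.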

### Phase 3 (optional, future)

- Multiple service principals with distinct scope sets (compromised subagent shouldn't inherit exec-approval authority).
- Rotation hooks, audit log entries, metrics.
- Cross-node service auth for multi-node deployments (#33819, #54718).

Listed so Phase 1's design doesn't foreclose this, not committed to in this RFC.

## Compatibility and migration

- Existing `auth.mode` configs work unchanged in every phase.
- Phase 1 is additive. `gateway.service.enabled: false` reproduces today's behavior exactly.
- Phase 2 is the only phase that removes behavior, gated on at least one release of Phase 1 being live. Docs + doctor warnings at each step.
- Trusted-proxy docs get the #57087 guardrail pass during Phase 2.

## Non-goals

- Zero-trust outbound secrets handling (#9271 territory).
- Multi-user RBAC (#8081 territory). Phase 3 may enable it but it is not a goal here.
- Redesigning device pairing. Pairing stays as-is for operator browser sessions; service identity routes around it for internal callers, which is the minimal change.

## Open questions

- Service token storage on Windows given plist / systemd pollution patterns already seen (#53742, #66038, #61340, #67595, #42808). Leaning toward an ACL-guarded file under `%LOCALAPPDATA%`; want a second opinion from people who hit those bugs.
- `openclaw gateway install --force` behavior: regenerate or preserve the service token? Current token-handling behavior (#67595) suggests preserve, with an explicit `--rotate-service-token` opt-in.
- Managed-hosting recoverability (#62767 — no SSH access). Service token needs a recovery story that doesn't require console.
- Multi-node shape (#33819): Phase 3, or should Phase 1 already ship something that generalizes?
- Naming. `service-account` carries K8s / GCP baggage; `service-identity` is more neutral but less discoverable. Bikeshed welcome.

## References

**Reports attributable to the current coupling**
#26007 · #43300 · #59167 · #59045 · #59702 · #60265 · #62767 · #63381 · #63548 · #63344 · #67703 · #67524 · #67799 · #55218 · #46897 · #48847 · #49201 · #52647 · #57434 · #59882

**Complementary concerns (same axis, different direction)**
#57087 (external-side guardrails) · #63344 (local backend client class) · #43786 (auth.mode=none still required by some deployments) · #50751 (CLI host resolution) · #56982 (doctor output for trusted-proxy)

**PRs this RFC would supersede or subsume (Phase 2 onward)**
#51070 · #54426 · #59190 · #63379 · #64122

**Context**
#45264 · #54536 (regression boundary) · #44044 · #49107 · #33819 · #54718 · #9271

## Ask

1. Agreement in principle on the three-axis framing.
2. Phase 0 triage decision: which of the in-flight PRs get merged now, which close as superseded.
3. A maintainer willing to sponsor Phase 1 review. Author will drive the implementation.

cc @vincentkoc · @nickytonline · @mrosmarin

[RFC] Separate internal service identity from user auth in OpenClaw gateway #69066
