Skip to content

Fix local gateway detail probes on loopback#46556

Closed
jeffrey4341 wants to merge 8 commits into
openclaw:mainfrom
jeffrey4341:fix/local-gateway-probe-device-identity
Closed

Fix local gateway detail probes on loopback#46556
jeffrey4341 wants to merge 8 commits into
openclaw:mainfrom
jeffrey4341:fix/local-gateway-probe-device-identity

Conversation

@jeffrey4341

@jeffrey4341 jeffrey4341 commented Mar 14, 2026

Copy link
Copy Markdown

Summary

  • attach device identity for localhost CLI gateway calls even when a shared gateway token/password is present
  • keep gateway detail probes (status, system-presence, last-heartbeat) working in local token-auth mode
  • raise local loopback gateway probe budget from 800ms to 3000ms so full localhost detail probes do not false-timeout

Closes #46557.

Problem

On a local gateway configured with gateway.auth.mode=token, the CLI currently suppresses deviceIdentity on loopback calls whenever a shared token/password is present. That means simple health checks work, but detail RPCs that rely on paired operator scopes can fail with missing scope: operator.read.

Separately, openclaw gateway probe only budgets 800ms for the localLoopback target, which can be too short for a full detail probe even when the gateway is healthy.

Fix

  • remove the loopback special case that disabled deviceIdentity for local token/password gateway calls
  • do the same for explicit gateway probes
  • increase localLoopback probe budget to 3000ms
  • update/add tests covering both behaviors

Tests

  • corepack pnpm exec vitest run --config vitest.gateway.config.ts src/gateway/call.test.ts src/gateway/probe.test.ts
  • corepack pnpm exec vitest run src/commands/gateway-status/helpers.test.ts

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime commands Command implementations size: S labels Mar 14, 2026
@greptile-apps

greptile-apps Bot commented Mar 14, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes two related issues with local gateway probes in token-auth mode: device identity was being suppressed for loopback calls (causing missing scope: operator.read errors on detail RPCs), and the loopback probe budget was too tight at 800 ms for full detail probes. The fix removes the loopback carve-out in both call.ts and probe.ts and raises the localLoopback cap in resolveProbeBudgetMs from 800 ms to 3000 ms.

  • call.tsshouldAttachDeviceIdentityForGatewayCall now unconditionally returns true; its parameter is accepted but unused (marked _params). The function is essentially dead code at this point and could be inlined or simplified.
  • probe.ts — the disableDeviceIdentity IIFE and its explicit null override are removed, leaving deviceIdentity unset (default) for all probe targets; isLoopbackHost import is correctly cleaned up.
  • helpers.tslocalLoopback probe budget cap raised from 800 ms → 3000 ms with a clear explanatory comment.
  • Tests — all three changed test files correctly reflect the new behaviour. One new test in probe.test.ts has a misleading description ("keeps device identity enabled for remote probes") when the assertion actually checks for undefined, not an explicitly enabled value.

Confidence Score: 4/5

  • Safe to merge; changes are narrow and well-tested with only minor style issues remaining.
  • All three functional changes (device-identity suppression removal in call.ts, same in probe.ts, budget bump in helpers.ts) are small, clearly motivated, and covered by updated or new tests. The only concerns are non-blocking: the shouldAttachDeviceIdentityForGatewayCall wrapper accepting unused parameters, and a misleading test description in probe.test.ts. No logic errors or security regressions detected.
  • src/gateway/call.ts — shouldAttachDeviceIdentityForGatewayCall accepts _params but ignores them entirely; consider simplifying.

Comments Outside Diff (1)

  1. src/gateway/probe.test.ts, line 64-72 (link)

    Misleading test description

    The test is named "keeps device identity enabled for remote probes" but then asserts deviceIdentity is undefined (i.e. not explicitly set). In the GatewayClient options, undefined means "use default behaviour" — not "enabled". The assertion is correct, but the description contradicts what it is actually checking and could confuse future readers.

    A clearer name would be something like "does not explicitly disable device identity for remote probes" or "leaves device identity as default for remote probes".

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/gateway/probe.test.ts
    Line: 64-72
    
    Comment:
    **Misleading test description**
    
    The test is named "keeps device identity **enabled** for remote probes" but then asserts `deviceIdentity` is `undefined` (i.e. not explicitly set). In the GatewayClient options, `undefined` means "use default behaviour" — not "enabled". The assertion is correct, but the description contradicts what it is actually checking and could confuse future readers.
    
    A clearer name would be something like `"does not explicitly disable device identity for remote probes"` or `"leaves device identity as default for remote probes"`.
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/gateway/call.ts
Line: 84-93

Comment:
**Dead function — parameter is accepted but fully ignored**

`shouldAttachDeviceIdentityForGatewayCall` now accepts `_params` (url, token, password) and unconditionally returns `true`, making the wrapper indistinguishable from an inline `true`. The underscore prefix signals the parameter is intentionally unused, but keeping the function alive adds indirection without benefit. Consider either inlining the call site or simplifying the signature to make the intent explicit:

```suggestion
function shouldAttachDeviceIdentityForGatewayCall(): boolean {
  // Always attach device identity so paired operator scopes remain
  // available for detail RPCs such as status / system-presence / last-heartbeat,
  // even when local CLI calls authenticate with a shared gateway token/password.
  return true;
}
```

And update the call site accordingly:

```ts
deviceIdentity: shouldAttachDeviceIdentityForGatewayCall()
  ? loadOrCreateDeviceIdentity()
  : undefined,
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/gateway/probe.test.ts
Line: 64-72

Comment:
**Misleading test description**

The test is named "keeps device identity **enabled** for remote probes" but then asserts `deviceIdentity` is `undefined` (i.e. not explicitly set). In the GatewayClient options, `undefined` means "use default behaviour" — not "enabled". The assertion is correct, but the description contradicts what it is actually checking and could confuse future readers.

A clearer name would be something like `"does not explicitly disable device identity for remote probes"` or `"leaves device identity as default for remote probes"`.

```suggestion
  it("does not explicitly disable device identity for remote probes", async () => {
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 7bbaad2

Comment thread src/gateway/call.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

const client = new GatewayClient({
url: opts.url,
token: opts.auth?.token,
password: opts.auth?.password,
scopes: [READ_SCOPE],

P2 Badge Keep lightweight probes independent of device identity

By dropping the loopback deviceIdentity: null override in probeGateway, even includeDetails: false probes now instantiate GatewayClient with its default identity loading path. That makes the lightweight reachability path used by src/cli/daemon-cli/restart-health.ts depend on creating/reading identity files, so on hosts with unwritable state dirs the probe throws before handshake and the restart health check can incorrectly treat a healthy local gateway as unhealthy. Since this mode intentionally skips detail RPCs, it should not require device-identity setup.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@openclaw-barnacle openclaw-barnacle Bot added the agents Agent runtime and tooling label Mar 14, 2026
@heavenlost

Copy link
Copy Markdown
Contributor

Validated locally: removing the loopback deviceIdentity: null path fixes the false missing operator.read degradation on local probe.

What we confirmed

Before local patch:

  • local paired device auth already had full operator scopes
  • direct/local RPC worked
  • gateway status --json could show rpc.ok = true
  • but gateway probe --json still degraded to missing operator.read

After local patch:

  • gateway probe --json returned:
    • ok = true
    • degraded = false
    • connect.rpcOk = true

Additional observation from post-update automation

After gateway restart, probe can still be a little cold-start-sensitive for a short window.

In our post-update hook, the most reliable behavior was:

  • longer timeout
  • retry (instead of single-shot probe immediately after restart)

So this PR direction looks right, and I’d strongly support also keeping an eye on the local-loopback probe budget / cold-start behavior.

In short:

  1. keep device identity enabled for authenticated loopback probe
  2. keep/raise local detail probe budget as needed
  3. post-restart automation benefits from retry

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d5207d1fc4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/gateway/probe.ts Outdated
mode: GATEWAY_CLIENT_MODES.PROBE,
instanceId,
deviceIdentity: disableDeviceIdentity ? null : undefined,
deviceIdentity: opts.includeDetails === false ? null : undefined,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fall back when local probe identity cannot be persisted

probeGateway now sends deviceIdentity: undefined for detail probes, which makes GatewayClient eagerly call loadOrCreateDeviceIdentity() (src/gateway/client.ts:146-149). On first run in environments where the state dir is not writable (for example read-only HOME in CI/containers), that helper throws while writing identity/device.json (src/infra/device-identity.ts:108-117), so openclaw gateway status/detail probes fail before any RPC even when token/password auth is valid. Previously loopback probes passed null and avoided this failure path.

Useful? React with 👍 / 👎.

@openclaw-barnacle openclaw-barnacle Bot removed the agents Agent runtime and tooling label Mar 15, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d2e8a493a3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/gateway/call.ts Outdated
Comment on lines +84 to +88
function shouldAttachDeviceIdentityForGatewayCall(): boolean {
// Even when local CLI calls authenticate with a shared gateway token/password,
// we still want to attach device identity so paired operator scopes remain
// available for detail RPCs such as status / system-presence / last-heartbeat.
return true;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Add fallback for device identity write failures

shouldAttachDeviceIdentityForGatewayCall() now always returns true, so local loopback calls that use shared token/password always execute loadOrCreateDeviceIdentity(). In read-only or restricted state directories (for example CI containers with unwritable HOME), that helper throws while writing identity/device.json, causing callGateway to fail before any RPC even though token/password auth is valid; this regresses local token-auth CLI calls compared to the prior loopback bypass and unlike probeGateway, which now explicitly falls back to null on this failure path.

Useful? React with 👍 / 👎.

@rory78669-bot

Copy link
Copy Markdown

I hit this locally on 2026.3.13 and dug it all the way down to the shipped dist/runtime behavior. This PR is aimed at the right bug.

High-signal findings from runtime reproduction:

  • The actual CLI gateway probe path goes through dist/probe-auth-*.js, not the simpler helper path I first expected.
  • In that probe helper, loopback probes explicitly set deviceIdentity: null while requesting [READ_SCOPE] and then issue health, status, system-presence, and config.get on the same probe connection.
  • Repro split was clean:
    • probe-like GatewayClient with auto device identity => works
    • same probe-like client with deviceIdentity: null => connects, then fails follow-up RPC with INVALID_REQUEST missing scope: operator.read
  • Calling the shipped exported probeGateway() directly against ws://127.0.0.1:18789 with a fresh token returned:
    • ok: false
    • connectLatencyMs: 2667
    • error: "missing scope: operator.read"

That exposed the second bug: the CLI status/probe wrapper was masking the real failure with an artificial timeout because of the per-target budget caps. On my box, the detail probe took ~2.6s, so the existing 800ms/1500ms caps could turn the real scope-limited failure into a fake timeout / latencyMs:null result.

I applied a local dist patch equivalent to the intent of this PR:

  • stop nulling loopback probe device identity
  • raise probe budget

After that:

  • openclaw gateway probe --url ws://127.0.0.1:18789 --token ... --timeout 15000 --json returned ok:true, rpcOk:true, scopeLimited:false, latencyMs:2596
  • openclaw status flipped from unreachable (missing scope: operator.read) to a normal reachable gateway line

So from a live installed-runtime perspective: yes, this is the right fix direction.

One practical note: on this machine, ~3000ms would likely pass, but only barely. I saw ~2596ms on a healthy localhost detailed probe, so any extra noise could make that threshold a little tight.

@rory78669-bot

Copy link
Copy Markdown

One more thing after reading the actual diff: I think there are still two gaps here.

1) The budget bump only fixes localLoopback, not explicit loopback URLs

resolveProbeBudgetMs() in this PR changes only:

  • localLoopback: 800ms -> 3000ms

but leaves:

  • explicit: 1500ms
  • sshTunnel: 2000ms

That matters because a very normal repro path is:

openclaw gateway probe --url ws://127.0.0.1:18789 --token ...

That target is explicit, not localLoopback.

On my machine, the real detailed probe against loopback took about 2596ms once the device-identity issue was fixed. So with this PR as written, openclaw status may improve while explicit loopback probe can still false-timeout simply because the explicit cap remains at 1500ms.

I’d strongly suggest either:

  • treating explicit loopback URLs like localLoopback, or
  • just raising the explicit budget too

because otherwise the fix is only partial.

2) call.ts now always loads device identity, but without the fallback that probe.ts has

In probe.ts, this PR now does the right thing for restricted/read-only environments:

  • try loadOrCreateDeviceIdentity()
  • if that throws, fall back to null

But in call.ts, the new behavior is:

deviceIdentity: shouldAttachDeviceIdentityForGatewayCall()
  ? loadOrCreateDeviceIdentity()
  : undefined,

and shouldAttachDeviceIdentityForGatewayCall() now always returns true.

So detail calls improve, but the call path still looks harsher than probe in environments where identity persistence is unavailable. That may be fine if there is a higher-level catch I missed, but at minimum it’s an asymmetry worth double-checking before merge.

Bottom line

I agree with the core fix direction.

I just don’t think localLoopback -> 3000ms is enough on its own, because explicit loopback URLs are still on the old short leash, and those are part of the real repro surface.

@heavenlost

heavenlost commented Mar 16, 2026

Copy link
Copy Markdown
Contributor

@jeffrey4341 Thanks — I agreed with both gaps and pushed a verified follow-up here:

This follow-up:

  1. treats effective loopback URLs like local loopback for probe budgeting, so explicit --url ws://127.0.0.1:18789 gets the same 3000ms detail budget;
  2. aligns call.ts with probe.ts by falling back to deviceIdentity: null when identity persistence fails, instead of throwing pre-RPC in read-only / restricted environments;
  3. adds coverage for explicit loopback budgeting and the call.ts identity-persistence failure path.

Verified with:

  • pnpm exec vitest run --config vitest.gateway.config.ts src/gateway/call.test.ts src/gateway/probe.test.ts
  • pnpm exec vitest run src/commands/gateway-status/helpers.test.ts

I couldn't push directly to jeffrey4341:fix/local-gateway-probe-device-identity from this account, so this is ready to cherry-pick / merge from the branch above.

ClaudeZackaryFair and others added 2 commits March 16, 2026 10:42
@jeffrey4341

Copy link
Copy Markdown
Author

Picked this up and integrated the verified follow-up from @heavenlost into this PR branch.

What I updated:

  • included the explicit loopback budget handling so --url ws://127.0.0.1:18789 gets the same local detail probe budget treatment;
  • aligned call.ts with the probe path so identity persistence failure falls back to deviceIdentity: null instead of failing pre-RPC;
  • merged the latest origin/main into this branch and resolved the resulting conflicts.

Local verification after the update:

  • npx vitest run --config vitest.gateway.config.ts src/gateway/call.test.ts src/gateway/probe.test.ts
  • npx vitest run src/commands/gateway-status/helpers.test.ts
  • npx vitest run src/tts/tts.test.ts

All passed locally. CI is rerunning on the updated branch now.

@heavenlost

Copy link
Copy Markdown
Contributor

I dug through the PR checks against the base main state from March 16, 2026.

The red CI on this PR does not look specific to the gateway change itself. This branch appears to be sitting on top of a red baseline.

Evidence:

  • PR run 23139899782 (this PR head) failed startup-memory with status --json used 2200.9 MB RSS (limit 925 MB).
  • main run 23139854942 on March 16, 2026 also failed startup-memory with status --json used 2220.4 MB RSS (limit 925 MB).

That same March 16, 2026 main run was already red across several unrelated jobs (channels, extensions, loader, multiple Windows shards, etc.), which matches most of what is failing on this PR as well.

As of March 20, 2026, main CI is green again:

  • CI run 23341394313

Also, the loopback probe / device-identity fix still does not appear to be present in current main, so this does not look like an already-absorbed fix.

My suggestion: instead of reasoning from the March 16 red run, reapply or rebase the final gateway-only diff onto current green main and rerun CI. That should give a clean signal on the actual gateway change.

References:

@heavenlost

Copy link
Copy Markdown
Contributor

Opened a clean replacement PR based on current green main: #51087

Link: #51087

This carries only the gateway-related diff from this branch and avoids the stale red CI baseline from March 16, 2026.

Targeted local verification on the replacement branch passed:

  • node node_modules/vitest/dist/cli.js run src/gateway/call.test.ts src/gateway/probe.test.ts src/commands/gateway-status/helpers.test.ts
  • 3 files, 74 tests passed

@steipete

Copy link
Copy Markdown
Contributor

Closing this as implemented after Codex review.

Current main already contains the loopback gateway probe/device-identity fixes this PR proposed, with matching regression coverage, so this older PR is superseded by the implementation now in tree.

What I checked:

  • Gateway call path now preserves device identity and falls back safely: resolveDeviceIdentityForGatewayCall() loads device identity for CLI gateway calls and returns null on persistence failure; executeGatewayRequestWithScopes() uses that value by default for operator calls. (src/gateway/call.ts:196, 56fe2aab9c87)
  • Probe path keeps authenticated detail probes device-bound: probeGateway() now loads device identity for authenticated probes, keeps unauthenticated loopback probes at null, and falls back to null if identity persistence fails before constructing GatewayClient. (src/gateway/probe.ts:148, 56fe2aab9c87)
  • Loopback probe budgets are already widened on main: resolveProbeBudgetMs() treats explicit loopback URLs as loopback targets and lets active/discovered loopback probes honor the full caller budget, while preserving the short cap only for inactive local loopback and the existing remote/SSH caps. (src/commands/gateway-status/helpers.ts:121, 56fe2aab9c87)
  • Current tests cover the reported regressions: Main includes tests for local shared-token call identity, probe identity fallback, and explicit loopback budgeting. (src/gateway/call.test.ts:306, 56fe2aab9c87)

So I’m closing this as already implemented rather than keeping a duplicate issue open.

Review notes: reviewed against 56fe2aab9c87.

@steipete steipete closed this Apr 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

commands Command implementations gateway Gateway runtime size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CLI local gateway detail probes drop device auth on localhost and localLoopback probe budget is too low

4 participants