fix(gateway): use configured probe auth during restart health checks#57374
fix(gateway): use configured probe auth during restart health checks#57374zssggle-rgb wants to merge 1 commit into
Conversation
Greptile SummaryThis PR fixes a real auth mismatch in gateway restart health checks: One incomplete fix:
Minor:
Confidence Score: 4/5Mostly safe to merge, but one restart code path (unmanaged SIGUSR1 restarts) still uses env-only auth due to probeAuth resolution being hoisted too late. The fix is correct and complete for service-managed restarts and the status path. However, the waitForGatewayHealthyListener call in the restartedWithoutServiceManager branch of lifecycle.ts never receives probeAuth because resolveGatewayRestartProbeAuth() is invoked after that branch returns, leaving the SIGUSR1 unmanaged restart path with the original bug. A one-liner hoist resolves it, so this P1 should be addressed before merging. src/cli/daemon-cli/lifecycle.ts — the waitForGatewayHealthyListener call inside the restartedWithoutServiceManager branch needs probeAuth.
|
| Filename | Overview |
|---|---|
| src/cli/daemon-cli/lifecycle.ts | Adds resolveGatewayRestartProbeAuth and threads probeAuth into service-managed restart health checks, but the unmanaged (SIGUSR1) restart path still calls waitForGatewayHealthyListener without probeAuth. |
| src/cli/daemon-cli/restart-health.ts | Exports GatewayProbeAuth type and correctly plumbs the optional probeAuth parameter through all four exported health-check helpers and their internal helpers. |
| src/cli/daemon-cli/restart-health.test.ts | Adds a focused regression test confirming probeGateway is called with the configured token; covers the managed restart path only. |
| src/cli/daemon-cli/status.gather.ts | One-line change passes the already-resolved daemonProbeAuth into inspectGatewayRestart; straightforward and correct. |
| src/cli/update-cli/update-command.ts | Adds resolveUpdateRestartProbeAuth (duplicate of the helper in lifecycle.ts) and threads probeAuth into both waitForGatewayHealthyRestart call sites in the update restart flow; logic is correct but the helper is a duplication. |
Comments Outside Diff (1)
-
src/cli/daemon-cli/lifecycle.ts, line 196-200 (link)Unmanaged restart path misses
probeAuthThe fix resolves auth from config and threads it through
waitForGatewayHealthyRestartfor service-managed restarts, but thewaitForGatewayHealthyListenercall used whenrestartedWithoutServiceManageris true (SIGUSR1-triggered restarts) is still missingprobeAuth. TheresolveGatewayRestartProbeAuth()call on line 223 is only reached after this early-return branch exits, so unmanaged restarts still probe with env-only auth — the exact bug this PR set out to fix.To cover both paths, hoist the
probeAuthresolution above theif (restartedWithoutServiceManager)guard and pass it intowaitForGatewayHealthyListener:const probeAuth = await resolveGatewayRestartProbeAuth(); if (restartedWithoutServiceManager) { const health = await waitForGatewayHealthyListener({ port: restartPort, attempts: POST_RESTART_HEALTH_ATTEMPTS, delayMs: POST_RESTART_HEALTH_DELAY_MS, probeAuth, }); ... }Prompt To Fix With AI
This is a comment left during a code review. Path: src/cli/daemon-cli/lifecycle.ts Line: 196-200 Comment: **Unmanaged restart path misses `probeAuth`** The fix resolves auth from config and threads it through `waitForGatewayHealthyRestart` for service-managed restarts, but the `waitForGatewayHealthyListener` call used when `restartedWithoutServiceManager` is true (SIGUSR1-triggered restarts) is still missing `probeAuth`. The `resolveGatewayRestartProbeAuth()` call on line 223 is only reached after this early-return branch exits, so unmanaged restarts still probe with env-only auth — the exact bug this PR set out to fix. To cover both paths, hoist the `probeAuth` resolution above the `if (restartedWithoutServiceManager)` guard and pass it into `waitForGatewayHealthyListener`: ``` const probeAuth = await resolveGatewayRestartProbeAuth(); if (restartedWithoutServiceManager) { const health = await waitForGatewayHealthyListener({ port: restartPort, attempts: POST_RESTART_HEALTH_ATTEMPTS, delayMs: POST_RESTART_HEALTH_DELAY_MS, probeAuth, }); ... } ``` How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/lifecycle.ts
Line: 196-200
Comment:
**Unmanaged restart path misses `probeAuth`**
The fix resolves auth from config and threads it through `waitForGatewayHealthyRestart` for service-managed restarts, but the `waitForGatewayHealthyListener` call used when `restartedWithoutServiceManager` is true (SIGUSR1-triggered restarts) is still missing `probeAuth`. The `resolveGatewayRestartProbeAuth()` call on line 223 is only reached after this early-return branch exits, so unmanaged restarts still probe with env-only auth — the exact bug this PR set out to fix.
To cover both paths, hoist the `probeAuth` resolution above the `if (restartedWithoutServiceManager)` guard and pass it into `waitForGatewayHealthyListener`:
```
const probeAuth = await resolveGatewayRestartProbeAuth();
if (restartedWithoutServiceManager) {
const health = await waitForGatewayHealthyListener({
port: restartPort,
attempts: POST_RESTART_HEALTH_ATTEMPTS,
delayMs: POST_RESTART_HEALTH_DELAY_MS,
probeAuth,
});
...
}
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/cli/daemon-cli/lifecycle.ts
Line: 47-65
Comment:
**Duplicated auth resolution helper**
`resolveGatewayRestartProbeAuth` in `lifecycle.ts` is byte-for-byte identical to `resolveUpdateRestartProbeAuth` added in `update-command.ts`. Both functions read the best-effort config, call `resolveGatewayProbeAuthSafeWithSecretInputs`, and merge with env-var fallbacks. Extracting this into a shared utility (e.g. in `probe-auth.ts` or a new `restart-probe-auth.ts`) would avoid the two implementations drifting apart over time.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix: use configured gateway auth during ..." | Re-trigger Greptile
| async function resolveGatewayRestartProbeAuth() { | ||
| const fallbackAuth = { | ||
| token: process.env.OPENCLAW_GATEWAY_TOKEN?.trim() || undefined, | ||
| password: process.env.OPENCLAW_GATEWAY_PASSWORD?.trim() || undefined, | ||
| }; | ||
| const cfg = await readBestEffortConfig().catch(() => undefined); | ||
| if (!cfg) { | ||
| return fallbackAuth; | ||
| } | ||
| const { auth } = await resolveGatewayProbeAuthSafeWithSecretInputs({ | ||
| cfg, | ||
| mode: cfg.gateway?.mode === "remote" ? "remote" : "local", | ||
| env: process.env, | ||
| }); | ||
| return { | ||
| token: auth.token ?? fallbackAuth.token, | ||
| password: auth.password ?? fallbackAuth.password, | ||
| }; | ||
| } |
There was a problem hiding this comment.
Duplicated auth resolution helper
resolveGatewayRestartProbeAuth in lifecycle.ts is byte-for-byte identical to resolveUpdateRestartProbeAuth added in update-command.ts. Both functions read the best-effort config, call resolveGatewayProbeAuthSafeWithSecretInputs, and merge with env-var fallbacks. Extracting this into a shared utility (e.g. in probe-auth.ts or a new restart-probe-auth.ts) would avoid the two implementations drifting apart over time.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/lifecycle.ts
Line: 47-65
Comment:
**Duplicated auth resolution helper**
`resolveGatewayRestartProbeAuth` in `lifecycle.ts` is byte-for-byte identical to `resolveUpdateRestartProbeAuth` added in `update-command.ts`. Both functions read the best-effort config, call `resolveGatewayProbeAuthSafeWithSecretInputs`, and merge with env-var fallbacks. Extracting this into a shared utility (e.g. in `probe-auth.ts` or a new `restart-probe-auth.ts`) would avoid the two implementations drifting apart over time.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
Fixed in 310fc4da88. I extracted the shared local restart-probe auth resolution into src/gateway/probe-auth.ts and switched both lifecycle.ts and update-command.ts to use it, so the restart-health paths no longer duplicate the helper logic.
Also verified with targeted tests, lint, and tsgo.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c0001b0289
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| } | ||
| const { auth } = await resolveGatewayProbeAuthSafeWithSecretInputs({ | ||
| cfg, | ||
| mode: cfg.gateway?.mode === "remote" ? "remote" : "local", |
There was a problem hiding this comment.
Force local auth mode when resolving restart probe credentials
openclaw gateway restart always probes the local daemon (ws://127.0.0.1 in restart-health.ts), but this resolver switches to remote credential selection whenever gateway.mode=remote. In that config, resolveGatewayProbeAuthSafeWithSecretInputs can return gateway.remote.* credentials, which then override OPENCLAW_GATEWAY_TOKEN/PASSWORD and cause false timeout failures even when the local daemon restarted correctly. Restart health checks should resolve credentials in local mode regardless of client routing mode.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 310fc4da88. Restart health checks now resolve probe credentials in local mode regardless of gateway.mode, since the probe always targets the local daemon on 127.0.0.1.
I also kept the env fallback behavior and added coverage for the local-mode resolution path.
| } | ||
| const { auth } = await resolveGatewayProbeAuthSafeWithSecretInputs({ | ||
| cfg, | ||
| mode: cfg.gateway?.mode === "remote" ? "remote" : "local", |
There was a problem hiding this comment.
Use local probe credential mode in update restart verification
The update restart health path reuses the same mode switch and can select gateway.remote.* auth when gateway.mode=remote, even though it verifies a local gateway service restart. This can regress successful restarts into spurious “did not become healthy” warnings for users whose remote credentials differ from local daemon auth. The resolver for post-update local restart checks should be pinned to local credential mode.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 310fc4da88. The post-update restart verification now uses the same shared local restart-probe auth resolver, so it no longer picks gateway.remote.* credentials when validating a local daemon restart.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4dcf855e89
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| @@ -383,6 +383,7 @@ export async function gatherDaemonStatus( | |||
| service, | |||
| port: daemonPort, | |||
| env: serviceEnv, | |||
| probeAuth: daemonProbeAuth, | |||
There was a problem hiding this comment.
Keep status restart probes on local credential mode
gatherDaemonStatus now forwards daemonProbeAuth into inspectGatewayRestart, but that auth is derived from gateway.mode and can resolve gateway.remote.* credentials in remote mode. inspectGatewayRestart performs a local reachability check against ws://127.0.0.1:<port>, so this can inject remote credentials into a local health probe and produce false “unhealthy” results when local and remote auth differ. Split the auth paths so remote-mode credentials are used for the remote RPC probe only, while local restart-health inspection always resolves local probe credentials.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 310fc4da88. I split the auth handling in status.gather.ts so the remote RPC probe keeps using the normal daemon probe auth path, while inspectGatewayRestart(...) now receives a separately resolved local restart-probe auth.
Added a test covering remote mode to verify the local restart-health path no longer reuses remote credentials.
|
Addressed the review feedback in 310fc4da88.
Validation run locally:
|
|
Latest feedback has been addressed in 310fc4da88, checks are green, and the PR is ready to merge. Please merge when convenient. Thanks! |
|
Following up here: the latest review feedback has been addressed and the restart-health auth path is now aligned across the affected restart flows. The current remaining failures appear to be CI noise outside the narrow auth-path change itself, so this should be ready for another maintainer pass when convenient. Thanks. |
|
Following up here: the latest review feedback has been addressed and the restart-health auth path is now aligned across the affected restart flows. Local validation passed after the fix, and the branch should be ready for another maintainer pass when convenient. Thanks. |
310fc4d to
538e108
Compare
|
Refreshed this branch on top of the latest upstream/main and force-pushed the update. This keeps restart-health probe auth pinned to the local gateway auth path across daemon restart, update restart, and status inspection, and includes the focused test coverage for that auth split. Re-ran:
Current tsgo/check is blocked on upstream main by existing conflict markers in src/config/io.write-config.test.ts. |
|
Rebased onto the latest Re-ran:
This should clear the old dirty/rebase state and put the branch back in a mergeable shape. |
538e108 to
1f624f1
Compare
|
Rebased onto the latest Re-ran:
This should clear the old dirty/rebase state and put the branch back in a mergeable shape. |
1f624f1 to
ed84af0
Compare
|
Refreshed this PR onto the latest Validation:
This is back in a mergeable state and ready for another review pass. |
ed84af0 to
c4c17ed
Compare
|
Follow-up: Re-validated after the second rebase:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c4c17ed10a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const { auth } = await resolveGatewayProbeAuthSafeWithSecretInputs({ | ||
| cfg: params.cfg, | ||
| mode: "local", | ||
| env: params.env, |
There was a problem hiding this comment.
Prevent remote token fallback in local restart probe auth
This helper is used to authenticate local restart-health probes, but calling resolveGatewayProbeAuthSafeWithSecretInputs({ mode: "local" }) here can still return gateway.remote.* credentials when local auth is unset, because local-mode credential resolution falls back to the remote token/password. That means users in remote mode who only configured gateway.remote.token can still send remote credentials to the local 127.0.0.1 probe and get false unhealthy/timeouts after restart. Fresh evidence: resolveLocalGatewayCredentials in src/gateway/credentials.ts still falls back to plan.remoteToken.value/plan.remotePassword.value when local values are not configured, so mode: "local" alone does not enforce local-only auth.
Useful? React with 👍 / 👎.
|
CI note: the current red jobs still match the upstream baseline, not this PR diff. Examples from this run:
Focused validation for this PR is still green:
|
Summary
Problem
When local gateway auth is configured in openclaw.json but OPENCLAW_GATEWAY_TOKEN is not exported in the current shell, openclaw gateway restart can report a timeout even though the launchd or service-managed gateway eventually comes up and starts listening.
Gateway status already resolves probe auth from config, but restart health checks were probing reachability with env-only auth. That mismatch caused false negatives during warm-up and token-authenticated restarts.
Fix
Verification