Onboarding: add opt-in rescue watchdog for self-healing gateway recovery by shichangs · Pull Request #40658 · openclaw/openclaw

shichangs · 2026-03-09T06:01:49Z

AI-assisted: Yes (Codex). Fully tested locally.

Summary

Problem: onboarding had no one-click way to provision an isolated rescue gateway profile that could monitor and repair the primary local profile.
Why it matters: if the main gateway goes down, IM-driven self-recovery is unavailable unless a second independently managed profile exists.
What changed: added interactive and non-interactive rescue-watchdog onboarding, isolated rescue profile setup, rescue cron provisioning, and docs/tests for the new flow.
What did NOT change (scope boundary): no changes to remote onboarding, no channel credential cloning into rescue, and no changes to the existing gateway service/install model outside this opt-in flow.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Related Feature Request: Self-healing gateway with restart countdown, config backup & crash recovery #31480

User-visible / Behavior Changes

openclaw onboard can now offer an opt-in rescue watchdog during local onboarding.
openclaw onboard --rescue-watchdog provisions a second isolated rescue profile, managed gateway service, and rescue cron job.
Non-interactive onboarding JSON output now includes rescueWatchdog details when that option is enabled.
Re-running onboarding preserves rescue-only auth profiles instead of overwriting them.
Interactive onboarding now correctly prompts for rescue watchdog when the flag is not explicitly passed.

Security Impact (required)

New permissions/capabilities? (Yes/No): Yes
Secrets/tokens handling changed? (Yes/No): Yes
New/changed network calls? (Yes/No): Yes
Command/tool execution surface changed? (Yes/No): Yes
Data access scope changed? (Yes/No): No
If any Yes, explain risk + mitigation:
This is an explicit opt-in feature that provisions a second local gateway profile with its own token, workspace, service, and cron job. The rescue profile does not clone channel/web config from the primary profile, does not deliver messages externally, and runs isolated cron checks against the primary profile only. Rescue auth-profile syncing now preserves rescue-local credentials while merging inherited primary credentials.

Repro + Verification

Environment

OS: macOS
Runtime/container: Node 22 / pnpm workspace
Model/provider: N/A
Integration/channel (if any): N/A
Relevant config (redacted): local onboarding config only

Steps

Run openclaw onboard without passing --rescue-watchdog.
Confirm the wizard offers the rescue watchdog prompt, then enable it or run openclaw onboard --non-interactive --rescue-watchdog ....
Inspect the generated rescue profile config/service/cron and re-run onboarding to verify rescue auth profiles are preserved.

Expected

Interactive onboarding offers the rescue prompt by default.
Rescue setup writes to an isolated rescue profile/state/config path.
Existing rescue-only auth profiles remain present after re-running onboarding.

Actual

Verified by automated tests and local repo checks; behavior now matches the expected outcomes above.

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

All rescue-related tests (onboard-rescue.test.ts, onboard-rescue.setup.test.ts, rescue-watchdog.test.ts, watchdog-shared.test.ts) pass. Unrelated test mock stabilization fixes are included.

Human Verification (required)

What you personally verified (not just CI), and how:

Verified scenarios:
- --rescue-watchdog is forwarded in CLI mode and remains undefined when not explicitly passed.
- Interactive onboarding prompts for rescue watchdog by default.
- Rescue setup uses isolated rescue profile env/state/config paths.
- Rescue auth store persistence keeps rescue-only credentials while inheriting primary credentials.
Edge cases checked:
- Rescue profile names ending in -rescue remain unsupported.
- Re-running onboarding does not copy primary cron settings into a fresh rescue profile.
- Rescue install errors preserve the original cause.
What you did not verify:
- Manual launchd/systemd end-to-end behavior on a real long-running host.
- Real IM-channel recovery against a live production gateway.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? (Yes/No): Yes
Config/env changes? (Yes/No): Yes (new rescueWatchdog config section, opt-in only)
Migration needed? (Yes/No): No

Failure Recovery (if this breaks)

How to disable/revert this change quickly:
Re-run onboarding without rescue, or remove the rescue profile/service and its cron job.
Files/config to restore:
Rescue profile state dir (e.g. ~/.openclaw-rescue*) and its managed gateway service definition.
Known bad symptoms reviewers should watch for:
Rescue config being written into the primary profile path, missing rescue prompt in interactive onboarding, or rescue auth profiles disappearing after onboarding reruns.

Risks and Mitigations

Risk: Rescue setup could accidentally write into the active primary profile state/config when onboarding runs under a non-default profile.
- Mitigation: Rescue env construction clears inherited profile-derived env vars before applying the rescue profile.
Risk: Re-running onboarding could erase rescue-local credentials.
- Mitigation: Rescue auth syncing loads from the rescue agent dir so existing rescue credentials are preserved and merged.
Risk: Interactive onboarding could silently skip the rescue prompt.
- Mitigation: CLI flag forwarding preserves undefined for the unset case, and regression tests cover the prompt path.
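
The last mitigation relies on keeping three flag states apart: explicitly enabled, explicitly disabled, and not passed at all. A minimal sketch of that decision follows; `RescueDecision` and `resolveRescueDecision` are hypothetical names rather than identifiers from this PR, and treating an unset flag in non-interactive mode as "skip" is an assumption based on the feature being opt-in.

```typescript
// Hypothetical sketch: names are illustrative, not taken from the PR.
type RescueDecision = "enable" | "skip" | "prompt";

// `flag` is undefined when --rescue-watchdog was not passed at all.
function resolveRescueDecision(
  flag: boolean | undefined,
  interactive: boolean,
): RescueDecision {
  if (flag === true) return "enable";
  if (flag === false) return "skip";
  // Unset: only an interactive wizard can ask. Assumed: automation never blocks on a prompt.
  return interactive ? "prompt" : "skip";
}
```

Coercing the flag with `Boolean(flag)` before this point would collapse the unset case into `false`, which is how a prompt can be skipped silently.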

aisle-research-bot · 2026-03-09T06:01:56Z

🔒 Aisle Security Analysis

We found 3 potential security issue(s) in this PR:

#	Severity	Title
1	🟠 High	Rescue watchdog onboarding can leak rescue gateway token via weak loopback port ownership verification (TOCTOU/port hijack)
2	🟡 Medium	TOCTOU symlink swap allows LaunchAgent plist write outside intended directory (parent directory not protected)
3	🟡 Medium	Symlink-following, non-atomic write of auth-profiles.json can overwrite arbitrary files and risk secret exposure

1. 🟠 Rescue watchdog onboarding can leak rescue gateway token via weak loopback port ownership verification (TOCTOU/port hijack)

Property	Value
Severity	High
CWE	CWE-287
Location	`src/commands/onboard-rescue.ts:342-366`

Description

setupRescueWatchdog() provisions a rescue gateway token and later connects to ws://127.0.0.1:<rescuePort> using that token. The intended protection is waitForRescueGatewayIdentity(), but the identity check can be satisfied by a non-OpenClaw process in some environments.

Key issues:

TOCTOU port allocation race: allocateLoopbackPort() selects an ephemeral port by binding and then closing it; another local process can bind that port before the managed gateway successfully starts.
Weak “gateway identity” heuristic: waitForRescueGatewayIdentity() treats a WebSocket close code 1008 with an auth-like reason as “looks like a gateway”. A malicious WebSocket server can trivially mimic this.
Unsafe fallback when ownership metadata is unavailable: if service.readRuntime() returns unknown/no PID and inspectPortUsage() can’t identify the listener process (e.g., missing lsof/insufficient privileges → listeners.length === 0), the function will accept the probe result without confirming the listener belongs to the managed service.
After this acceptance, onboarding calls waitForGatewayReachable() / callGateway() with the rescue token, disclosing it to whatever process is actually bound to that loopback port.

Impact:

A local attacker (another user on the same host) can potentially capture the rescue gateway token and then use it to access the rescue gateway API.
Because the rescue profile config copies auth, secrets, models, etc., token compromise may enable exfiltration of sensitive configuration/secrets via gateway methods.

Vulnerable code:

const ownershipMetadataUnavailable =
  runtime.status !== "stopped" &&
  (runtime.status === "unknown" || runtime.pid == null) &&
  (portUsage.status !== "busy" || portUsage.listeners.length === 0);

const probeLooksLikeGateway =
  probe?.ok === true || looksLikeAuthClose(probe?.close?.code, probe?.close?.reason);

if (probeLooksLikeGateway && (listenerOwnedByRuntime || ownershipMetadataUnavailable)) {
  return;
}

Recommendation

Tighten the port ownership check so the onboarding flow never sends the rescue token unless the listener is strongly attributable to the managed service.

Recommended changes:

Remove the unauthenticated “auth-close (1008)” heuristic as an identity signal. Only accept probe.ok === true and only after verifying listener ownership.
Require verifiable listener ownership:
- Require inspectPortUsage() to return at least one listener with a PID.
- Require that PID to match the supervisor-reported PID (or a child PID) when available.
- If runtime PID is unavailable, fail closed (ask the user to install lsof/run with needed privileges) rather than proceeding.
Consider reducing TOCTOU:
- Avoid “allocate then release” where possible (e.g., choose a deterministic port and check it’s free right before starting; or retry with a new port if bind fails).

Example safer logic (fail closed when you cannot attribute the port):

const probe = await probeGateway({ url: wsUrl, timeoutMs: 1_000 }).catch(() => null);
if (!probe?.ok) {
  await sleep(POLL);
  continue;
}

if (
  runtime.status !== "running" ||
  typeof runtime.pid !== "number" ||
  portUsage.status !== "busy" ||
  portUsage.listeners.length === 0 ||
  !portUsage.listeners.some(l => l.pid === runtime.pid || l.ppid === runtime.pid)
) {
  throw new Error("Refusing to send rescue token: could not prove managed service owns the port");
}

This ensures the onboarding process does not disclose credentials to an untrusted local listener.

2. 🟡 TOCTOU symlink swap allows LaunchAgent plist write outside intended directory (parent directory not protected)

Property	Value
Severity	Medium
CWE	CWE-367
Location	`src/daemon/launchd.ts:146-196`

Description

The new ensureSecureDirectory() and writeLaunchAgentPlistSecure() add some symlink checks and use O_NOFOLLOW for the temporary file, but there is still a time-of-check/time-of-use gap and symlink traversal via parent directories:

installLaunchAgent() checks/creates the LaunchAgents directory via ensureSecureDirectory(path.dirname(plistPath)).
Later, writeLaunchAgentPlistSecure() computes dir = path.dirname(plistPath) and opens tempPath with O_NOFOLLOW.
O_NOFOLLOW only protects the final path component (tempPath) and does not prevent following a symlink in any parent component (e.g., if ~/Library/LaunchAgents is swapped to a symlink after ensureSecureDirectory returns).
Because writeLaunchAgentPlistSecure() only calls lstat() on plistPath/tempPath, and lstat() declines to follow only the final path component while still resolving symlinks in parent components, it does not detect a symlinked parent directory.

Impact (local attack): if an attacker can rename/replace ~/Library/LaunchAgents (possible when ~/Library is group/world-writable, and this patch also removed previous attempts to tighten home/~/Library), they can race-replace LaunchAgents with a symlink to an attacker-controlled directory. The installer will then write the plist into that other location and proceed to call launchctl bootstrap ... plistPath, potentially enabling code execution as the victim user.

Vulnerable code (symlink protection is not anchored to a directory file descriptor; parent components can be swapped):

await ensureSecureDirectory(path.dirname(plistPath));
...
await writeLaunchAgentPlistSecure(plistPath, plist);

const dir = path.dirname(plistPath);
const tempPath = path.join(dir, `.${path.basename(plistPath)}.${process.pid}.tmp`);
const handle = await fs.open(tempPath, O_NOFOLLOW | ...);

Recommendation

Anchor operations to a verified, non-symlink directory and reduce TOCTOU windows.

Practical hardening options (best to combine):

Re-secure parent directories before use (regression fix): reintroduce permission tightening for home and ~/Library (or at least verify they are not group/world-writable) so other users cannot rename/swap LaunchAgents.
Re-check the directory immediately before file creation and before bootstrap, and fail closed if it is a symlink or is group/other-writable.
If possible in your runtime/Node version, use OS primitives that avoid path re-resolution (preferred): open the target directory with O_DIRECTORY | O_NOFOLLOW, then create/write the file relative to that directory using openat-style APIs (or a native helper) so symlink swapping in parent components cannot redirect the write.

Example of additional directory verification (defense-in-depth):

async function assertSecureDir(dir: string) {
  const st = await fs.lstat(dir);
  if (!st.isDirectory() || st.isSymbolicLink()) throw new Error(`Insecure dir: ${dir}`);
  const mode = st.mode & 0o777;
  if ((mode & 0o022) !== 0) throw new Error(`Dir is writable by group/other: ${dir}`);
}

await assertSecureDir(path.dirname(plistPath));
await writeLaunchAgentPlistSecure(plistPath, plist);
await assertSecureDir(path.dirname(plistPath)); // again before launchctl bootstrap

This does not fully eliminate TOCTOU without openat, but it significantly raises the bar and restores the previous protection against group/world-writable parent directories.

3. 🟡 Symlink-following, non-atomic write of auth-profiles.json can overwrite arbitrary files and risk secret exposure

Property	Value
Severity	Medium
CWE	CWE-59
Location	`src/infra/json-file.ts:16-22`

Description

setupRescueWatchdog() (new rescue onboarding flow) copies API keys/tokens into the rescue profile via saveAuthProfileStore(...). That eventually persists credentials using saveJsonFile(), which performs a plain writeFileSync() to the target path.

This is unsafe because saveJsonFile():

does not guard against symlinks (no lstat/O_NOFOLLOW), so a pre-existing auth-profiles.json symlink can redirect the write to an arbitrary file (CWE-59)
is not atomic (direct truncation+write), so a crash/kill can leave partially-written JSON and potentially break auth loading
creates the file with default permissions first, then chmod(0o600) after; if the parent directory is accessible to other local users (e.g., state dir overridden to a shared path), there is a brief window where secrets may be written with umask-derived permissions (commonly 0644)

This becomes higher impact with rescue watchdog because it intentionally mirrors credentials into the rescue agent directory.

Vulnerable code:

// src/infra/json-file.ts
fs.writeFileSync(pathname, `${JSON.stringify(data, null, 2)}\n`, "utf8");
fs.chmodSync(pathname, 0o600);

Call chain in the rescue flow:

src/commands/onboard-rescue.ts calls saveAuthProfileStore(..., rescueAgentDir)
src/agents/auth-profiles/store.ts calls saveJsonFile(authPath, payload)
src/infra/json-file.ts performs the unsafe write

Recommendation

Harden credential/config writes against symlink attacks and make them atomic.

Recommended changes:

Write atomically: write to a temp file in the same directory with mode 0o600, fsync, then rename.
Refuse symlinks: verify the destination is not a symlink via lstat, and open temp files with O_NOFOLLOW where supported.
Set permissions at creation time (not after), to avoid any transient world-readable window.

Example secure implementation (POSIX-focused; best-effort on Windows):

import { constants as fsConstants } from "node:fs";
import fs from "node:fs/promises";
import path from "node:path";

export async function saveJsonFileAtomicNoSymlink(pathname: string, data: unknown) {
  const dir = path.dirname(pathname);
  await fs.mkdir(dir, { recursive: true, mode: 0o700 });

  // Refuse writing through an existing symlink
  try {
    const st = await fs.lstat(pathname);
    if (st.isSymbolicLink()) throw new Error("refusing to write to symlink");
  } catch (e: any) {
    if (e?.code !== "ENOENT") throw e;
  }

  const tmp = path.join(dir, `.${path.basename(pathname)}.${process.pid}.tmp`);
  const nofollow = typeof fsConstants.O_NOFOLLOW === "number" ? fsConstants.O_NOFOLLOW : 0;
  const fh = await fs.open(tmp, fsConstants.O_WRONLY | fsConstants.O_CREAT | fsConstants.O_EXCL | nofollow, 0o600);
  try {
    await fh.writeFile(`${JSON.stringify(data, null, 2)}\n`, { encoding: "utf8" });
    await fh.sync();
  } finally {
    await fh.close();
  }
  await fs.rename(tmp, pathname);
}

Then update saveAuthProfileStore() to use this hardened helper for auth-profiles.json writes.

Analyzed PR: #40658 at commit 78a057e

Last updated on: 2026-03-12T13:54:27Z

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 813e755556

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

greptile-apps · 2026-03-09T06:08:12Z

Greptile Summary

This PR adds opt-in rescue watchdog onboarding — a second isolated gateway profile that monitors and auto-restarts the primary profile. The implementation is well-scoped, correctly isolates the rescue profile env/state from the primary, preserves rescue-only credentials on re-runs, and includes solid unit + integration test coverage.

Key findings:

Security: Math.random() is used to generate the rescue gateway auth token (resolveRescueGatewayToken). Math.random() is not a CSPRNG; the resulting token has ~53 bits of entropy sourced from a predictable RNG. A cryptographic source (crypto.randomBytes or crypto.randomUUID) should be used instead.
Robustness: syncRescueAuthProfiles falls back to an empty string when OPENCLAW_STATE_DIR is absent from the rescue env, which silently creates agents/main/agent relative to the process working directory. A guard/assertion would be safer.
UX: The interactive rescue watchdog confirm prompt fires unconditionally in quickstart flow (whenever no explicit flag is passed), breaking the quickstart contract of minimal user interaction.
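
To illustrate the token finding, here is a minimal sketch of generating a token from a cryptographic source with Node's `crypto.randomBytes`. The helper name `generateGatewayToken` and the 32-byte length are assumptions for illustration, not the repository's actual helper.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical helper name and default length; not taken from the PR.
function generateGatewayToken(byteLength = 32): string {
  // randomBytes draws from the OS CSPRNG; base64url keeps the token URL- and header-safe.
  return randomBytes(byteLength).toString("base64url");
}
```

With 32 random bytes the token carries 256 bits of entropy, compared with the roughly 53 bits noted above for a `Math.random()` value.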

Confidence Score: 3/5

The PR is safe to merge with the token generation security issue addressed first.
The core logic is sound and tests are thorough, but the use of Math.random() for an auth token that protects a network-accessible gateway service is a meaningful security concern that should be fixed before merging. The other two findings are lower-severity (a defensive coding gap and a minor UX inconsistency in quickstart flow).
src/commands/onboard-rescue.ts — token generation and OPENCLAW_STATE_DIR fallback; src/wizard/onboarding.finalize.ts — quickstart flow rescue prompt.

Last reviewed commit: 813e755

shichangs · 2026-03-09T06:24:51Z

Addressed the review feedback in follow-up commit e0e57e7f2.

What changed:

Switched rescue gateway token generation to the existing CSPRNG-backed randomToken() helper instead of Math.random().
Added an explicit guard for unresolved rescue OPENCLAW_STATE_DIR before touching the rescue auth store path.
Kept QuickStart minimal by only showing the interactive rescue prompt in non-QuickStart flows unless --rescue-watchdog is passed explicitly.
Added regression coverage for the QuickStart prompt behavior and kept the setup test covering isolated rescue env/state paths.
Updated docs to call out the QuickStart behavior.
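
The state-dir guard can be sketched roughly as follows. This is an illustration only: `resolveRescueAgentDir` is a hypothetical name, and the `agents/main/agent` suffix is taken from the review comment rather than from the code.

```typescript
import path from "node:path";

// Hypothetical helper: resolves the rescue auth store directory or fails loudly.
function resolveRescueAgentDir(env: Record<string, string | undefined>): string {
  const stateDir = env["OPENCLAW_STATE_DIR"]?.trim();
  if (!stateDir) {
    // Without this guard, path.join("", "agents", ...) yields a path relative to the cwd.
    throw new Error("Rescue OPENCLAW_STATE_DIR is not set; refusing to derive the auth store path");
  }
  return path.join(stateDir, "agents", "main", "agent");
}
```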

Local verification after the follow-up:

pnpm test src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.setup.test.ts src/cli/program/register.onboard.test.ts src/wizard/onboarding.finalize.test.ts
pnpm check

A new push is up now, so CI should rerun on the updated commits.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e0e57e7f2a

shichangs · 2026-03-09T06:41:56Z

Addressed the remaining onboarding/rescue review items in follow-up commit 93b132d78.

What changed:

Reconcile existing rescue services against the current install plan and reinstall them when the managed command drifts, instead of only restarting an outdated service.
Gate non-interactive rescue enablement before side effects: rescue profiles no longer force daemon install, and Linux runs without user systemd now skip rescue setup cleanly.
Added regression coverage for both non-interactive rescue gating and rescue-service reinstall behavior.
Re-ran detect-secrets on the full PR file set; .secrets.baseline stays clean on the current head.

Local verification on this head:

pnpm test src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.setup.test.ts src/commands/onboard-non-interactive/local.test.ts src/cli/program/register.onboard.test.ts src/wizard/onboarding.finalize.test.ts
pnpm check
/usr/bin/python3 -m pre_commit run detect-secrets --files docs/cli/onboard.md docs/reference/wizard.md docs/start/wizard-cli-automation.md docs/start/wizard-cli-reference.md docs/start/wizard.md src/cli/program/register.onboard.test.ts src/cli/program/register.onboard.ts src/commands/onboard-non-interactive/local.ts src/commands/onboard-non-interactive/local/output.ts src/commands/onboard-rescue.setup.test.ts src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.ts src/commands/onboard-types.ts src/wizard/onboarding.finalize.test.ts src/wizard/onboarding.finalize.ts

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93b132d78e

shichangs · 2026-03-09T08:28:16Z

Follow-up is in 27f3d63.

This addresses the remaining review items on this PR:

non-interactive --rescue-watchdog now exits non-zero if setupRescueWatchdog() fails instead of returning success
interactive onboarding now skips rescue provisioning when the primary managed service install failed earlier in the same run
rescue setup now rejects invalid monitored profile names before deriving profile/state paths
rescue setup now builds its service env from a narrow allowlist, so ambient proxy env does not get copied into the rescue service definition
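
The allowlist approach in the last item might look like the sketch below. The key list and the name `buildServiceEnv` are hypothetical; the keys the PR actually allows are not shown in this thread.

```typescript
// Hypothetical allowlist; the real keys used by the PR are not shown in this thread.
const SERVICE_ENV_ALLOWLIST = ["PATH", "HOME", "OPENCLAW_PROFILE", "OPENCLAW_STATE_DIR"] as const;

function buildServiceEnv(ambient: Record<string, string | undefined>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key of SERVICE_ENV_ALLOWLIST) {
    const value = ambient[key];
    // Copy only allowlisted keys that are actually set; everything else is dropped.
    if (typeof value === "string" && value.length > 0) out[key] = value;
  }
  return out;
}
```

Because the function starts from an empty object, variables such as `HTTPS_PROXY` never reach the service definition unless they are added to the list deliberately.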

Validation:

pnpm test src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.setup.test.ts src/commands/onboard-non-interactive/local.test.ts src/commands/onboard-non-interactive/local.run.test.ts src/wizard/onboarding.finalize.test.ts
pnpm tsgo

Local pnpm check is still blocked here by an existing unrelated type mismatch in ui/vitest.config.ts (duplicate Vitest browser provider types in node_modules), not by these rescue changes.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27f3d6376e

shichangs · 2026-03-09T11:56:42Z

Follow-up is in 1a2d69e.

This closes out the remaining onboarding/CI issues on the current head:

non-interactive rescue setup now aborts before setupRescueWatchdog() if the primary managed service install failed
installGatewayDaemonNonInteractive() now returns success/failure so automation can gate rescue provisioning correctly
macOS launchd install paths now tighten directory/file modes and normalize LaunchAgents/Trash paths with POSIX joins, matching the current main expectations that were failing in Windows merge CI
detect-secrets now excludes Sparkle sparkle:edSignature appcast metadata so the secrets job no longer flags release signatures as credentials

Validation on this head:

pnpm test src/daemon/launchd.test.ts src/commands/onboard-non-interactive/local/daemon-install.test.ts src/commands/onboard-non-interactive/local.run.test.ts src/commands/onboard-non-interactive/local.test.ts src/wizard/onboarding.finalize.test.ts
pnpm tsgo
/usr/bin/python3 -m pre_commit run detect-secrets --files .detect-secrets.cfg .pre-commit-config.yaml .secrets.baseline appcast.xml

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a2d69e8dd

shichangs · 2026-03-09T14:42:19Z

Follow-up is in 1807d4e.

This fixes the remaining long-profile collision in rescue onboarding:

short monitored profile names still keep the existing <profile>-rescue mapping
when a monitored profile must be truncated, the derived rescue profile now appends a stable SHA-256 suffix before -rescue, so different valid profiles that share the same long prefix no longer overwrite each other's rescue config/state
added regression coverage for two long valid profile names with the same 57-character prefix to confirm they resolve to distinct rescue profiles and stay within the profile-name limit
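
A rough sketch of the truncation scheme described above. The 64-character limit, the 8-character hex suffix, and the name `deriveRescueProfileName` are assumptions for illustration; the thread does not state the real values.

```typescript
import { createHash } from "node:crypto";

// Assumed values for illustration; the real limit and suffix length are not stated here.
const MAX_PROFILE_NAME_LENGTH = 64;
const RESCUE_SUFFIX = "-rescue";
const HASH_LENGTH = 8;

function deriveRescueProfileName(profile: string): string {
  const short = `${profile}${RESCUE_SUFFIX}`;
  // Short names keep the existing <profile>-rescue mapping unchanged.
  if (short.length <= MAX_PROFILE_NAME_LENGTH) return short;

  // Hash the full name so profiles sharing a long prefix still map to distinct rescue names.
  const digest = createHash("sha256").update(profile).digest("hex").slice(0, HASH_LENGTH);
  const prefixLength = MAX_PROFILE_NAME_LENGTH - RESCUE_SUFFIX.length - HASH_LENGTH - 1;
  return `${profile.slice(0, prefixLength)}-${digest}${RESCUE_SUFFIX}`;
}
```

The suffix is derived from the full untruncated name, so it is stable across reruns and does not depend on which characters were cut.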

Validation on this head:

pnpm test src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.setup.test.ts
pnpm tsgo

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1c455b7524

shichangs · 2026-03-09T14:59:00Z

Follow-up is in 22cc27d.

This fixes the two remaining rescue onboarding regressions on the current head:

rescue auth sync now preserves rescue-only credentials while letting the primary profile override stale duplicate profile IDs on onboarding reruns, so rotated primary keys propagate into the rescue store
rescue config now carries config.env into fresh rescue profiles and keeps rescue-only env keys on reruns while letting the primary profile refresh overlapping env values
added regression coverage for both the stale-auth rerun case and env inheritance/merge behavior
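
The merge rule for auth profiles (and, analogously, env keys) can be sketched as an ordered object spread. The credential shape and the name `mergeRescueAuthProfiles` are simplified, hypothetical stand-ins; the real auth-profiles.json schema is not shown in this thread.

```typescript
// Hypothetical, simplified credential shape for illustration only.
type AuthProfile = { provider: string; key: string };
type AuthProfiles = Record<string, AuthProfile>;

function mergeRescueAuthProfiles(rescue: AuthProfiles, primary: AuthProfiles): AuthProfiles {
  // Spread order matters: rescue-only ids survive, and primary wins on duplicate ids.
  return { ...rescue, ...primary };
}
```

Reversing the spread order would reproduce the stale-auth regression, since an old rescue copy would shadow a rotated primary key.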

Validation on this head:

pnpm test src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.setup.test.ts
pnpm tsgo

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22cc27d532

shichangs · 2026-03-09T15:47:00Z

Follow-up is in b1a4d4249.

This tightens the remaining rescue/launchd hardening on the current head:

rescue onboarding now refuses to overwrite an existing rescue / <profile>-rescue profile unless that profile is already marked as an OpenClaw-managed rescue watchdog for the same monitored profile
rescue setup allocates an ephemeral loopback port for fresh rescue profiles, preserves an existing managed rescue port on reruns, and proves listener ownership before sending the rescue gateway token
rescue watchdog state now carries an explicit wizard.rescueWatchdog ownership marker plus a dedicated least-privilege rescue-watchdog agent id for cron runs
rescue cron jobs now pin agentId, keep allowUnsafeExternalContent: false, and use the dedicated watchdog agent instead of the default agent
launchd install now rejects symlinked LaunchAgents paths and writes plist files via O_NOFOLLOW + atomic rename
config schema metadata/validation now includes the new wizard.rescueWatchdog marker fields
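
The overwrite rule in the first item could be expressed along these lines. The marker fields (`managed`, `monitoredProfile`) and the function name are hypothetical; the thread only says that a `wizard.rescueWatchdog` ownership marker exists, not what it contains.

```typescript
// Hypothetical marker shape; the actual wizard.rescueWatchdog fields are not listed in this thread.
type RescueMarker = { managed: boolean; monitoredProfile: string };
type ProfileConfig = { wizard?: { rescueWatchdog?: RescueMarker } };

function canProvisionRescueProfile(
  existing: ProfileConfig | undefined,
  monitoredProfile: string,
): boolean {
  // No existing profile: nothing can be overwritten.
  if (existing === undefined) return true;
  const marker = existing.wizard?.rescueWatchdog;
  // Reuse only a profile that this tool created for the same monitored profile.
  return marker?.managed === true && marker.monitoredProfile === monitoredProfile;
}
```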

Validation on this head:

pnpm test src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.setup.test.ts src/daemon/launchd.test.ts src/config/schema.help.quality.test.ts
pnpm tsgo
pnpm check
pnpm build

shichangs · 2026-03-09T16:27:02Z

Follow-up in 05cd77e:

Replaced the rescue watchdog cron payload with a deterministic rescueWatchdog runner, so recovery no longer goes through agentTurn prompt execution.
Tightened the rescue agent sandbox to allow: [] + deny: ["*"]; the cron job now runs fixed probe/restart/doctor logic and only invokes exact argv for openclaw --profile <profile> doctor --repair --non-interactive.
Added cron/schema/store normalization for the new payload kind and covered the deterministic runner with unit tests.
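
As an illustration of "exact argv", the repair command can be built as a fixed array instead of a shell string. The flags come from the comment above; the function name and the profile-name pattern are assumptions.

```typescript
// Assumed profile-name pattern; the real validation rules are not shown in this thread.
const PROFILE_NAME_PATTERN = /^[A-Za-z0-9][A-Za-z0-9_-]*$/;

// Hypothetical helper: returns the argument vector for the doctor repair run.
function buildDoctorArgv(profile: string): string[] {
  if (!PROFILE_NAME_PATTERN.test(profile)) {
    throw new Error(`Invalid profile name: ${JSON.stringify(profile)}`);
  }
  // Fixed positions and flags; the profile is the only variable element.
  return ["--profile", profile, "doctor", "--repair", "--non-interactive"];
}
```

Passing this array to an API such as Node's `child_process.execFile`, which does not spawn a shell by default, keeps the profile value from being interpreted as shell syntax.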

Local verification:

pnpm test src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.setup.test.ts src/cron/normalize.test.ts src/cron/service.jobs.test.ts src/cron/rescue-watchdog.test.ts
pnpm build
pnpm lint

pnpm check still stops in the repo's existing ui/vitest.config.ts Vitest type mismatch during pnpm tsgo; this change does not introduce a new local type-check failure outside that pre-existing blocker.

shichangs · 2026-03-09T16:36:14Z

Follow-up in 413b857:

Fail closed in non-interactive onboarding when OPENCLAW_PROFILE is invalid, before rescue planning can force daemon install.
Use the process homedir for macOS LaunchAgent plist/trash paths instead of trusting HOME overrides during service management.
Tighten LaunchAgent plist permissions from 0644 to 0600 and stop chmod-tightening the parent ~/Library path.
Added regression coverage for invalid rescue profile rejection and conflicting HOME overrides in launchd install.

Local verification on this head:

pnpm test src/daemon/launchd.test.ts src/commands/onboard-non-interactive/local.test.ts src/commands/onboard-non-interactive/local.run.test.ts
pnpm lint
pnpm build
pnpm check

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 413b85725e

shichangs · 2026-03-09T17:33:21Z

Follow-up in e8c8286:

The deterministic rescue watchdog now caps doctor --repair --non-interactive to the cron job's remaining timeout budget instead of a fixed 15s+ floor.
If restart probing already consumed the configured budget, the watchdog now skips spawning doctor entirely and returns an error, preventing overlapping background repair processes.
Added a regression test covering the exhausted-budget path.
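
The budgeting rule above can be sketched as a small pure function. This is an illustration under assumed names (planDoctorRun, DoctorPlan), not the watchdog's actual implementation.

```typescript
type DoctorPlan =
  | { kind: "run"; timeoutMs: number }
  | { kind: "skip"; reason: string };

// Hypothetical planner: cap the repair step to whatever is left of the cron job's
// timeout budget, and refuse to spawn it at all once the budget is used up.
function planDoctorRun(jobTimeoutMs: number, elapsedMs: number): DoctorPlan {
  const remainingMs = jobTimeoutMs - elapsedMs;
  if (remainingMs <= 0) {
    // The caller would surface this as an error instead of starting a repair
    // process that could outlive the job and overlap with the next run.
    return { kind: "skip", reason: "timeout budget exhausted before repair" };
  }
  return { kind: "run", timeoutMs: remainingMs };
}

// Restart probing used 52s of a 60s job: the repair gets the remaining 8s.
const partialBudget = planDoctorRun(60_000, 52_000);
// Probing consumed the whole budget: no repair process is spawned.
const exhaustedBudget = planDoctorRun(60_000, 60_000);
```

Returning a tagged result rather than a bare number keeps the "do not spawn" case explicit, which is the path the new regression test targets.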

Local verification:

pnpm test src/cron/rescue-watchdog.test.ts
pnpm lint

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e8c8286195

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 621cb9eb23

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f441a685c

shichangs · 2026-03-11T18:11:21Z

Follow-up pushed in 1ebacdb.

What changed:

Stabilized the "caps per-probe timeout to the remaining probe budget" test in src/cron/rescue-watchdog.test.ts by removing the brittle mock.calls[1] assumption.
The test now forces a bounded post-restart probe path and asserts against observed probe timeout values without assuming a fixed call index.
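
A rough sketch of that assertion style, using a hand-rolled recorder instead of the real Vitest mocks. The runProbes loop and its parameters are hypothetical stand-ins for the watchdog's probe path.

```typescript
// Hypothetical probe loop: each probe gets min(per-probe timeout, remaining budget).
function runProbes(
  budgetMs: number,
  perProbeMs: number,
  probe: (timeoutMs: number) => number, // returns the time the probe consumed
): void {
  let remainingMs = budgetMs;
  while (remainingMs > 0) {
    const timeoutMs = Math.min(perProbeMs, remainingMs);
    remainingMs -= probe(timeoutMs);
  }
}

// Record every timeout the probe was called with instead of reading a fixed call index.
const observedTimeouts: number[] = [];
runProbes(2_500, 1_000, (timeoutMs) => {
  observedTimeouts.push(timeoutMs);
  return timeoutMs; // simulate a probe that always runs to its timeout
});

// Index-independent checks: no probe exceeds its cap, and the total stays within budget.
const allWithinCap = observedTimeouts.every((t) => t > 0 && t <= 1_000);
const totalProbeMs = observedTimeouts.reduce((sum, t) => sum + t, 0);
```

Asserting over the whole set of observed values keeps the test valid if an extra probe is added before or after the restart.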

Validation:

pnpm test src/cron/rescue-watchdog.test.ts ✅
pnpm build ✅
pnpm check ✅
pnpm test ⚠️ fails in this sandbox due to unrelated environment constraints (listen EPERM for local socket-binding tests and transient DNS ENOTFOUND in remote-provider tests), not due to this test change.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1ebacdbad8

shichangs · 2026-03-11T19:40:41Z

All review threads have been resolved and CI is green. This PR adds an opt-in rescue watchdog for self-healing gateway recovery during onboarding.

Would appreciate a review when you have a moment — cc @joshavant @tyler6204

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7e4d128950

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 706d1a44aa

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0772829039

Extend config/cron/daemon type definitions with rescue watchdog fields. Add watchdog-shared module for building isolated rescue profile environments. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…emd, and schtasks Support installing, starting, and stopping isolated rescue gateway services across macOS (launchd), Linux (systemd), and Windows (Task Scheduler). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Implement the rescue watchdog cron job that probes the primary gateway health and triggers doctor-based recovery when the primary is unreachable. Includes timeout budgeting, probe URL normalization, and job validation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Wire rescue watchdog setup into both interactive and non-interactive onboarding paths. Add --rescue-watchdog CLI flag, rescue profile provisioning, daemon service installation, and wizard UX integration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Document the --rescue-watchdog flag, wizard flow, and automation examples. Fix unrelated test mock typings exposed during development. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

shichangs · 2026-03-12T04:57:59Z

Commit history cleanup: squashed 44 incremental commits into 5 logical commits:

feat(rescue): add rescue watchdog core types and shared utilities
feat(daemon): add rescue profile service management for launchd, systemd, and schtasks
feat(cron): add rescue watchdog job engine
feat(onboard): add opt-in rescue watchdog onboarding flow
docs(rescue): add rescue watchdog documentation and stabilize test mocks

Also updated the PR description to fill in all required template sections (Human Verification, Compatibility, Failure Recovery, Risks). This is the final force push for history cleanup — incremental commits only going forward.

Ready for review.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c645a2316b

…off) Integrate main's kickstart-based restart with detached handoff logic while preserving the rescue watchdog signal parameter threading. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d5227c9ad

chatgpt-codex-connector · 2026-03-12T06:01:05Z

+  if (rescueWatchdogEnabled && !primaryManagedServiceReady) {
+    await prompter.note(
+      "Rescue watchdog requires a healthy primary managed service. Gateway service install failed during onboarding, so rescue watchdog was skipped.",
+      "Rescue watchdog",
+    );


Fail onboarding when explicit rescue setup gets skipped

When --rescue-watchdog is explicitly requested, this branch only shows a note and continues after primary managed-service install fails, so interactive onboarding can exit successfully without provisioning the watchdog the user asked for. This is inconsistent with the explicit-error behavior already used for other rescue setup failures and with the non-interactive path, and it can silently break automation/checklists that treat command success as confirmation that rescue protection is active.
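
One way to express the suggested behavior is to separate how the watchdog was requested from whether it can be provisioned. The sketch below uses hypothetical names (RescueRequest, resolveRescueStep) and assumes a soft skip stays acceptable when the user only opted in through the prompt.

```typescript
type RescueRequest = "explicit-flag" | "prompt-accepted" | "not-requested";

// Hypothetical decision helper; the names are illustrative, not the PR's API.
function resolveRescueStep(
  request: RescueRequest,
  primaryManagedServiceReady: boolean,
): "provision" | "skip" {
  if (request === "not-requested") return "skip";
  if (primaryManagedServiceReady) return "provision";
  if (request === "explicit-flag") {
    // The user asked for rescue protection on the command line; exiting
    // successfully here would let automation assume the watchdog is active.
    throw new Error(
      "Rescue watchdog setup failed: primary gateway service install did not succeed.",
    );
  }
  return "skip"; // assumption: a note plus soft skip is fine for prompt opt-ins
}
```

With this shape, a failed primary service install combined with --rescue-watchdog ends in a thrown error, which the caller can turn into a non-zero exit.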

chatgpt-codex-connector · 2026-03-12T06:01:05Z

+  try {
+    return io.loadConfig();
+  } catch (error) {
+    throw new Error(
+      `Rescue watchdog setup failed: existing rescue profile config at "${configPath}" could not be loaded: ${error instanceof Error ? error.message : String(error)}`,


Handle non-ENOENT errors when checking rescue config

This catch turns every fs.access failure into "config missing", so permission/I/O errors (for example EACCES) are treated the same as ENOENT. In that case onboarding skips ownership validation and proceeds as if no rescue profile exists, which can lead to overwriting an existing unmanaged rescue profile or producing confusing follow-on failures instead of failing fast with the real access error.
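
A minimal sketch of the fail-fast check described here, using the synchronous fs.accessSync for brevity (the PR's code may well be async). The helper name rescueConfigExists and the config file name are hypothetical.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: report "missing" only for ENOENT and rethrow everything else,
// so permission or I/O errors are not mistaken for an absent rescue profile.
function rescueConfigExists(configPath: string): boolean {
  try {
    fs.accessSync(configPath, fs.constants.R_OK);
    return true;
  } catch (error) {
    const code = (error as NodeJS.ErrnoException).code;
    if (code === "ENOENT") return false;
    throw new Error(
      `Rescue watchdog setup failed: cannot access "${configPath}": ${
        error instanceof Error ? error.message : String(error)
      }`,
    );
  }
}

// Usage against a throwaway directory (the file name is illustrative).
const sketchDir = fs.mkdtempSync(path.join(os.tmpdir(), "rescue-sketch-"));
const sketchConfig = path.join(sketchDir, "config.json");
const existsBefore = rescueConfigExists(sketchConfig); // the file has not been written yet
fs.writeFileSync(sketchConfig, "{}");
const existsAfter = rescueConfigExists(sketchConfig); // the file is now readable
```

Only the ENOENT branch lets onboarding continue as a fresh setup; any other error code stops before ownership validation can be skipped.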

shichangs · 2026-03-12T12:43:51Z

Addressed two remaining review items in 78a057e:

explicit --rescue-watchdog now fails onboarding when primary managed-service install fails (instead of soft-skipping).
loadExistingRescueConfig now only treats ENOENT as missing; non-ENOENT access errors fail closed.

Also added regression tests in:

src/wizard/onboarding.finalize.test.ts
src/commands/onboard-rescue.setup.test.ts

Validation in this sandbox is blocked by dependency install DNS failures (ENOTFOUND registry.npmjs.org), so full pnpm build && pnpm check && pnpm test could not complete.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 78a057ea6e

chatgpt-codex-connector · 2026-03-12T12:54:11Z

+      env[key] = value;
+    }
+  }
+  applyCliProfileEnv({ profile, env });


Avoid dev port override in cross-profile watchdog env

buildRescueProfileEnv drops service-identity overrides for cross-profile runs, but then always calls applyCliProfileEnv; for target profile dev, that helper reintroduces OPENCLAW_GATEWAY_PORT=19001 when the variable is absent. In rescue watchdog runs monitoring a dev profile with a non-default configured gateway.port, resolveGatewayPort(cfg, env) will now prefer this injected env value and probe the wrong port, which can misclassify a healthy gateway as down and trigger unnecessary restart/repair actions.
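
The precedence problem can be reproduced with two small stand-ins. Everything below is a hypothetical sketch: applyProfileEnvSketch and resolvePortSketch only mimic the behavior described for applyCliProfileEnv and resolveGatewayPort, and the injectDevPort switch is one possible fix, not necessarily the one adopted.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical stand-in for the profile helper: it injects the dev default port
// (19001, as quoted in the review) only when the caller allows it.
function applyProfileEnvSketch(env: Env, profile: string, injectDevPort: boolean): Env {
  const next: Env = { ...env, OPENCLAW_PROFILE: profile };
  if (injectDevPort && profile === "dev" && next["OPENCLAW_GATEWAY_PORT"] === undefined) {
    next["OPENCLAW_GATEWAY_PORT"] = "19001";
  }
  return next;
}

// Hypothetical resolver mirroring the described precedence: env first, then config.
function resolvePortSketch(configPort: number | undefined, env: Env, fallbackPort: number): number {
  const fromEnv = Number(env["OPENCLAW_GATEWAY_PORT"]);
  if (Number.isInteger(fromEnv) && fromEnv > 0) return fromEnv;
  return configPort ?? fallbackPort;
}

const configuredPort = 19555; // illustrative non-default gateway.port for the dev profile
const fallbackPort = 18000; // illustrative fallback, not the project's real default

// Same-profile behavior: the injected default shadows the configured port.
const shadowedPort = resolvePortSketch(configuredPort, applyProfileEnvSketch({}, "dev", true), fallbackPort);
// Cross-profile watchdog run: skipping the injection lets the configured port win.
const probedPort = resolvePortSketch(configuredPort, applyProfileEnvSketch({}, "dev", false), fallbackPort);
```

In the first call the watchdog would probe 19001 and could report a healthy gateway on 19555 as down; in the second it probes the configured port.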

openclaw-barnacle Bot added the docs (Improvements or additions to documentation), cli (CLI command changes), commands (Command implementations), and size: L labels Mar 9, 2026

chatgpt-codex-connector Bot reviewed Mar 9, 2026

Comment thread src/commands/onboard-rescue.ts

greptile-apps Bot reviewed Mar 9, 2026

Comment thread src/commands/onboard-rescue.ts

Comment thread src/wizard/onboarding.finalize.ts

Comment thread src/commands/onboard-rescue.ts

chatgpt-codex-connector Bot reviewed Mar 9, 2026

Comment thread src/commands/onboard-non-interactive/local.ts Outdated

Comment thread src/commands/onboard-non-interactive/local.ts Outdated

openclaw-barnacle Bot added size: XL and removed size: L labels Mar 9, 2026

chatgpt-codex-connector Bot reviewed Mar 9, 2026

Comment thread src/commands/onboard-non-interactive/local.ts Outdated

Comment thread src/wizard/onboarding.finalize.ts Outdated

chatgpt-codex-connector Bot reviewed Mar 9, 2026

Comment thread src/commands/onboard-non-interactive/local.ts

openclaw-barnacle Bot added the gateway (Gateway runtime) label Mar 9, 2026

chatgpt-codex-connector Bot reviewed Mar 9, 2026

Comment thread src/commands/onboard-rescue.ts Outdated

chatgpt-codex-connector Bot reviewed Mar 9, 2026

Comment thread src/commands/onboard-rescue.ts Outdated

Comment thread src/commands/onboard-rescue.ts

chatgpt-codex-connector Bot reviewed Mar 9, 2026

Comment thread src/commands/onboard-rescue.ts Outdated

chatgpt-codex-connector Bot reviewed Mar 9, 2026

Comment thread src/cron/rescue-watchdog.ts Outdated

chatgpt-codex-connector Bot reviewed Mar 9, 2026

Comment thread src/rescue/watchdog-shared.ts

Comment thread src/cron/rescue-watchdog.ts

openclaw-barnacle Bot added the app: web-ui label Mar 9, 2026

chatgpt-codex-connector Bot reviewed Mar 11, 2026

Comment thread src/commands/onboard-rescue.ts Outdated

shichangs force-pushed the codex/onboard-rescue-watchdog branch from 621cb9e to f59be9e Compare March 11, 2026 16:57

chatgpt-codex-connector Bot reviewed Mar 11, 2026

Comment thread src/cron/rescue-watchdog.ts Outdated

chatgpt-codex-connector Bot reviewed Mar 11, 2026

Comment thread src/cron/rescue-watchdog.ts

shichangs force-pushed the codex/onboard-rescue-watchdog branch from e9f43ae to 5d17041 Compare March 11, 2026 19:34

shichangs force-pushed the codex/onboard-rescue-watchdog branch from 5d17041 to 7e4d128 Compare March 11, 2026 20:09

chatgpt-codex-connector Bot reviewed Mar 11, 2026

Comment thread src/wizard/onboarding.finalize.ts

shichangs force-pushed the codex/onboard-rescue-watchdog branch from a2bb059 to 706d1a4 Compare March 11, 2026 21:13

chatgpt-codex-connector Bot reviewed Mar 11, 2026

Comment thread src/cron/rescue-watchdog.ts Outdated

shichangs force-pushed the codex/onboard-rescue-watchdog branch from 706d1a4 to b08b7e2 Compare March 11, 2026 22:08

chatgpt-codex-connector Bot reviewed Mar 11, 2026

Comment thread src/cron/rescue-watchdog.ts

SC-Claw and others added 5 commits March 12, 2026 12:55

feat(rescue): add rescue watchdog core types and shared utilities

d313ad8

Extend config/cron/daemon type definitions with rescue watchdog fields. Add watchdog-shared module for building isolated rescue profile environments. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs(rescue): add rescue watchdog documentation and stabilize test mocks

c645a23

Document the --rescue-watchdog flag, wizard flow, and automation examples. Fix unrelated test mock typings exposed during development. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

shichangs force-pushed the codex/onboard-rescue-watchdog branch from 2d57ccd to c645a23 Compare March 12, 2026 04:57

chatgpt-codex-connector Bot reviewed Mar 12, 2026

Comment thread src/commands/onboard-rescue.ts

SC-Claw and others added 2 commits March 12, 2026 13:58

merge: resolve conflicts with main (kickstart restart + detached hand…

0e7bfc6

…off) Integrate main's kickstart-based restart with detached handoff logic while preserving the rescue watchdog signal parameter threading. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Rescue: reject stopped runtime metadata fallback

65804e6

shichangs force-pushed the codex/onboard-rescue-watchdog branch from 5d5227c to 65804e6 Compare March 12, 2026 05:59

chatgpt-codex-connector Bot reviewed Mar 12, 2026

Onboarding: fail explicit rescue skips and fail closed on config access

78a057e

chatgpt-codex-connector Bot reviewed Mar 12, 2026

shichangs mentioned this pull request Mar 12, 2026

Onboarding: add opt-in rescue watchdog for self-healing gateway recovery #44113

Closed

20 tasks

shichangs closed this Mar 12, 2026

shichangs commented Mar 9, 2026 • edited

Summary

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

User-visible / Behavior Changes

Security Impact (required)

Repro + Verification

Environment

Steps

Expected

Actual

Evidence

Human Verification (required)

Review Conversations

Compatibility / Migration

Failure Recovery (if this breaks)

Risks and Mitigations

aisle-research-bot Bot commented Mar 9, 2026 • edited

🔒 Aisle Security Analysis

1. 🟠 Rescue watchdog onboarding can leak rescue gateway token via weak loopback port ownership verification (TOCTOU/port hijack)

Description

Recommendation

2. 🟡 TOCTOU symlink swap allows LaunchAgent plist write outside intended directory (parent directory not protected)

Description

Recommendation

3. 🟡 Symlink-following, non-atomic write of auth-profiles.json can overwrite arbitrary files and risk secret exposure

Description

Recommendation

chatgpt-codex-connector Bot left a comment

💡 Codex Review

greptile-apps Bot commented Mar 9, 2026

Greptile Summary

Confidence Score: 3/5
