
Onboarding: add opt-in rescue watchdog for self-healing gateway recovery #40658

Closed
shichangs wants to merge 8 commits into openclaw:main from shichangs:codex/onboard-rescue-watchdog


@shichangs
Contributor

@shichangs shichangs commented Mar 9, 2026

AI-assisted: Yes (Codex). Fully tested locally.

Summary

  • Problem: onboarding had no one-click way to provision an isolated rescue gateway profile that could monitor and repair the primary local profile.
  • Why it matters: if the main gateway goes down, IM-driven self-recovery is unavailable unless a second independently managed profile exists.
  • What changed: added interactive and non-interactive rescue-watchdog onboarding, isolated rescue profile setup, rescue cron provisioning, and docs/tests for the new flow.
  • What did NOT change (scope boundary): no changes to remote onboarding, no channel credential cloning into rescue, and no changes to the existing gateway service/install model outside this opt-in flow.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • openclaw onboard can now offer an opt-in rescue watchdog during local onboarding.
  • openclaw onboard --rescue-watchdog provisions a second isolated rescue profile, managed gateway service, and rescue cron job.
  • Non-interactive onboarding JSON output now includes rescueWatchdog details when that option is enabled.
  • Re-running onboarding preserves rescue-only auth profiles instead of overwriting them.
  • Interactive onboarding now correctly prompts for rescue watchdog when the flag is not explicitly passed.

Security Impact (required)

  • New permissions/capabilities? (Yes/No): Yes
  • Secrets/tokens handling changed? (Yes/No): Yes
  • New/changed network calls? (Yes/No): Yes
  • Command/tool execution surface changed? (Yes/No): Yes
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation:
    This is an explicit opt-in feature that provisions a second local gateway profile with its own token, workspace, service, and cron job. The rescue profile does not clone channel/web config from the primary profile, does not deliver messages externally, and runs isolated cron checks against the primary profile only. Rescue auth-profile syncing now preserves rescue-local credentials while merging inherited primary credentials.

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 22 / pnpm workspace
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): local onboarding config only

Steps

  1. Run openclaw onboard without passing --rescue-watchdog.
  2. Confirm the wizard offers the rescue watchdog prompt, then enable it or run openclaw onboard --non-interactive --rescue-watchdog ....
  3. Inspect the generated rescue profile config/service/cron and re-run onboarding to verify rescue auth profiles are preserved.

Expected

  • Interactive onboarding offers the rescue prompt by default.
  • Rescue setup writes to an isolated rescue profile/state/config path.
  • Existing rescue-only auth profiles remain present after re-running onboarding.

Actual

  • Verified by automated tests and local repo checks; behavior now matches the expected outcomes above.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

All rescue-related tests (onboard-rescue.test.ts, onboard-rescue.setup.test.ts, rescue-watchdog.test.ts, watchdog-shared.test.ts) pass. Unrelated test mock stabilization fixes are included.

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • --rescue-watchdog is forwarded in CLI mode and remains undefined when not explicitly passed.
    • Interactive onboarding prompts for rescue watchdog by default.
    • Rescue setup uses isolated rescue profile env/state/config paths.
    • Rescue auth store persistence keeps rescue-only credentials while inheriting primary credentials.
  • Edge cases checked:
    • Rescue profile names ending in -rescue remain unsupported.
    • Re-running onboarding does not copy primary cron settings into a fresh rescue profile.
    • Rescue install errors preserve the original cause.
  • What you did not verify:
    • Manual launchd/systemd end-to-end behavior on a real long-running host.
    • Real IM-channel recovery against a live production gateway.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): Yes (new rescueWatchdog config section, opt-in only)
  • Migration needed? (Yes/No): No

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly:
    Re-run onboarding without rescue, or remove the rescue profile/service and its cron job.
  • Files/config to restore:
    Rescue profile state dir (e.g. ~/.openclaw-rescue*) and its managed gateway service definition.
  • Known bad symptoms reviewers should watch for:
    Rescue config being written into the primary profile path, missing rescue prompt in interactive onboarding, or rescue auth profiles disappearing after onboarding reruns.

Risks and Mitigations

  • Risk: Rescue setup could accidentally write into the active primary profile state/config when onboarding runs under a non-default profile.
    • Mitigation: Rescue env construction clears inherited profile-derived env vars before applying the rescue profile.
  • Risk: Re-running onboarding could erase rescue-local credentials.
    • Mitigation: Rescue auth syncing loads from the rescue agent dir so existing rescue credentials are preserved and merged.
  • Risk: Interactive onboarding could silently skip the rescue prompt.
    • Mitigation: CLI flag forwarding preserves undefined for the unset case, and regression tests cover the prompt path.
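
The last mitigation depends on the flag being tri-state. A minimal sketch of that idea follows, assuming hypothetical names (`resolveRescueDecision`, `coerceFlag`) that are not taken from the PR, and assuming an explicit `false` means "skip" (the PR text does not say how the CLI would express that):

```typescript
// Hypothetical illustration; names and the "false means skip" rule are assumptions.
type RescueDecision = "enable" | "skip" | "prompt";

// Map the forwarded CLI flag to what onboarding should do next.
function resolveRescueDecision(flag: boolean | undefined): RescueDecision {
  if (flag === true) return "enable"; // flag passed explicitly
  if (flag === false) return "skip"; // explicitly disabled (assumed semantics)
  return "prompt"; // unset: let the interactive wizard ask
}

// A forwarding layer that coerces the flag loses the "unset" state,
// which is the kind of bug that would silently skip the prompt.
function coerceFlag(flag: boolean | undefined): boolean {
  return Boolean(flag);
}
```

With the coercing version, an unset flag becomes `false` and the wizard never asks; forwarding `undefined` unchanged keeps the prompt path reachable.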

@aisle-research-bot

aisle-research-bot Bot commented Mar 9, 2026

🔒 Aisle Security Analysis

We found 3 potential security issue(s) in this PR:

# | Severity  | Title
1 | 🟠 High   | Rescue watchdog onboarding can leak rescue gateway token via weak loopback port ownership verification (TOCTOU/port hijack)
2 | 🟡 Medium | TOCTOU symlink swap allows LaunchAgent plist write outside intended directory (parent directory not protected)
3 | 🟡 Medium | Symlink-following, non-atomic write of auth-profiles.json can overwrite arbitrary files and risk secret exposure

1. 🟠 Rescue watchdog onboarding can leak rescue gateway token via weak loopback port ownership verification (TOCTOU/port hijack)

Property | Value
Severity | High
CWE      | CWE-287
Location | src/commands/onboard-rescue.ts:342-366

Description

setupRescueWatchdog() provisions a rescue gateway token and later connects to ws://127.0.0.1:<rescuePort> using that token. The intended protection is waitForRescueGatewayIdentity(), but the identity check can be satisfied by a non-OpenClaw process in some environments.

Key issues:

  • TOCTOU port allocation race: allocateLoopbackPort() selects an ephemeral port by binding and then closing it; another local process can bind that port before the managed gateway successfully starts.
  • Weak “gateway identity” heuristic: waitForRescueGatewayIdentity() treats a WebSocket close code 1008 with an auth-like reason as “looks like a gateway”. A malicious WebSocket server can trivially mimic this.
  • Unsafe fallback when ownership metadata is unavailable: if service.readRuntime() returns unknown/no PID and inspectPortUsage() can’t identify the listener process (e.g., missing lsof/insufficient privileges → listeners.length === 0), the function will accept the probe result without confirming the listener belongs to the managed service.
  • After this acceptance, onboarding calls waitForGatewayReachable() / callGateway() with the rescue token, disclosing it to whatever process is actually bound to that loopback port.

Impact:

  • A local attacker (another user on the same host) can potentially capture the rescue gateway token and then use it to access the rescue gateway API.
  • Because the rescue profile config copies auth, secrets, models, etc., token compromise may enable exfiltration of sensitive configuration/secrets via gateway methods.

Vulnerable code:

const ownershipMetadataUnavailable =
  runtime.status !== "stopped" &&
  (runtime.status === "unknown" || runtime.pid == null) &&
  (portUsage.status !== "busy" || portUsage.listeners.length === 0);

const probeLooksLikeGateway =
  probe?.ok === true || looksLikeAuthClose(probe?.close?.code, probe?.close?.reason);

if (probeLooksLikeGateway && (listenerOwnedByRuntime || ownershipMetadataUnavailable)) {
  return;
}

Recommendation

Tighten the port ownership check so the onboarding flow never sends the rescue token unless the listener is strongly attributable to the managed service.

Recommended changes:

  1. Remove the unauthenticated “auth-close (1008)” heuristic as an identity signal. Only accept probe.ok === true and only after verifying listener ownership.
  2. Require verifiable listener ownership:
    • Require inspectPortUsage() to return at least one listener with a PID.
    • Require that PID to match the supervisor-reported PID (or a child PID) when available.
    • If runtime PID is unavailable, fail closed (ask the user to install lsof/run with needed privileges) rather than proceeding.
  3. Consider reducing TOCTOU:
    • Avoid “allocate then release” where possible (e.g., choose a deterministic port and check it’s free right before starting; or retry with a new port if bind fails).

Example safer logic (fail closed when you cannot attribute the port):

const probe = await probeGateway({ url: wsUrl, timeoutMs: 1_000 }).catch(() => null);
if (!probe?.ok) {
  await sleep(POLL);
  continue;
}

if (
  runtime.status !== "running" ||
  typeof runtime.pid !== "number" ||
  portUsage.status !== "busy" ||
  portUsage.listeners.length === 0 ||
  !portUsage.listeners.some(l => l.pid === runtime.pid || l.ppid === runtime.pid)
) {
  throw new Error("Refusing to send rescue token: could not prove managed service owns the port");
}

This ensures the onboarding process does not disclose credentials to an untrusted local listener.


2. 🟡 TOCTOU symlink swap allows LaunchAgent plist write outside intended directory (parent directory not protected)

Property | Value
Severity | Medium
CWE      | CWE-367
Location | src/daemon/launchd.ts:146-196

Description

The new ensureSecureDirectory() and writeLaunchAgentPlistSecure() add some symlink checks and use O_NOFOLLOW for the temporary file, but there is still a time-of-check/time-of-use gap and symlink traversal via parent directories:

  • installLaunchAgent() checks/creates the LaunchAgents directory via ensureSecureDirectory(path.dirname(plistPath)).
  • Later, writeLaunchAgentPlistSecure() computes dir = path.dirname(plistPath) and opens tempPath with O_NOFOLLOW.
  • O_NOFOLLOW only protects the final path component (tempPath) and does not prevent following a symlink in any parent component (e.g., if ~/Library/LaunchAgents is swapped to a symlink after ensureSecureDirectory returns).
  • Because writeLaunchAgentPlistSecure() only calls lstat() on plistPath/tempPath (which follows symlinks in parent components during path resolution), it does not detect a symlinked parent directory.

Impact (local attack): if an attacker can rename/replace ~/Library/LaunchAgents (possible when ~/Library is group/world-writable, and this patch also removed previous attempts to tighten home/~/Library), they can race-replace LaunchAgents with a symlink to an attacker-controlled directory. The installer will then write the plist into that other location and proceed to call launchctl bootstrap ... plistPath, potentially enabling code execution as the victim user.

Vulnerable code (symlink protection is not anchored to a directory file descriptor; parent components can be swapped):

await ensureSecureDirectory(path.dirname(plistPath));
...
await writeLaunchAgentPlistSecure(plistPath, plist);
const dir = path.dirname(plistPath);
const tempPath = path.join(dir, `.${path.basename(plistPath)}.${process.pid}.tmp`);
const handle = await fs.open(tempPath, O_NOFOLLOW | ...);

Recommendation

Anchor operations to a verified, non-symlink directory and reduce TOCTOU windows.

Practical hardening options (best to combine):

  1. Re-secure parent directories before use (regression fix): reintroduce permission tightening for home and ~/Library (or at least verify they are not group/world-writable) so other users cannot rename/swap LaunchAgents.

  2. Re-check the directory immediately before file creation and before bootstrap, and fail closed if it is a symlink or is group/other-writable.

  3. If possible in your runtime/Node version, use OS primitives that avoid path re-resolution (preferred): open the target directory with O_DIRECTORY | O_NOFOLLOW, then create/write the file relative to that directory using openat-style APIs (or a native helper) so symlink swapping in parent components cannot redirect the write.

Example of additional directory verification (defense-in-depth):

async function assertSecureDir(dir: string) {
  const st = await fs.lstat(dir);
  if (!st.isDirectory() || st.isSymbolicLink()) throw new Error(`Insecure dir: ${dir}`);
  const mode = st.mode & 0o777;
  if ((mode & 0o022) !== 0) throw new Error(`Dir is writable by group/other: ${dir}`);
}

await assertSecureDir(path.dirname(plistPath));
await writeLaunchAgentPlistSecure(plistPath, plist);
await assertSecureDir(path.dirname(plistPath)); // again before launchctl bootstrap

This does not fully eliminate TOCTOU without openat, but it significantly raises the bar and restores the previous protection against group/world-writable parent directories.


3. 🟡 Symlink-following, non-atomic write of auth-profiles.json can overwrite arbitrary files and risk secret exposure

Property | Value
Severity | Medium
CWE      | CWE-59
Location | src/infra/json-file.ts:16-22

Description

setupRescueWatchdog() (new rescue onboarding flow) copies API keys/tokens into the rescue profile via saveAuthProfileStore(...). That eventually persists credentials using saveJsonFile(), which performs a plain writeFileSync() to the target path.

Because saveJsonFile():

  • does not guard against symlinks (no lstat/O_NOFOLLOW), so a pre-existing auth-profiles.json symlink can redirect the write to an arbitrary file (CWE-59)
  • is not atomic (direct truncation+write), so a crash/kill can leave partially-written JSON and potentially break auth loading
  • creates the file with default permissions first, then chmod(0o600) after; if the parent directory is accessible to other local users (e.g., state dir overridden to a shared path), there is a brief window where secrets may be written with umask-derived permissions (commonly 0644)

This becomes higher impact with rescue watchdog because it intentionally mirrors credentials into the rescue agent directory.

Vulnerable code:

// src/infra/json-file.ts
fs.writeFileSync(pathname, `${JSON.stringify(data, null, 2)}\n`, "utf8");
fs.chmodSync(pathname, 0o600);

Call chain in the rescue flow:

  • src/commands/onboard-rescue.ts calls saveAuthProfileStore(..., rescueAgentDir)
  • src/agents/auth-profiles/store.ts calls saveJsonFile(authPath, payload)
  • src/infra/json-file.ts performs the unsafe write

Recommendation

Harden credential/config writes against symlink attacks and make them atomic.

Recommended changes:

  1. Write atomically: write to a temp file in the same directory with mode 0o600, fsync, then rename.
  2. Refuse symlinks: verify the destination is not a symlink via lstat, and open temp files with O_NOFOLLOW where supported.
  3. Set permissions at creation time (not after), to avoid any transient world-readable window.

Example secure implementation (POSIX-focused; best-effort on Windows):

import { constants as fsConstants } from "node:fs";
import fs from "node:fs/promises";
import path from "node:path";

export async function saveJsonFileAtomicNoSymlink(pathname: string, data: unknown) {
  const dir = path.dirname(pathname);
  await fs.mkdir(dir, { recursive: true, mode: 0o700 });
  // Refuse writing through an existing symlink
  try {
    const st = await fs.lstat(pathname);
    if (st.isSymbolicLink()) throw new Error("refusing to write to symlink");
  } catch (e: any) {
    if (e?.code !== "ENOENT") throw e;
  }

  const tmp = path.join(dir, `.${path.basename(pathname)}.${process.pid}.tmp`);
  const nofollow = typeof fsConstants.O_NOFOLLOW === "number" ? fsConstants.O_NOFOLLOW : 0;
  const fh = await fs.open(tmp, fsConstants.O_WRONLY | fsConstants.O_CREAT | fsConstants.O_EXCL | nofollow, 0o600);
  try {
    await fh.writeFile(`${JSON.stringify(data, null, 2)}\n`, { encoding: "utf8" });
    await fh.sync();
  } finally {
    await fh.close();
  }
  await fs.rename(tmp, pathname);
}

Then update saveAuthProfileStore() to use this hardened helper for auth-profiles.json writes.


Analyzed PR: #40658 at commit 78a057e

Last updated on: 2026-03-12T13:54:27Z

@openclaw-barnacle openclaw-barnacle Bot added the docs, cli, commands, and size: L labels Mar 9, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 813e755556

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/commands/onboard-rescue.ts
@greptile-apps
Contributor

greptile-apps Bot commented Mar 9, 2026

Greptile Summary

This PR adds opt-in rescue watchdog onboarding — a second isolated gateway profile that monitors and auto-restarts the primary profile. The implementation is well-scoped, correctly isolates the rescue profile env/state from the primary, preserves rescue-only credentials on re-runs, and includes solid unit + integration test coverage.

Key findings:

  • Security: Math.random() is used to generate the rescue gateway auth token (resolveRescueGatewayToken). Math.random() is not a CSPRNG; the resulting token has ~53 bits of entropy sourced from a predictable RNG. A cryptographic source (crypto.randomBytes or crypto.randomUUID) should be used instead.
  • Robustness: syncRescueAuthProfiles falls back to an empty string when OPENCLAW_STATE_DIR is absent from the rescue env, which silently creates agents/main/agent relative to the process working directory. A guard/assertion would be safer.
  • UX: The interactive rescue watchdog confirm prompt fires unconditionally in quickstart flow (whenever no explicit flag is passed), breaking the quickstart contract of minimal user interaction.
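
To illustrate the first finding, here is a minimal sketch of token generation from a cryptographic source. The function name `generateGatewayToken` and the 32-byte length are assumptions for illustration; the repository's own helper may differ.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical helper: draws bytes from the OS CSPRNG instead of Math.random().
// 32 bytes gives 256 bits of entropy; base64url keeps the token URL- and header-safe.
function generateGatewayToken(byteLength = 32): string {
  return randomBytes(byteLength).toString("base64url");
}
```

`crypto.randomUUID()` would also come from a CSPRNG, but it carries fewer random bits (122) than a 32-byte token.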

Confidence Score: 3/5

  • The PR is safe to merge once the token generation security issue is addressed.
  • The core logic is sound and tests are thorough, but the use of Math.random() for an auth token that protects a network-accessible gateway service is a meaningful security concern that should be fixed before merging. The other two findings are lower-severity (a defensive coding gap and a minor UX inconsistency in quickstart flow).
  • src/commands/onboard-rescue.ts — token generation and OPENCLAW_STATE_DIR fallback; src/wizard/onboarding.finalize.ts — quickstart flow rescue prompt.

Last reviewed commit: 813e755

Comment thread src/commands/onboard-rescue.ts
Comment thread src/wizard/onboarding.finalize.ts
Comment thread src/commands/onboard-rescue.ts
@shichangs
Contributor Author

Addressed the review feedback in follow-up commit e0e57e7f2.

What changed:

  • Switched rescue gateway token generation to the existing CSPRNG-backed randomToken() helper instead of Math.random().
  • Added an explicit guard for unresolved rescue OPENCLAW_STATE_DIR before touching the rescue auth store path.
  • Kept QuickStart minimal by only showing the interactive rescue prompt in non-QuickStart flows unless --rescue-watchdog is passed explicitly.
  • Added regression coverage for the QuickStart prompt behavior and kept the setup test covering isolated rescue env/state paths.
  • Updated docs to call out the QuickStart behavior.
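
The OPENCLAW_STATE_DIR guard described above could look roughly like the sketch below. The function name `resolveRescueAgentDir` is hypothetical; the `agents/main/agent` suffix is taken from the Greptile finding, and the rest is an assumption about how the path is derived.

```typescript
import * as path from "node:path";

// Hypothetical guard: fail loudly rather than falling back to "" and
// silently resolving agents/main/agent against the process working directory.
function resolveRescueAgentDir(rescueEnv: Record<string, string | undefined>): string {
  const stateDir = rescueEnv.OPENCLAW_STATE_DIR?.trim();
  if (!stateDir) {
    throw new Error(
      "Rescue OPENCLAW_STATE_DIR is not set; refusing to derive the rescue auth store path",
    );
  }
  return path.join(stateDir, "agents", "main", "agent");
}
```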

Local verification after the follow-up:

  • pnpm test src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.setup.test.ts src/cli/program/register.onboard.test.ts src/wizard/onboarding.finalize.test.ts
  • pnpm check

A new push is up now, so CI should rerun on the updated commits.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e0e57e7f2a


Comment thread src/commands/onboard-non-interactive/local.ts Outdated
Comment thread src/commands/onboard-non-interactive/local.ts Outdated
@shichangs
Contributor Author

Addressed the remaining onboarding/rescue review items in follow-up commit 93b132d78.

What changed:

  • Reconcile existing rescue services against the current install plan and reinstall them when the managed command drifts, instead of only restarting an outdated service.
  • Gate non-interactive rescue enablement before side effects: rescue profiles no longer force daemon install, and Linux runs without user systemd now skip rescue setup cleanly.
  • Added regression coverage for both non-interactive rescue gating and rescue-service reinstall behavior.
  • Re-ran detect-secrets on the full PR file set; .secrets.baseline stays clean on the current head.
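
As a rough illustration of the reinstall-on-drift item in the list above, the decision can be reduced to comparing the installed command with the planned one. Everything here is a hypothetical sketch: the names, the action labels, and the assumption that an unchanged service is simply restarted are not confirmed by the PR.

```typescript
// Hypothetical action labels; the PR does not name them.
type ServiceAction = "install" | "reinstall" | "restart";

// Element-wise comparison of two argv arrays.
function argvEqual(a: readonly string[], b: readonly string[]): boolean {
  return a.length === b.length && a.every((arg, i) => arg === b[i]);
}

// installedArgv is null when no rescue service exists yet.
function planRescueServiceAction(
  installedArgv: readonly string[] | null,
  plannedArgv: readonly string[],
): ServiceAction {
  if (installedArgv === null) return "install";
  // A drifted command means the service definition is stale, so restarting
  // it would keep running the outdated command.
  return argvEqual(installedArgv, plannedArgv) ? "restart" : "reinstall";
}
```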

Local verification on this head:

  • pnpm test src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.setup.test.ts src/commands/onboard-non-interactive/local.test.ts src/cli/program/register.onboard.test.ts src/wizard/onboarding.finalize.test.ts
  • pnpm check
  • /usr/bin/python3 -m pre_commit run detect-secrets --files docs/cli/onboard.md docs/reference/wizard.md docs/start/wizard-cli-automation.md docs/start/wizard-cli-reference.md docs/start/wizard.md src/cli/program/register.onboard.test.ts src/cli/program/register.onboard.ts src/commands/onboard-non-interactive/local.ts src/commands/onboard-non-interactive/local/output.ts src/commands/onboard-rescue.setup.test.ts src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.ts src/commands/onboard-types.ts src/wizard/onboarding.finalize.test.ts src/wizard/onboarding.finalize.ts


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93b132d78e


Comment thread src/commands/onboard-non-interactive/local.ts Outdated
Comment thread src/wizard/onboarding.finalize.ts Outdated
@shichangs
Contributor Author

Follow-up is in 27f3d63.

This addresses the remaining review items on this PR:

  • non-interactive --rescue-watchdog now exits non-zero if setupRescueWatchdog() fails instead of returning success
  • interactive onboarding now skips rescue provisioning when the primary managed service install failed earlier in the same run
  • rescue setup now rejects invalid monitored profile names before deriving profile/state paths
  • rescue setup now builds its service env from a narrow allowlist, so ambient proxy env does not get copied into the rescue service definition
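
The allowlist approach in the last item can be sketched as follows. The allowlist contents and the name `buildServiceEnv` are assumptions for illustration; only `OPENCLAW_STATE_DIR` appears in this PR's discussion.

```typescript
// Hypothetical allowlist; the real list in the PR is not shown in this thread.
const RESCUE_SERVICE_ENV_ALLOWLIST = ["HOME", "PATH", "OPENCLAW_STATE_DIR"] as const;

// Copy only allowlisted, non-empty variables into the service definition.
// Anything else in the ambient environment (for example proxy settings) is dropped.
function buildServiceEnv(
  ambient: Record<string, string | undefined>,
  allowlist: readonly string[] = RESCUE_SERVICE_ENV_ALLOWLIST,
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const key of allowlist) {
    const value = ambient[key];
    if (typeof value === "string" && value.length > 0) {
      env[key] = value;
    }
  }
  return env;
}
```

An allowlist fails safe: a new ambient variable is excluded until someone adds it deliberately, whereas a denylist would copy it by default.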

Validation:

  • pnpm test src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.setup.test.ts src/commands/onboard-non-interactive/local.test.ts src/commands/onboard-non-interactive/local.run.test.ts src/wizard/onboarding.finalize.test.ts
  • pnpm tsgo

Local pnpm check is still blocked here by an existing unrelated type mismatch in ui/vitest.config.ts (duplicate Vitest browser provider types in node_modules), not by these rescue changes.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27f3d6376e


Comment thread src/commands/onboard-non-interactive/local.ts
@openclaw-barnacle openclaw-barnacle Bot added the gateway label Mar 9, 2026
@shichangs
Contributor Author

Follow-up is in 1a2d69e.

This closes out the remaining onboarding/CI issues on the current head:

  • non-interactive rescue setup now aborts before setupRescueWatchdog() if the primary managed service install failed
  • installGatewayDaemonNonInteractive() now returns success/failure so automation can gate rescue provisioning correctly
  • macOS launchd install paths now tighten directory/file modes and normalize LaunchAgents/Trash paths with POSIX joins, matching the current main expectations that were failing in Windows merge CI
  • detect-secrets now excludes Sparkle sparkle:edSignature appcast metadata so the secrets job no longer flags release signatures as credentials

Validation on this head:

  • pnpm test src/daemon/launchd.test.ts src/commands/onboard-non-interactive/local/daemon-install.test.ts src/commands/onboard-non-interactive/local.run.test.ts src/commands/onboard-non-interactive/local.test.ts src/wizard/onboarding.finalize.test.ts
  • pnpm tsgo
  • /usr/bin/python3 -m pre_commit run detect-secrets --files .detect-secrets.cfg .pre-commit-config.yaml .secrets.baseline appcast.xml


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a2d69e8dd


Comment thread src/commands/onboard-rescue.ts Outdated
@shichangs
Contributor Author

Follow-up is in 1807d4e.

This fixes the remaining long-profile collision in rescue onboarding:

  • short monitored profile names still keep the existing <profile>-rescue mapping
  • when a monitored profile must be truncated, the derived rescue profile now appends a stable SHA-256 suffix before -rescue, so different valid profiles that share the same long prefix no longer overwrite each other's rescue config/state
  • added regression coverage for two long valid profile names with the same 57-character prefix to confirm they resolve to distinct rescue profiles and stay within the profile-name limit
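
A sketch of the derivation described above, under stated assumptions: the 64-character limit, the 8-character hex suffix, and the name `deriveRescueProfileName` are all hypothetical, since the thread does not give the actual values.

```typescript
import { createHash } from "node:crypto";

const MAX_PROFILE_NAME_LENGTH = 64; // assumption; the real limit is not stated here
const RESCUE_SUFFIX = "-rescue";
const HASH_LENGTH = 8; // assumption; hex characters kept from the SHA-256 digest

function deriveRescueProfileName(monitored: string): string {
  // Short names keep the plain <profile>-rescue mapping.
  const simple = `${monitored}${RESCUE_SUFFIX}`;
  if (simple.length <= MAX_PROFILE_NAME_LENGTH) return simple;

  // Long names: hash the FULL monitored name so that two names sharing the
  // same truncated prefix still map to different rescue profiles.
  const hash = createHash("sha256").update(monitored).digest("hex").slice(0, HASH_LENGTH);
  const prefixLength = MAX_PROFILE_NAME_LENGTH - RESCUE_SUFFIX.length - HASH_LENGTH - 1;
  return `${monitored.slice(0, prefixLength)}-${hash}${RESCUE_SUFFIX}`;
}
```

Because the suffix is computed from the monitored name alone, re-running onboarding resolves to the same rescue profile each time.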

Validation on this head:

  • pnpm test src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.setup.test.ts
  • pnpm tsgo


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1c455b7524


Comment thread src/commands/onboard-rescue.ts Outdated
Comment thread src/commands/onboard-rescue.ts
@shichangs
Contributor Author

Follow-up is in 22cc27d.

This fixes the two remaining rescue onboarding regressions on the current head:

  • rescue auth sync now preserves rescue-only credentials while letting the primary profile override stale duplicate profile IDs on onboarding reruns, so rotated primary keys propagate into the rescue store
  • rescue config now carries config.env into fresh rescue profiles and keeps rescue-only env keys on reruns while letting the primary profile refresh overlapping env values
  • added regression coverage for both the stale-auth rerun case and env inheritance/merge behavior
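
The merge rule in the first two items (rescue-only entries survive, the primary wins on overlapping IDs) can be sketched with an object spread. The `AuthProfile` shape and the function name are hypothetical; the real store format is not shown in this thread.

```typescript
// Hypothetical, simplified store shape keyed by profile ID.
type AuthProfile = { provider: string; key: string };
type AuthStore = Record<string, AuthProfile>;

// Later spreads win, so primary entries override stale duplicates in the
// rescue store, while rescue-only IDs are left untouched.
function mergeRescueAuthProfiles(rescue: AuthStore, primary: AuthStore): AuthStore {
  return { ...rescue, ...primary };
}
```

The same precedence would apply to the `config.env` merge described above: rescue-only keys are kept, overlapping keys are refreshed from the primary profile.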

Validation on this head:

  • pnpm test src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.setup.test.ts
  • pnpm tsgo


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22cc27d532


Comment thread src/commands/onboard-rescue.ts Outdated
@shichangs
Contributor Author

Follow-up is in b1a4d4249.

This tightens the remaining rescue/launchd hardening on the current head:

  • rescue onboarding now refuses to overwrite an existing rescue / <profile>-rescue profile unless that profile is already marked as an OpenClaw-managed rescue watchdog for the same monitored profile
  • rescue setup allocates an ephemeral loopback port for fresh rescue profiles, preserves an existing managed rescue port on reruns, and proves listener ownership before sending the rescue gateway token
  • rescue watchdog state now carries an explicit wizard.rescueWatchdog ownership marker plus a dedicated least-privilege rescue-watchdog agent id for cron runs
  • rescue cron jobs now pin agentId, keep allowUnsafeExternalContent: false, and use the dedicated watchdog agent instead of the default agent
  • launchd install now rejects symlinked LaunchAgents paths and writes plist files via O_NOFOLLOW + atomic rename
  • config schema metadata/validation now includes the new wizard.rescueWatchdog marker fields
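
As a rough sketch of the overwrite guard in the first bullet above: an existing rescue profile is only reused when it carries an ownership marker for the same monitored profile. Only the `wizard.rescueWatchdog` path comes from the comment; the field names `managed` and `monitoredProfile` and the function name are hypothetical.

```typescript
// Hypothetical marker fields; only the wizard.rescueWatchdog path is from the PR text.
interface RescueMarker {
  managed?: boolean;
  monitoredProfile?: string;
}

interface ProfileConfig {
  wizard?: { rescueWatchdog?: RescueMarker };
}

/**
 * Returns true when onboarding may write the rescue profile:
 * either no profile exists yet, or the existing one is marked as a
 * managed rescue watchdog for the same monitored profile.
 */
function canReuseRescueProfile(
  existing: ProfileConfig | undefined,
  monitoredProfile: string,
): boolean {
  if (existing === undefined) return true; // nothing to overwrite
  const marker = existing.wizard?.rescueWatchdog;
  if (!marker) return false; // unmanaged profile: refuse
  return marker.managed === true && marker.monitoredProfile === monitoredProfile;
}

const managedForDefault: ProfileConfig = {
  wizard: { rescueWatchdog: { managed: true, monitoredProfile: "default" } },
};
console.log(canReuseRescueProfile(managedForDefault, "default"));
console.log(canReuseRescueProfile(managedForDefault, "work"));
console.log(canReuseRescueProfile({}, "default"));
```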

Validation on this head:

  • pnpm test src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.setup.test.ts src/daemon/launchd.test.ts src/config/schema.help.quality.test.ts
  • pnpm tsgo
  • pnpm check
  • pnpm build

@shichangs
Contributor Author

Follow-up in 05cd77e:

  • Replaced the rescue watchdog cron payload with a deterministic rescueWatchdog runner, so recovery no longer goes through agentTurn prompt execution.
  • Tightened the rescue agent sandbox to allow: [] + deny: ["*"]; the cron job now runs fixed probe/restart/doctor logic and only invokes exact argv for openclaw --profile <profile> doctor --repair --non-interactive.
  • Added cron/schema/store normalization for the new payload kind and covered the deterministic runner with unit tests.
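
To illustrate what "exact argv" means here, the repair command from the bullet above can be built as a fixed argument array rather than a shell string. This is a sketch under that assumption; `buildDoctorArgv` is a hypothetical helper name, and how the PR actually spawns the process is not shown in this thread.

```typescript
/**
 * Build the fixed argv for the repair command quoted in the comment:
 *   openclaw --profile <profile> doctor --repair --non-interactive
 * The profile is a single array element, so it is never re-parsed by a shell.
 */
function buildDoctorArgv(profile: string): string[] {
  return ["openclaw", "--profile", profile, "doctor", "--repair", "--non-interactive"];
}

const argv = buildDoctorArgv("default");
console.log(argv.join(" "));

// A hostile-looking profile value stays one argument instead of becoming extra commands.
const odd = buildDoctorArgv("x; echo injected");
console.log(odd.length);
```

Passing such an array to a spawn-style API without a shell keeps the command shape fixed regardless of the profile value.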

Local verification:

  • pnpm test src/commands/onboard-rescue.test.ts src/commands/onboard-rescue.setup.test.ts src/cron/normalize.test.ts src/cron/service.jobs.test.ts src/cron/rescue-watchdog.test.ts
  • pnpm build
  • pnpm lint

pnpm check still stops in the repo's existing ui/vitest.config.ts Vitest type mismatch during pnpm tsgo; this change does not introduce a new local type-check failure outside that pre-existing blocker.

@shichangs
Contributor Author

Follow-up in 413b857:

  • Fail closed in non-interactive onboarding when OPENCLAW_PROFILE is invalid, before rescue planning can force daemon install.
  • Use the process homedir for macOS LaunchAgent plist/trash paths instead of trusting HOME overrides during service management.
  • Tighten LaunchAgent plist permissions from 0644 to 0600 and stop chmod-tightening the parent ~/Library path.
  • Added regression coverage for invalid rescue profile rejection and conflicting HOME overrides in launchd install.
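
A minimal sketch of the plist hardening mentioned above (0600 permissions, `O_NOFOLLOW`, atomic rename), assuming a POSIX system where `fs.constants.O_NOFOLLOW` is defined. The helper name and the temp-file naming scheme are illustrative assumptions, not the PR's code.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

/**
 * Write `contents` to `targetPath` without following symlinks:
 * 1. create a sibling temp file with O_EXCL | O_NOFOLLOW and mode 0600
 * 2. write and fsync it
 * 3. rename it over the target (atomic within one filesystem)
 */
function writeFileAtomicNoFollow(targetPath: string, contents: string): void {
  const tmpPath = `${targetPath}.${process.pid}.tmp`; // hypothetical naming scheme
  const flags =
    fs.constants.O_WRONLY |
    fs.constants.O_CREAT |
    fs.constants.O_EXCL |
    fs.constants.O_NOFOLLOW;
  const fd = fs.openSync(tmpPath, flags, 0o600);
  try {
    fs.writeSync(fd, contents);
    fs.fsyncSync(fd);
  } finally {
    fs.closeSync(fd);
  }
  fs.renameSync(tmpPath, targetPath);
}

// Usage against a throwaway directory rather than ~/Library/LaunchAgents.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "plist-sketch-"));
const plistPath = path.join(dir, "example.plist");
writeFileAtomicNoFollow(plistPath, "<plist/>\n");
const mode = fs.statSync(plistPath).mode & 0o777;
console.log(mode.toString(8));
```

The rename replaces the directory entry itself, so a symlink planted at the target path is replaced rather than followed.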

Local verification on this head:

  • pnpm test src/daemon/launchd.test.ts src/commands/onboard-non-interactive/local.test.ts src/commands/onboard-non-interactive/local.run.test.ts
  • pnpm lint
  • pnpm build
  • pnpm check

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 413b85725e

Comment thread src/cron/rescue-watchdog.ts Outdated
@shichangs
Contributor Author

Follow-up in e8c8286:

  • The deterministic rescue watchdog now caps doctor --repair --non-interactive to the cron job's remaining timeout budget instead of a fixed 15s+ floor.
  • If restart probing already consumed the configured budget, the watchdog now skips spawning doctor entirely and returns an error, preventing overlapping background repair processes.
  • Added a regression test covering the exhausted-budget path.
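
The budget rule described above can be sketched as a small pure function: the doctor run gets whatever is left of the job's timeout, and is skipped when nothing is left. Function and type names here are hypothetical, and the millisecond values in the usage lines are made-up examples.

```typescript
type DoctorDecision =
  | { run: true; timeoutMs: number }
  | { run: false; error: string };

/** Time left in the job's budget, never negative. */
function remainingBudgetMs(startedAtMs: number, budgetMs: number, nowMs: number): number {
  return Math.max(0, budgetMs - (nowMs - startedAtMs));
}

/**
 * Decide whether to spawn the repair step.
 * If probing and restarting already used up the budget, return an error
 * instead of starting a process that could outlive the cron job.
 */
function planDoctorRun(startedAtMs: number, budgetMs: number, nowMs: number): DoctorDecision {
  const remaining = remainingBudgetMs(startedAtMs, budgetMs, nowMs);
  if (remaining <= 0) {
    return { run: false, error: "timeout budget exhausted before doctor could start" };
  }
  return { run: true, timeoutMs: remaining };
}

// 60s budget, 45s already spent on probes: doctor is capped at the remaining 15s.
const withBudget = planDoctorRun(0, 60_000, 45_000);
// Budget fully consumed: doctor is not spawned at all.
const exhausted = planDoctorRun(0, 60_000, 60_000);
console.log(withBudget, exhausted);
```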

Local verification:

  • pnpm test src/cron/rescue-watchdog.test.ts
  • pnpm lint

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e8c8286195

Comment thread src/rescue/watchdog-shared.ts
Comment thread src/cron/rescue-watchdog.ts
@openclaw-barnacle openclaw-barnacle Bot added the app: web-ui App: web-ui label Mar 9, 2026
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 621cb9eb23

Comment thread src/commands/onboard-rescue.ts Outdated
@shichangs shichangs force-pushed the codex/onboard-rescue-watchdog branch from 621cb9e to f59be9e Compare March 11, 2026 16:57
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f441a685c

Comment thread src/cron/rescue-watchdog.ts Outdated
@shichangs
Contributor Author

Follow-up pushed in 1ebacdb.

What changed:

  • Stabilized the "caps per-probe timeout to the remaining probe budget" test in src/cron/rescue-watchdog.test.ts by removing the brittle mock.calls[1] assumption.
  • The test now forces a bounded post-restart probe path and asserts against observed probe timeout values without assuming a fixed call index.

Validation:

  • pnpm test src/cron/rescue-watchdog.test.ts
  • pnpm build
  • pnpm check
  • pnpm test ⚠️ fails in this sandbox due to unrelated environment constraints (listen EPERM for local socket-binding tests and transient DNS ENOTFOUND in remote-provider tests), not due to this test change.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1ebacdbad8

Comment thread src/cron/rescue-watchdog.ts
@shichangs shichangs force-pushed the codex/onboard-rescue-watchdog branch from e9f43ae to 5d17041 Compare March 11, 2026 19:34
@shichangs
Contributor Author

All review threads have been resolved and CI is green. This PR adds an opt-in rescue watchdog for self-healing gateway recovery during onboarding.

Would appreciate a review when you have a moment — cc @joshavant @tyler6204

@shichangs shichangs force-pushed the codex/onboard-rescue-watchdog branch from 5d17041 to 7e4d128 Compare March 11, 2026 20:09
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7e4d128950

Comment thread src/wizard/onboarding.finalize.ts
@shichangs shichangs force-pushed the codex/onboard-rescue-watchdog branch from a2bb059 to 706d1a4 Compare March 11, 2026 21:13
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 706d1a44aa

Comment thread src/cron/rescue-watchdog.ts Outdated
@shichangs shichangs force-pushed the codex/onboard-rescue-watchdog branch from 706d1a4 to b08b7e2 Compare March 11, 2026 22:08
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0772829039

Comment thread src/cron/rescue-watchdog.ts
SC-Claw and others added 5 commits March 12, 2026 12:55
Extend config/cron/daemon type definitions with rescue watchdog fields.
Add watchdog-shared module for building isolated rescue profile environments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…emd, and schtasks

Support installing, starting, and stopping isolated rescue gateway services
across macOS (launchd), Linux (systemd), and Windows (Task Scheduler).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement the rescue watchdog cron job that probes the primary gateway
health and triggers doctor-based recovery when the primary is unreachable.
Includes timeout budgeting, probe URL normalization, and job validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wire rescue watchdog setup into both interactive and non-interactive
onboarding paths. Add --rescue-watchdog CLI flag, rescue profile
provisioning, daemon service installation, and wizard UX integration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the --rescue-watchdog flag, wizard flow, and automation examples.
Fix unrelated test mock typings exposed during development.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shichangs shichangs force-pushed the codex/onboard-rescue-watchdog branch from 2d57ccd to c645a23 Compare March 12, 2026 04:57
@shichangs
Contributor Author

Commit history cleanup: squashed 44 incremental commits into 5 logical commits:

  1. feat(rescue): add rescue watchdog core types and shared utilities
  2. feat(daemon): add rescue profile service management for launchd, systemd, and schtasks
  3. feat(cron): add rescue watchdog job engine
  4. feat(onboard): add opt-in rescue watchdog onboarding flow
  5. docs(rescue): add rescue watchdog documentation and stabilize test mocks

Also updated the PR description to fill in all required template sections (Human Verification, Compatibility, Failure Recovery, Risks). This is the final force push for history cleanup — incremental commits only going forward.

Ready for review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c645a2316b

Comment thread src/commands/onboard-rescue.ts
SC-Claw and others added 2 commits March 12, 2026 13:58
…off)

Integrate main's kickstart-based restart with detached handoff logic
while preserving the rescue watchdog signal parameter threading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shichangs shichangs force-pushed the codex/onboard-rescue-watchdog branch from 5d5227c to 65804e6 Compare March 12, 2026 05:59
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d5227c9ad

Comment thread src/wizard/onboarding.finalize.ts Outdated
Comment on lines +291 to +295
    if (rescueWatchdogEnabled && !primaryManagedServiceReady) {
      await prompter.note(
        "Rescue watchdog requires a healthy primary managed service. Gateway service install failed during onboarding, so rescue watchdog was skipped.",
        "Rescue watchdog",
      );

P2 Badge Fail onboarding when explicit rescue setup gets skipped

When --rescue-watchdog is explicitly requested, this branch only shows a note and continues after primary managed-service install fails, so interactive onboarding can exit successfully without provisioning the watchdog the user asked for. This is inconsistent with the explicit-error behavior already used for other rescue setup failures and with the non-interactive path, and it can silently break automation/checklists that treat command success as confirmation that rescue protection is active.

Useful? React with 👍 / 👎.

Comment on lines +499 to +503
    try {
      return io.loadConfig();
    } catch (error) {
      throw new Error(
        `Rescue watchdog setup failed: existing rescue profile config at "${configPath}" could not be loaded: ${error instanceof Error ? error.message : String(error)}`,

P2 Badge Handle non-ENOENT errors when checking rescue config

This catch turns every fs.access failure into "config missing", so permission/I/O errors (for example EACCES) are treated the same as ENOENT. In that case onboarding skips ownership validation and proceeds as if no rescue profile exists, which can lead to overwriting an existing unmanaged rescue profile or producing confusing follow-on failures instead of failing fast with the real access error.

Useful? React with 👍 / 👎.

@shichangs
Contributor Author

Addressed two remaining review items in 78a057e:

  • explicit --rescue-watchdog now fails onboarding when primary managed-service install fails (instead of soft-skipping).
  • loadExistingRescueConfig now only treats ENOENT as missing; non-ENOENT access errors fail closed.
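
The ENOENT-only rule in the second bullet can be sketched as follows. This is an illustration with an injected `access` callback so it runs without touching the filesystem; `isEnoent` and `configExists` are hypothetical names, not the body of `loadExistingRescueConfig`.

```typescript
/** True only for errors carrying the Node-style code "ENOENT". */
function isEnoent(error: unknown): boolean {
  return (
    typeof error === "object" &&
    error !== null &&
    (error as { code?: unknown }).code === "ENOENT"
  );
}

/**
 * Report whether a config file exists.
 * `access` stands in for a filesystem check that throws on failure.
 * Only ENOENT means "missing"; every other error is rethrown (fail closed).
 */
function configExists(access: () => void): boolean {
  try {
    access();
    return true;
  } catch (error) {
    if (isEnoent(error)) return false;
    throw error;
  }
}

// Simulated filesystem outcomes.
const missing = () => {
  throw Object.assign(new Error("no such file"), { code: "ENOENT" });
};
const denied = () => {
  throw Object.assign(new Error("permission denied"), { code: "EACCES" });
};

console.log(configExists(() => {}));
console.log(configExists(missing));
```

With `denied`, the call throws instead of returning false, so an unreadable rescue profile is never mistaken for an absent one.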

Also added regression tests in:

  • src/wizard/onboarding.finalize.test.ts
  • src/commands/onboard-rescue.setup.test.ts

Validation in this sandbox is blocked by dependency install DNS failures (ENOTFOUND registry.npmjs.org), so full pnpm build && pnpm check && pnpm test could not complete.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 78a057ea6e

env[key] = value;
}
}
applyCliProfileEnv({ profile, env });

P2 Badge Avoid dev port override in cross-profile watchdog env

buildRescueProfileEnv drops service-identity overrides for cross-profile runs, but then always calls applyCliProfileEnv; for target profile dev, that helper reintroduces OPENCLAW_GATEWAY_PORT=19001 when the variable is absent. In rescue watchdog runs monitoring a dev profile with a non-default configured gateway.port, resolveGatewayPort(cfg, env) will now prefer this injected env value and probe the wrong port, which can misclassify a healthy gateway as down and trigger unnecessary restart/repair actions.
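
The precedence problem described above can be reproduced with simplified stand-ins. Both functions below are hypothetical approximations written for illustration, not the real `resolveGatewayPort` or `applyCliProfileEnv`; only `OPENCLAW_GATEWAY_PORT`, the `dev` profile, the value 19001, and `gateway.port` come from the review comment, and the configured port 18555 is a made-up example.

```typescript
type Env = Record<string, string | undefined>;

interface Config {
  gateway?: { port?: number };
}

// Stand-in resolver: an env value, when present and valid, wins over config.
function resolvePortSketch(cfg: Config, env: Env, fallback: number): number {
  const fromEnv = Number(env.OPENCLAW_GATEWAY_PORT);
  if (Number.isInteger(fromEnv) && fromEnv > 0) return fromEnv;
  return cfg.gateway?.port ?? fallback;
}

// Stand-in profile helper: injects the dev default port when the variable is absent.
function applyDevDefaultSketch(profile: string, env: Env): Env {
  if (profile === "dev" && env.OPENCLAW_GATEWAY_PORT === undefined) {
    return { ...env, OPENCLAW_GATEWAY_PORT: "19001" };
  }
  return env;
}

const cfg: Config = { gateway: { port: 18555 } }; // non-default configured port
const withoutInjection = resolvePortSketch(cfg, {}, 19001);
const withInjection = resolvePortSketch(cfg, applyDevDefaultSketch("dev", {}), 19001);
console.log(withoutInjection, withInjection);
```

Under these assumptions the injected default shadows the configured port, which is the wrong-port probe the review comment warns about.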

Useful? React with 👍 / 👎.


Labels

agents Agent runtime and tooling app: web-ui App: web-ui cli CLI command changes commands Command implementations docs Improvements or additions to documentation gateway Gateway runtime size: XL


1 participant