Skip to content

Onboarding: add opt-in rescue watchdog for self-healing gateway recovery#44113

Closed
shichangs wants to merge 8 commits intoopenclaw:mainfrom
shichangs:rescue-watchdog-v2
Closed

Onboarding: add opt-in rescue watchdog for self-healing gateway recovery#44113
shichangs wants to merge 8 commits intoopenclaw:mainfrom
shichangs:rescue-watchdog-v2

Conversation

@shichangs
Copy link
Copy Markdown
Contributor

AI-assisted: Yes (Codex). Fully tested locally.

Summary

  • Problem: onboarding had no one-click way to provision an isolated rescue gateway profile that could monitor and repair the primary local profile.
  • Why it matters: if the main gateway goes down, IM-driven self-recovery is unavailable unless a second independently managed profile exists.
  • What changed: added interactive and non-interactive rescue-watchdog onboarding, isolated rescue profile setup, rescue cron provisioning, and docs/tests for the new flow.
  • What did NOT change (scope boundary): no changes to remote onboarding, no channel credential cloning into rescue, and no changes to the existing gateway service/install model outside this opt-in flow.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • openclaw onboard can now offer an opt-in rescue watchdog during local onboarding.
  • openclaw onboard --rescue-watchdog provisions a second isolated rescue profile, managed gateway service, and rescue cron job.
  • Non-interactive onboarding JSON output now includes rescueWatchdog details when that option is enabled.
  • Re-running onboarding preserves rescue-only auth profiles instead of overwriting them.
  • Interactive onboarding now correctly prompts for rescue watchdog when the flag is not explicitly passed.

Security Impact (required)

  • New permissions/capabilities? (Yes/No): Yes
  • Secrets/tokens handling changed? (Yes/No): Yes
  • New/changed network calls? (Yes/No): Yes
  • Command/tool execution surface changed? (Yes/No): Yes
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation:
    This is an explicit opt-in feature that provisions a second local gateway profile with its own token, workspace, service, and cron job. The rescue profile does not clone channel/web config from the primary profile, does not deliver messages externally, and runs isolated cron checks against the primary profile only. Rescue auth-profile syncing preserves rescue-local credentials while merging inherited primary credentials. Environment variable allowlisting prevents sensitive vars from leaking across profiles. Command execution uses process.execPath directly to avoid PATH injection.

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 22 / pnpm workspace
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): local onboarding config only

Steps

  1. Run openclaw onboard without passing --rescue-watchdog.
  2. Confirm the wizard offers the rescue watchdog prompt, then enable it or run openclaw onboard --non-interactive --rescue-watchdog ....
  3. Inspect the generated rescue profile config/service/cron and re-run onboarding to verify rescue auth profiles are preserved.

Expected

  • Interactive onboarding offers the rescue prompt by default.
  • Rescue setup writes to an isolated rescue profile/state/config path.
  • Existing rescue-only auth profiles remain present after re-running onboarding.

Actual

  • Verified by automated tests and local repo checks; behavior now matches the expected outcomes above.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

All 155 rescue-related tests pass (watchdog-shared, rescue-watchdog, onboard-rescue, onboard-rescue.setup, onboarding.finalize, launchd, service.jobs, normalize, register.onboard, local, local.run, daemon-install).

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • --rescue-watchdog is forwarded in CLI mode and remains undefined when not explicitly passed.
    • Interactive onboarding prompts for rescue watchdog by default.
    • Rescue setup uses isolated rescue profile env/state/config paths.
    • Rescue auth store persistence keeps rescue-only credentials while inheriting primary credentials.
    • runBoundedStep callback correctly returns Promise<void> (P1 fix from self-review).
  • Edge cases checked:
    • Rescue profile names ending in -rescue remain unsupported (no nested rescue).
    • Re-running onboarding does not copy primary cron settings into a fresh rescue profile.
    • Rescue install errors preserve the original cause.
    • Long profile names are truncated with stable hash suffix.
    • Port collision between rescue and primary gateway is handled.
  • What you did not verify:
    • Manual launchd/systemd end-to-end behavior on a real long-running host.
    • Real IM-channel recovery against a live production gateway.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): Yes (new rescueWatchdog config section, opt-in only)
  • Migration needed? (Yes/No): No

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Re-run onboarding without rescue, or remove the rescue profile/service and its cron job.
  • Files/config to restore: Rescue profile state dir (e.g. ~/.openclaw-rescue*) and its managed gateway service definition.
  • Known bad symptoms reviewers should watch for: Rescue config being written into the primary profile path, missing rescue prompt in interactive onboarding, or rescue auth profiles disappearing after onboarding reruns.

Risks and Mitigations

  • Risk: Rescue setup could accidentally write into the active primary profile state/config when onboarding runs under a non-default profile.
    • Mitigation: Rescue env construction clears inherited profile-derived env vars before applying the rescue profile.
  • Risk: Re-running onboarding could erase rescue-local credentials.
    • Mitigation: Rescue auth syncing loads from the rescue agent dir so existing rescue credentials are preserved and merged.
  • Risk: Interactive onboarding could silently skip the rescue prompt.
    • Mitigation: CLI flag forwarding preserves undefined for the unset case, and regression tests cover the prompt path.

@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation app: web-ui App: web-ui gateway Gateway runtime cli CLI command changes commands Command implementations agents Agent runtime and tooling size: XL labels Mar 12, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 12, 2026

Greptile Summary

This PR adds an opt-in rescue watchdog feature to openclaw onboard: a second isolated local gateway profile with its own service, token, and cron job that auto-restarts the primary profile when health probes fail. The implementation is extensive (~800 lines of new source, 155 new tests) and covers both interactive and non-interactive onboarding flows, cron scheduler integration, isolated profile env construction, and launchd/systemd security hardening.

The launchd security improvements (HOME injection fix, symlink rejection, O_NOFOLLOW + atomic rename, 0o600 plist mode) are notable and appear correct. The RESCUE_ENV_ALLOWLIST approach for env isolation is appropriately conservative. Timeout/abort budget propagation in the watchdog runner is careful and well-tested.

Key issues found:

  • looksLikeAuthClose is duplicated across src/commands/onboard-rescue.ts and src/cron/rescue-watchdog.ts with a behavioral difference: the onboarding copy is missing the "role" substring check present in the watchdog copy. This could cause waitForRescueGatewayIdentity to time out even when the rescue gateway is healthy and sending a valid role-based auth rejection (1008 close).
  • buildRescueWatchdogConfig sets cron: undefined on first-time rescue setup. The watchdog cron job is added via callGateway after the gateway starts, but if the gateway treats a missing cron config section as cron-disabled, the job will never fire after initial onboarding.
  • normalizePayloadToSystemText in src/cron/service/normalize.ts uses an implicit fall-through for the rescueWatchdog kind instead of an explicit type-guarded branch, which is fragile for future payload kinds.

Confidence Score: 3/5

  • Safe to merge after addressing the looksLikeAuthClose inconsistency and clarifying the cron: undefined first-time behavior.
  • The PR is well-structured with extensive tests (155 new tests) and good security hardening in launchd. However, two logic issues lower confidence: the divergent looksLikeAuthClose implementations could cause rescue setup to silently fail for certain auth close reasons, and the first-time rescue config leaving cron: undefined carries an undocumented assumption about gateway cron defaults that may not hold.
  • src/commands/onboard-rescue.ts (duplicate looksLikeAuthClose and first-time cron: undefined), src/cron/service/normalize.ts (implicit rescueWatchdog fall-through in normalizePayloadToSystemText)

Comments Outside Diff (3)

  1. src/commands/onboard-rescue.ts, line 2123-2134 (link)

    looksLikeAuthClose diverges from rescue-watchdog.ts sibling

    This private copy of looksLikeAuthClose does not check for "role" in the close reason, but the copy in src/cron/rescue-watchdog.ts (lines 3346–3357) does. The version here is used in waitForRescueGatewayIdentity to decide whether a WebSocket 1008 close means the gateway is actually running:

    const probeLooksLikeGateway =
      probe?.ok === true || looksLikeAuthClose(probe?.close?.code, probe?.close?.reason);

    If the rescue gateway sends a 1008 close with a reason like "role denied" or "insufficient role", this version will return false, the readiness loop will keep polling until the 15 s deadline is hit, and the whole setupRescueWatchdog call will throw – even though the gateway is up and healthy. The watchdog-side version would correctly identify this as an auth close and report the gateway as reachable.

    Both copies should agree on the full set of auth-related substrings. Consider extracting the function to a shared location in src/rescue/watchdog-shared.ts so there is only one source of truth.

  2. src/commands/onboard-rescue.ts, line 2304-2307 (link)

    First-time rescue setup leaves cron undefined – watchdog job may never run

    On a fresh rescue setup (no existingRescueConfig), existing.cron is undefined, so the rescue config is written with cron: undefined. Immediately afterwards, ensureRescueCronJob adds the watchdog job to the cron store via callGateway. If the gateway treats a missing cron section as disabled by default, the scheduler never processes the job and the watchdog silently does nothing after initial onboarding.

    The re-run path (existing.cron ? { ...existing.cron, enabled: true } : undefined) explicitly re-enables cron, but only if the key was already present. The first-time path has no such guarantee.

    If the gateway does default cron to enabled, a clarifying comment would make this safe to rely on. If it does not, the first-time config should explicitly set cron: { enabled: true }:

    cron: existing.cron
      ? { ...existing.cron, enabled: true }
      : { enabled: true },
  3. src/cron/service/normalize.ts, line 4061-4065 (link)

    Implicit fall-through assumes rescueWatchdog without type-narrowing

    After the systemEvent and agentTurn guards, the function falls through to:

    return `Rescue watchdog: ${payload.monitoredProfile}`.trim();

    payload is typed as CronPayload (a union), so TypeScript cannot guarantee it is CronRescueWatchdogPayload here — payload.monitoredProfile only exists on that variant and would be undefined on an unexpected future kind. An explicit guard makes this exhaustive and will cause a compile error if a new kind is ever added without updating this function:

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/commands/onboard-rescue.ts
Line: 2123-2134

Comment:
**`looksLikeAuthClose` diverges from `rescue-watchdog.ts` sibling**

This private copy of `looksLikeAuthClose` does **not** check for `"role"` in the close reason, but the copy in `src/cron/rescue-watchdog.ts` (lines 3346–3357) does. The version here is used in `waitForRescueGatewayIdentity` to decide whether a WebSocket 1008 close means the gateway is actually running:

```ts
const probeLooksLikeGateway =
  probe?.ok === true || looksLikeAuthClose(probe?.close?.code, probe?.close?.reason);
```

If the rescue gateway sends a 1008 close with a reason like `"role denied"` or `"insufficient role"`, this version will return `false`, the readiness loop will keep polling until the 15 s deadline is hit, and the whole `setupRescueWatchdog` call will throw – even though the gateway is up and healthy. The watchdog-side version would correctly identify this as an auth close and report the gateway as reachable.

Both copies should agree on the full set of auth-related substrings. Consider extracting the function to a shared location in `src/rescue/watchdog-shared.ts` so there is only one source of truth.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/commands/onboard-rescue.ts
Line: 2304-2307

Comment:
**First-time rescue setup leaves `cron` undefined – watchdog job may never run**

On a fresh rescue setup (no `existingRescueConfig`), `existing.cron` is `undefined`, so the rescue config is written with `cron: undefined`. Immediately afterwards, `ensureRescueCronJob` adds the watchdog job to the cron store via `callGateway`. If the gateway treats a missing `cron` section as *disabled by default*, the scheduler never processes the job and the watchdog silently does nothing after initial onboarding.

The re-run path (`existing.cron ? { ...existing.cron, enabled: true } : undefined`) explicitly re-enables cron, but only if the key was already present. The first-time path has no such guarantee.

If the gateway does default cron to enabled, a clarifying comment would make this safe to rely on. If it does not, the first-time config should explicitly set `cron: { enabled: true }`:

```ts
cron: existing.cron
  ? { ...existing.cron, enabled: true }
  : { enabled: true },
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/cron/service/normalize.ts
Line: 4061-4065

Comment:
**Implicit fall-through assumes `rescueWatchdog` without type-narrowing**

After the `systemEvent` and `agentTurn` guards, the function falls through to:

```ts
return `Rescue watchdog: ${payload.monitoredProfile}`.trim();
```

`payload` is typed as `CronPayload` (a union), so TypeScript cannot guarantee it is `CronRescueWatchdogPayload` here — `payload.monitoredProfile` only exists on that variant and would be `undefined` on an unexpected future kind. An explicit guard makes this exhaustive and will cause a compile error if a new kind is ever added without updating this function:

```suggestion
  if (payload.kind === "rescueWatchdog") {
    return `Rescue watchdog: ${payload.monitoredProfile}`.trim();
  }
  // TypeScript exhaustiveness check
  const _exhaustive: never = payload;
  return "";
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 8340c37

@shichangs
Copy link
Copy Markdown
Contributor Author

Addressed the 3 issues from Greptile review in commits 699dd4087 and 35ffdd139:

  1. looksLikeAuthClose divergence — Added missing "role" substring check in src/commands/onboard-rescue.ts to match the rescue-watchdog.ts version.
  2. cron: undefined on first-time setup — Changed to { enabled: true } so rescue cron is explicitly enabled from the start, not dependent on gateway defaults.
  3. Implicit fall-through in normalizePayloadToSystemText — Added explicit payload.kind === "rescueWatchdog" guard with exhaustiveness check (never assertion) for future-proofing.

All 155 tests pass locally.

@shichangs
Copy link
Copy Markdown
Contributor Author

@altaywtf @vincentkoc could you review this when you have a chance?

This PR adds the opt-in rescue watchdog across onboarding, cron, and daemon flows. The earlier Greptile findings have been addressed in follow-up commits.

@shichangs
Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Swish!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 26, 2026

Closing this as duplicate or superseded after Codex automated review.

PR #44113 should close as superseded by the later open split PR #46502. The original work is useful prior art and the author addressed review feedback, but current main still does not ship the rescue watchdog/onboarding surface, and #46502 is now the canonical thread for the core rescue watchdog service and cron engine.

Best possible solution:

Close #44113 and keep the active rescue watchdog review on open PR #46502. If #46502 lands, the onboarding flag, wizard prompt, JSON output, and docs from #44113 should come back as a smaller follow-up PR scoped to provisioning UX.

What I checked:

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against ee2ab9a64492.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling app: web-ui App: web-ui cli CLI command changes commands Command implementations docs Improvements or additions to documentation gateway Gateway runtime size: XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant