Onboarding: add opt-in rescue watchdog for self-healing gateway recovery by shichangs · Pull Request #44113 · openclaw/openclaw

shichangs · 2026-03-12T14:24:15Z

AI-assisted: Yes (Codex). Fully tested locally.

Summary

Problem: onboarding had no one-click way to provision an isolated rescue gateway profile that could monitor and repair the primary local profile.
Why it matters: if the main gateway goes down, IM-driven self-recovery is unavailable unless a second independently managed profile exists.
What changed: added interactive and non-interactive rescue-watchdog onboarding, isolated rescue profile setup, rescue cron provisioning, and docs/tests for the new flow.
What did NOT change (scope boundary): no changes to remote onboarding, no channel credential cloning into rescue, and no changes to the existing gateway service/install model outside this opt-in flow.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

User-visible / Behavior Changes

openclaw onboard can now offer an opt-in rescue watchdog during local onboarding.
openclaw onboard --rescue-watchdog provisions a second isolated rescue profile, managed gateway service, and rescue cron job.
Non-interactive onboarding JSON output now includes rescueWatchdog details when that option is enabled.
Re-running onboarding preserves rescue-only auth profiles instead of overwriting them.
Interactive onboarding now correctly prompts for rescue watchdog when the flag is not explicitly passed.

Security Impact (required)

New permissions/capabilities? (Yes/No): Yes
Secrets/tokens handling changed? (Yes/No): Yes
New/changed network calls? (Yes/No): Yes
Command/tool execution surface changed? (Yes/No): Yes
Data access scope changed? (Yes/No): No
If any Yes, explain risk + mitigation:
This is an explicit opt-in feature that provisions a second local gateway profile with its own token, workspace, service, and cron job. The rescue profile does not clone channel/web config from the primary profile, does not deliver messages externally, and runs isolated cron checks against the primary profile only. Rescue auth-profile syncing preserves rescue-local credentials while merging inherited primary credentials. Environment variable allowlisting prevents sensitive vars from leaking across profiles. Command execution uses process.execPath directly to avoid PATH injection.

Repro + Verification

Environment

OS: macOS
Runtime/container: Node 22 / pnpm workspace
Model/provider: N/A
Integration/channel (if any): N/A
Relevant config (redacted): local onboarding config only

Steps

Run openclaw onboard without passing --rescue-watchdog.
Confirm the wizard offers the rescue watchdog prompt, then enable it or run openclaw onboard --non-interactive --rescue-watchdog ....
Inspect the generated rescue profile config/service/cron and re-run onboarding to verify rescue auth profiles are preserved.

Expected

Interactive onboarding offers the rescue prompt by default.
Rescue setup writes to an isolated rescue profile/state/config path.
Existing rescue-only auth profiles remain present after re-running onboarding.

Actual

Verified by automated tests and local repo checks; behavior now matches the expected outcomes above.

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

All 155 rescue-related tests pass (watchdog-shared, rescue-watchdog, onboard-rescue, onboard-rescue.setup, onboarding.finalize, launchd, service.jobs, normalize, register.onboard, local, local.run, daemon-install).

Human Verification (required)

What you personally verified (not just CI), and how:

Verified scenarios:
- --rescue-watchdog is forwarded in CLI mode and remains undefined when not explicitly passed.
- Interactive onboarding prompts for rescue watchdog by default.
- Rescue setup uses isolated rescue profile env/state/config paths.
- Rescue auth store persistence keeps rescue-only credentials while inheriting primary credentials.
- runBoundedStep callback correctly returns Promise<void> (P1 fix from self-review).
Edge cases checked:
- Rescue profile names ending in -rescue remain unsupported (no nested rescue).
- Re-running onboarding does not copy primary cron settings into a fresh rescue profile.
- Rescue install errors preserve the original cause.
- Long profile names are truncated with stable hash suffix.
- Port collision between rescue and primary gateway is handled.
What you did not verify:
- Manual launchd/systemd end-to-end behavior on a real long-running host.
- Real IM-channel recovery against a live production gateway.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? (Yes/No): Yes
Config/env changes? (Yes/No): Yes (new rescueWatchdog config section, opt-in only)
Migration needed? (Yes/No): No

Failure Recovery (if this breaks)

How to disable/revert this change quickly: Re-run onboarding without rescue, or remove the rescue profile/service and its cron job.
Files/config to restore: Rescue profile state dir (e.g. ~/.openclaw-rescue*) and its managed gateway service definition.
Known bad symptoms reviewers should watch for: Rescue config being written into the primary profile path, missing rescue prompt in interactive onboarding, or rescue auth profiles disappearing after onboarding reruns.

Risks and Mitigations

Risk: Rescue setup could accidentally write into the active primary profile state/config when onboarding runs under a non-default profile.
- Mitigation: Rescue env construction clears inherited profile-derived env vars before applying the rescue profile.
Risk: Re-running onboarding could erase rescue-local credentials.
- Mitigation: Rescue auth syncing loads from the rescue agent dir so existing rescue credentials are preserved and merged.
Risk: Interactive onboarding could silently skip the rescue prompt.
- Mitigation: CLI flag forwarding preserves undefined for the unset case, and regression tests cover the prompt path.

greptile-apps · 2026-03-12T14:39:18Z

Greptile Summary

This PR adds an opt-in rescue watchdog feature to openclaw onboard: a second isolated local gateway profile with its own service, token, and cron job that auto-restarts the primary profile when health probes fail. The implementation is extensive (~800 lines of new source, 155 new tests) and covers both interactive and non-interactive onboarding flows, cron scheduler integration, isolated profile env construction, and launchd/systemd security hardening.

The launchd security improvements (HOME injection fix, symlink rejection, O_NOFOLLOW + atomic rename, 0o600 plist mode) are notable and appear correct. The RESCUE_ENV_ALLOWLIST approach for env isolation is appropriately conservative. Timeout/abort budget propagation in the watchdog runner is careful and well-tested.

Key issues found:

looksLikeAuthClose is duplicated across src/commands/onboard-rescue.ts and src/cron/rescue-watchdog.ts with a behavioral difference: the onboarding copy is missing the "role" substring check present in the watchdog copy. This could cause waitForRescueGatewayIdentity to time out even when the rescue gateway is healthy and sending a valid role-based auth rejection (1008 close).
buildRescueWatchdogConfig sets cron: undefined on first-time rescue setup. The watchdog cron job is added via callGateway after the gateway starts, but if the gateway treats a missing cron config section as cron-disabled, the job will never fire after initial onboarding.
normalizePayloadToSystemText in src/cron/service/normalize.ts uses an implicit fall-through for the rescueWatchdog kind instead of an explicit type-guarded branch, which is fragile for future payload kinds.

Confidence Score: 3/5

Safe to merge after addressing the looksLikeAuthClose inconsistency and clarifying the cron: undefined first-time behavior.
The PR is well-structured with extensive tests (155 new tests) and good security hardening in launchd. However, two logic issues lower confidence: the divergent looksLikeAuthClose implementations could cause rescue setup to silently fail for certain auth close reasons, and the first-time rescue config leaving cron: undefined carries an undocumented assumption about gateway cron defaults that may not hold.
src/commands/onboard-rescue.ts (duplicate looksLikeAuthClose and first-time cron: undefined), src/cron/service/normalize.ts (implicit rescueWatchdog fall-through in normalizePayloadToSystemText)

Comments Outside Diff (3)

src/commands/onboard-rescue.ts, line 2123-2134 (link)

looksLikeAuthClose diverges from rescue-watchdog.ts sibling

This private copy of looksLikeAuthClose does not check for "role" in the close reason, but the copy in src/cron/rescue-watchdog.ts (lines 3346–3357) does. The version here is used in waitForRescueGatewayIdentity to decide whether a WebSocket 1008 close means the gateway is actually running:
```
const probeLooksLikeGateway =
  probe?.ok === true || looksLikeAuthClose(probe?.close?.code, probe?.close?.reason);
```
If the rescue gateway sends a 1008 close with a reason like "role denied" or "insufficient role", this version will return false, the readiness loop will keep polling until the 15 s deadline is hit, and the whole setupRescueWatchdog call will throw – even though the gateway is up and healthy. The watchdog-side version would correctly identify this as an auth close and report the gateway as reachable.

Both copies should agree on the full set of auth-related substrings. Consider extracting the function to a shared location in src/rescue/watchdog-shared.ts so there is only one source of truth.
src/commands/onboard-rescue.ts, line 2304-2307 (link)

First-time rescue setup leaves cron undefined – watchdog job may never run

On a fresh rescue setup (no existingRescueConfig), existing.cron is undefined, so the rescue config is written with cron: undefined. Immediately afterwards, ensureRescueCronJob adds the watchdog job to the cron store via callGateway. If the gateway treats a missing cron section as disabled by default, the scheduler never processes the job and the watchdog silently does nothing after initial onboarding.

The re-run path (existing.cron ? { ...existing.cron, enabled: true } : undefined) explicitly re-enables cron, but only if the key was already present. The first-time path has no such guarantee.

If the gateway does default cron to enabled, a clarifying comment would make this safe to rely on. If it does not, the first-time config should explicitly set cron: { enabled: true }:
```
cron: existing.cron
  ? { ...existing.cron, enabled: true }
  : { enabled: true },
```
src/cron/service/normalize.ts, line 4061-4065 (link)

Implicit fall-through assumes rescueWatchdog without type-narrowing

After the systemEvent and agentTurn guards, the function falls through to:
```
return `Rescue watchdog: ${payload.monitoredProfile}`.trim();
```
payload is typed as CronPayload (a union), so TypeScript cannot guarantee it is CronRescueWatchdogPayload here — payload.monitoredProfile only exists on that variant and would be undefined on an unexpected future kind. An explicit guard makes this exhaustive and will cause a compile error if a new kind is ever added without updating this function:

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/commands/onboard-rescue.ts
Line: 2123-2134

Comment:
**`looksLikeAuthClose` diverges from `rescue-watchdog.ts` sibling**

This private copy of `looksLikeAuthClose` does **not** check for `"role"` in the close reason, but the copy in `src/cron/rescue-watchdog.ts` (lines 3346–3357) does. The version here is used in `waitForRescueGatewayIdentity` to decide whether a WebSocket 1008 close means the gateway is actually running:

```ts
const probeLooksLikeGateway =
  probe?.ok === true || looksLikeAuthClose(probe?.close?.code, probe?.close?.reason);
```

If the rescue gateway sends a 1008 close with a reason like `"role denied"` or `"insufficient role"`, this version will return `false`, the readiness loop will keep polling until the 15 s deadline is hit, and the whole `setupRescueWatchdog` call will throw – even though the gateway is up and healthy. The watchdog-side version would correctly identify this as an auth close and report the gateway as reachable.

Both copies should agree on the full set of auth-related substrings. Consider extracting the function to a shared location in `src/rescue/watchdog-shared.ts` so there is only one source of truth.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/commands/onboard-rescue.ts
Line: 2304-2307

Comment:
**First-time rescue setup leaves `cron` undefined – watchdog job may never run**

On a fresh rescue setup (no `existingRescueConfig`), `existing.cron` is `undefined`, so the rescue config is written with `cron: undefined`. Immediately afterwards, `ensureRescueCronJob` adds the watchdog job to the cron store via `callGateway`. If the gateway treats a missing `cron` section as *disabled by default*, the scheduler never processes the job and the watchdog silently does nothing after initial onboarding.

The re-run path (`existing.cron ? { ...existing.cron, enabled: true } : undefined`) explicitly re-enables cron, but only if the key was already present. The first-time path has no such guarantee.

If the gateway does default cron to enabled, a clarifying comment would make this safe to rely on. If it does not, the first-time config should explicitly set `cron: { enabled: true }`:

```ts
cron: existing.cron
  ? { ...existing.cron, enabled: true }
  : { enabled: true },
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/cron/service/normalize.ts
Line: 4061-4065

Comment:
**Implicit fall-through assumes `rescueWatchdog` without type-narrowing**

After the `systemEvent` and `agentTurn` guards, the function falls through to:

```ts
return `Rescue watchdog: ${payload.monitoredProfile}`.trim();
```

`payload` is typed as `CronPayload` (a union), so TypeScript cannot guarantee it is `CronRescueWatchdogPayload` here — `payload.monitoredProfile` only exists on that variant and would be `undefined` on an unexpected future kind. An explicit guard makes this exhaustive and will cause a compile error if a new kind is ever added without updating this function:

```suggestion
  if (payload.kind === "rescueWatchdog") {
    return `Rescue watchdog: ${payload.monitoredProfile}`.trim();
  }
  // TypeScript exhaustiveness check
  const _exhaustive: never = payload;
  return "";
```

How can I resolve this? If you propose a fix, please make it concise.

_{Last reviewed commit: 8340c37}

shichangs · 2026-03-12T14:56:20Z

Addressed the 3 issues from Greptile review in commits 699dd4087 and 35ffdd139:

looksLikeAuthClose divergence — Added missing "role" substring check in src/commands/onboard-rescue.ts to match the rescue-watchdog.ts version.
cron: undefined on first-time setup — Changed to { enabled: true } so rescue cron is explicitly enabled from the start, not dependent on gateway defaults.
Implicit fall-through in normalizePayloadToSystemText — Added explicit payload.kind === "rescueWatchdog" guard with exhaustiveness check (never assertion) for future-proofing.

All 155 tests pass locally.

shichangs · 2026-03-12T18:08:11Z

@altaywtf @vincentkoc could you review this when you have a chance?

This PR adds the opt-in rescue watchdog across onboarding, cron, and daemon flows. The earlier Greptile findings have been addressed in follow-up commits.

shichangs · 2026-03-13T19:23:59Z

@codex review

chatgpt-codex-connector · 2026-03-13T19:36:53Z

Codex Review: Didn't find any major issues. Swish!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

clawsweeper · 2026-04-26T06:23:50Z

Closing this as duplicate or superseded after Codex automated review.

PR #44113 should close as superseded by the later open split PR #46502. The original work is useful prior art and the author addressed review feedback, but current main still does not ship the rescue watchdog/onboarding surface, and #46502 is now the canonical thread for the core rescue watchdog service and cron engine.

Best possible solution:

Close #44113 and keep the active rescue watchdog review on open PR #46502. If #46502 lands, the onboarding flag, wizard prompt, JSON output, and docs from #44113 should come back as a smaller follow-up PR scoped to provisioning UX.

What I checked:

Superseding PR: The provided GitHub context shows open PR Rescue: add watchdog core service and cron engine #46502 explicitly listing Supersedes #44113 and narrowing review to the rescue watchdog core service and cron engine. A read-only remote ref check confirms Rescue: add watchdog core service and cron engine #46502 still has a live PR head at 1ba5b62. (1ba5b624332d)
Current CLI surface absent: Current main's openclaw onboard options include daemon install flags and then --daemon-runtime, but there is no --rescue-watchdog option or forwarded onboarding option. (src/cli/program/register.onboard.ts:160, ee2ab9a64492)
Current cron contract absent: Current main's cron payload union is still only systemEvent or agentTurn; there is no rescueWatchdog payload kind in the shipped type contract. (src/cron/types.ts:112, ee2ab9a64492)
Gateway cron runner still generic: Current main wires cron execution through runIsolatedAgentJob and runCronIsolatedAgentTurn; there is no rescue watchdog runner dependency or dispatch path. (src/gateway/server-cron.ts:333, ee2ab9a64492)
Current onboarding docs have no rescue step: The current CLI setup reference lists local onboarding as model/auth, workspace, gateway, channels, daemon install, health check, and skills setup, with no rescue watchdog step or option documented. Public docs: docs/start/wizard-cli-reference.md. (docs/start/wizard-cli-reference.md:15, ee2ab9a64492)
Review feedback preserved as prior art: The PR discussion included Greptile review findings, followed by the author's comment that commits 699dd40 and 35ffdd1 addressed the three issues. That makes Onboarding: add opt-in rescue watchdog for self-healing gateway recovery #44113 useful prior art, but the later Rescue: add watchdog core service and cron engine #46502 split is the active review path. (35ffdd139db6)

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against ee2ab9a64492.

SC-Claw added 6 commits March 12, 2026 22:16

feat(rescue): add rescue watchdog core types and shared utilities

ac831e2

feat(daemon): add rescue profile service management

41555c7

feat(cron): add rescue watchdog job engine

c725932

feat(onboard): add opt-in rescue watchdog onboarding flow

a6aacb0

docs(rescue): add rescue watchdog documentation and stabilize test mocks

6869915

fix(cron): align restart callback with runBoundedStep void signature

8340c37

openclaw-barnacle Bot added docs Improvements or additions to documentation app: web-ui App: web-ui gateway Gateway runtime cli CLI command changes commands Command implementations agents Agent runtime and tooling size: XL labels Mar 12, 2026

SC-Claw added 2 commits March 12, 2026 22:55

fix(rescue): address Greptile review findings

699dd40

test(rescue): update cron config assertion for explicit enabled default

35ffdd1

This was referenced Mar 14, 2026

Daemon: harden launchd plist writes and abortable service restarts #46493

Closed

Rescue: add watchdog core service and cron engine #46499

Closed

Rescue: add watchdog core service and cron engine #46502

Open

ocdlmv1 mentioned this pull request Mar 22, 2026

The Watchdog Problem — Why OpenClaw Can't Save Itself (Yet) #52196

Closed

clawsweeper Bot closed this Apr 26, 2026

BingqingLyu mentioned this pull request Apr 27, 2026

Rescue: add watchdog core service and cron engine BingqingLyu/openclaw#711

Open

20 tasks

clawsweeper Bot mentioned this pull request Apr 30, 2026

feat: add external rescue watchdog #40791

Open

12 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Onboarding: add opt-in rescue watchdog for self-healing gateway recovery#44113

Onboarding: add opt-in rescue watchdog for self-healing gateway recovery#44113
shichangs wants to merge 8 commits intoopenclaw:mainfrom
shichangs:rescue-watchdog-v2

shichangs commented Mar 12, 2026

Uh oh!

greptile-apps Bot commented Mar 12, 2026 •

edited

Loading

Comments Outside Diff (3)

Uh oh!

shichangs commented Mar 12, 2026

Uh oh!

shichangs commented Mar 12, 2026

Uh oh!

shichangs commented Mar 13, 2026

Uh oh!

chatgpt-codex-connector Bot commented Mar 13, 2026

Uh oh!

clawsweeper Bot commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

shichangs commented Mar 12, 2026

Summary

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

User-visible / Behavior Changes

Security Impact (required)

Repro + Verification

Environment

Steps

Expected

Actual

Evidence

Human Verification (required)

Review Conversations

Compatibility / Migration

Failure Recovery (if this breaks)

Risks and Mitigations

Uh oh!

greptile-apps Bot commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Comments Outside Diff (3)

Uh oh!

shichangs commented Mar 12, 2026

Uh oh!

shichangs commented Mar 12, 2026

Uh oh!

shichangs commented Mar 13, 2026

Uh oh!

chatgpt-codex-connector Bot commented Mar 13, 2026

Uh oh!

clawsweeper Bot commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Mar 12, 2026 •

edited

Loading