
Rescue: add watchdog core service and cron engine#46499

Closed
shichangs wants to merge 3 commits into openclaw:main from shichangs:codex/rescue-watchdog-cron

Conversation

@shichangs (Contributor)

Summary

  • Problem: OpenClaw had no built-in isolated rescue watchdog that could detect an unhealthy managed gateway profile and repair it without going through the failing primary session.
  • Why it matters: the larger rescue-watchdog feature is hard to review as one PR; this split lands the core runtime/service layer first so maintainers can review the actual watchdog mechanism without onboarding/UI noise.
  • What changed: adds daemon/service hardening needed by the watchdog, introduces a new isolated rescueWatchdog cron payload + runner, and wires the gateway cron service to execute that watchdog job against a monitored profile.
  • What did NOT change (scope boundary): no onboarding wizard flow, no CLI openclaw onboard --rescue-watchdog UX, and no docs changes in this PR.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Adds a new internal isolated cron payload kind, rescueWatchdog, for watchdog jobs that monitor a target profile and repair it when unhealthy.
  • Hardens LaunchAgent plist writes: trusted-home resolution, symlink rejection, atomic writes, and 0600 plist permissions; a minimal sketch of this write path follows this list.
  • Managed service restart helpers now accept AbortSignal, so watchdog restart attempts can be canceled cleanly instead of hanging until command timeouts.
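A minimal sketch of the hardened plist write described above. The helper name and checks are illustrative, not the PR's actual code; the patch reportedly uses O_NOFOLLOW-style opens, which this sketch approximates with lstat checks.

```typescript
import { randomBytes } from "node:crypto";
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical hardened write: refuse symlinked targets, write a temp
// file with 0600 permissions, then rename atomically into place.
function writePlistAtomically(plistPath: string, contents: string): void {
  const dir = path.dirname(plistPath);
  // lstat does not follow symlinks, so a symlinked LaunchAgents dir
  // reports as a non-directory here and is rejected.
  if (!fs.lstatSync(dir).isDirectory()) {
    throw new Error(`refusing to write into unsafe directory: ${dir}`);
  }
  let existing: fs.Stats | undefined;
  try {
    existing = fs.lstatSync(plistPath);
  } catch {
    // ENOENT is fine: the plist does not exist yet.
  }
  if (existing?.isSymbolicLink()) {
    throw new Error(`refusing to overwrite symlink: ${plistPath}`);
  }
  const tmp = path.join(dir, `.${randomBytes(8).toString("hex")}.tmp`);
  try {
    // "wx" fails if the temp file already exists; mode applies at creation.
    fs.writeFileSync(tmp, contents, { mode: 0o600, flag: "wx" });
    fs.renameSync(tmp, plistPath); // atomic on the same filesystem
  } catch (err) {
    fs.rmSync(tmp, { force: true }); // clean up the temp file on failure
    throw err;
  }
}
```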

Security Impact (required)

  • New permissions/capabilities? (Yes/No): Yes
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): Yes
  • Command/tool execution surface changed? (Yes/No): Yes
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation:
    The new capability is intentionally narrow: an isolated cron job can probe and repair a monitored local profile. Risk is bounded by restricting the payload shape, disallowing delivery targets for rescueWatchdog, resolving the monitored profile explicitly, and hardening the underlying managed-service restart and launchd plist write paths. The repair fallback uses exact argv via process.execPath and avoids shell-based PATH resolution.
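The exact-argv fallback mentioned above can be pictured like this sketch; the CLI entry path is a placeholder, not the PR's actual value.

```typescript
import { execFile } from "node:child_process";

// Invoking the repair command through process.execPath avoids shell
// interpolation and PATH lookup entirely: the argv is exact.
const controller = new AbortController();
const cliEntry = "/opt/openclaw/dist/cli.js"; // hypothetical entry point

execFile(
  process.execPath, // the exact Node binary running this process
  [cliEntry, "doctor", "--repair", "--non-interactive"],
  { timeout: 60_000, signal: controller.signal }, // bounded and abortable
  (error, stdout) => {
    if (error) console.error("repair failed:", error.message);
    else console.log(stdout.trim());
  },
);
```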

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 22 / pnpm workspace
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): local managed gateway service / cron service only

Steps

  1. Run the daemon/unit tests covering launchd, systemd, schtasks, and process-exec timeout behavior.
  2. Run the cron/rescue unit tests covering payload normalization, job validation, watchdog execution, and gateway cron wiring.
  3. Build the repo and validate config to confirm the combined core layer compiles and loads cleanly.

Expected

  • A rescueWatchdog cron job can be normalized, stored, validated, and executed as an isolated job (see the payload sketch after this list).
  • Managed service restart helpers are abortable and launchd plist writes fail closed on unsafe targets.
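A rough sketch of what that expected behavior implies; the field names and error messages are assumptions inferred from this PR description, not the actual schema.

```typescript
// Hypothetical payload shape for the new cron kind.
type RescueWatchdogPayload = {
  kind: "rescueWatchdog";
  monitoredProfile: string; // profile whose gateway the job probes
};

type CronJobDraft = {
  payload: RescueWatchdogPayload;
  sessionTarget?: string;
  delivery?: unknown;
};

// Enforce the documented constraints: isolated-only, no delivery.
function validateRescueWatchdogJob(job: CronJobDraft): void {
  if (job.sessionTarget !== undefined && job.sessionTarget !== "isolated") {
    throw new Error("rescueWatchdog jobs must use isolated session targets");
  }
  if (job.delivery !== undefined) {
    throw new Error("rescueWatchdog jobs must not configure delivery");
  }
}
```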

Actual

  • Verified by targeted tests and local build/config validation; behavior matches the expected outcomes above.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • launchd plist path resolution ignores a conflicting HOME override, rejects symlink targets, and writes plist files with 0600 permissions via temp-file + rename.
    • AbortSignal now propagates through launchd/systemd/schtasks restart helpers into the process exec layer.
    • rescueWatchdog cron payloads normalize correctly, are forced to isolated session targets, and reject delivery/failureDestination config.
    • the rescue watchdog runner probes the monitored profile, attempts a managed-service restart, falls back to doctor --repair --non-interactive, and reports success/error summaries correctly (a flow sketch follows this section).
  • Edge cases checked:
    • os.userInfo() failure falls back to os.homedir(), and unresolved trusted home fails closed.
    • monitored rescue-profile names are rejected.
    • restart/repair flows respect cron timeout budget and abort behavior.
    • launchd temp plist files are cleaned up if rename fails.
  • What you did not verify:
    • manual end-to-end watchdog repair on a real long-running host with an actually failing managed gateway.
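For reviewers, the verified escalation order can be pictured as the sketch below, with assumed helper names (probe, restartService, runDoctorRepair) rather than the PR's actual identifiers.

```typescript
type WatchdogResult =
  | { status: "ok"; summary: string }
  | { status: "error"; error: string };

// Probe -> managed-service restart -> doctor --repair, with a bounded
// abort budget threaded through every step.
async function runRescueWatchdog(deps: {
  probe: () => Promise<{ healthy: boolean }>;
  restartService: (signal: AbortSignal) => Promise<void>;
  runDoctorRepair: (signal: AbortSignal) => Promise<void>;
}): Promise<WatchdogResult> {
  const budget = AbortSignal.timeout(60_000); // cron timeout budget
  if ((await deps.probe()).healthy) {
    return { status: "ok", summary: "gateway healthy; no action taken" };
  }
  try {
    await deps.restartService(budget);
    if ((await deps.probe()).healthy) {
      return { status: "ok", summary: "recovered via service restart" };
    }
    await deps.runDoctorRepair(budget); // last-resort repair fallback
    return (await deps.probe()).healthy
      ? { status: "ok", summary: "recovered via doctor --repair" }
      : { status: "error", error: "gateway still unhealthy after repair" };
  } catch (err) {
    return { status: "error", error: String(err) };
  }
}
```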

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert this PR.
  • Files/config to restore: src/daemon/*, src/process/exec.ts, src/cron/*, src/rescue/watchdog-shared.ts, src/gateway/server-cron.ts
  • Known bad symptoms reviewers should watch for: launchd install unexpectedly failing on valid paths, or isolated cron jobs rejecting valid agentTurn jobs after the new payload kind was added.

Risks and Mitigations

  • Risk: LaunchAgent path hardening could reject previously tolerated but unsafe filesystem layouts.
    • Mitigation: the checks are limited to symlinked/non-directory targets in the LaunchAgents path and have focused test coverage.
  • Risk: adding a new cron payload kind could accidentally loosen validation or delivery behavior.
    • Mitigation: rescueWatchdog is explicitly restricted to isolated jobs, delivery is rejected, and schema/service/normalization paths have dedicated tests.
  • Risk: watchdog repair code could leave stuck restart attempts behind.
    • Mitigation: abort propagation is threaded through the restart backends and the watchdog enforces bounded restart/repair timeouts.

@shichangs (Contributor, Author)

Superseded by #46502.

I rebased the daemon + cron core stack onto the latest main and opened a fresh PR instead of force-pushing over this branch. That keeps the review entry non-conflicting and preserves incremental history.

shichangs closed this on Mar 14, 2026
greptile-apps (Bot) commented on Mar 14, 2026

Greptile Summary

This PR lands the core service layer for the rescue-watchdog feature: it introduces the rescueWatchdog cron payload kind (with normalized validation, session-target enforcement, and delivery rejection), adds an isolated job runner that probes a monitored profile's gateway and escalates through managed-service restart → doctor --repair when unhealthy, and hardens the daemon restart paths with AbortSignal propagation and secure LaunchAgent plist writes (trusted-home resolution, symlink rejection, atomic O_NOFOLLOW writes, 0600 plist mode).

Key findings:

  • resolveGatewayService() called before the initial probe (src/cron/rescue-watchdog.ts): On systems without managed-service support the watchdog returns an error on every scheduled run — even when the monitored gateway is healthy — because the service-availability check throws before the healthy early-return is reached. Moving that check to after initialProbe would fix this.
  • Undocumented removal of "current" session-target resolution (src/cron/normalize.ts): The block that resolved sessionTarget === "current" to session:<key> or "isolated" was deleted without mention in the PR's behavior-change section. The surrounding code comment still lists "current" as a valid explicit value, so callers relying on implicit resolution will be silently broken.
  • SIGKILL without prior SIGTERM (src/process/exec.ts): The abort handler immediately sends SIGKILL to the child process, bypassing any graceful shutdown the subprocess (e.g. doctor --repair) might have registered. For short-lived daemon commands this is fine; a comment explaining the deliberate choice (or a two-stage kill) would benefit future readers.

Confidence Score: 3/5

  • Safe to merge with the resolveGatewayService/probe ordering fixed; the "current" session-target removal should be clarified before merge.
  • The security hardening in launchd, the abort-signal plumbing across all daemon backends, and the cron payload infrastructure are all well-implemented and well-tested. However, two logic-level issues lower confidence: (1) the watchdog always returns an error for non-managed-service hosts because resolveGatewayService() is called before the health probe, and (2) the silent removal of "current" session-target resolution is an undocumented behavior change that could regress existing callers. Neither is a critical security issue, but both are correctness problems that should be addressed before the feature is relied upon in production.
  • src/cron/rescue-watchdog.ts (service-before-probe ordering) and src/cron/normalize.ts ("current" session-target removal).

Comments Outside Diff (3)

  1. src/cron/rescue-watchdog.ts, lines 919-927

    Service availability checked before initial probe

    resolveGatewayService() is called unconditionally — before the initial probe that checks whether the gateway is even unhealthy. If the gateway is healthy, the function returns early, but that early return is never reached when resolveGatewayService() throws first.

    This means on any system running the gateway without managed-service support (e.g., direct node invocation, dev setup, CI), the watchdog will always return status: "error" with "gateway service control unavailable" — even when the monitored gateway is perfectly healthy. The watchdog fires spurious errors on every single scheduled run.

    Moving the resolveGatewayService() call (and its error guard) to after the initial probe check would fix this without changing the repair flow:

    ```typescript
    const initialProbe = await probeProfileGateway({ cfg, port, auth });
    if (initialProbe.healthy) {
      return { status: "ok", summary: buildSummary(monitoredProfile, actions) };
    }

    let service;
    try {
      service = resolveGatewayService();
    } catch (error) {
      return {
        status: "error",
        error: `gateway service control unavailable: ...`,
      };
    }
    ```
    
  2. src/cron/normalize.ts, lines 481-490

    Undocumented removal of "current" session-target resolution

    The block that resolved sessionTarget === "current" to either session:<key> (if a session context was available) or "isolated" (as a safe fallback) was removed. The comment two lines above — "Users must explicitly specify "current" or "session:xxx" for custom session binding" — still explicitly lists "current" as a valid value users may supply.

    After this removal, any job creation or update that passes sessionTarget: "current" will no longer be resolved; the value propagates as-is through the normalization pipeline and likely reaches the execution layer in an unresolved state. Existing stored jobs that had "current" resolved at creation time are unaffected, but any user or integration that relies on passing "current" at job-create/update time will silently receive unexpected behavior.

    This behavior change is not listed under "User-visible / Behavior Changes" in the PR description. If the intent is to remove support for "current" entirely, both the schema/type union and the surrounding comment should reflect that; if it's intentional but deferred, it should be documented.
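    For reference, the removed behavior described above amounts to something like this sketch (reconstructed only from this comment; the original implementation may have differed):

    ```typescript
    function resolveSessionTarget(
      sessionTarget: string | undefined,
      currentSessionKey: string | undefined,
    ): string {
      if (sessionTarget === "current") {
        // Bind to the creating session when one is available; otherwise
        // fall back to an isolated run instead of letting "current"
        // propagate unresolved into the execution layer.
        return currentSessionKey ? `session:${currentSessionKey}` : "isolated";
      }
      return sessionTarget ?? "isolated";
    }
    ```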

  3. src/process/exec.ts, lines 222-234

    Abort sends SIGKILL without a prior SIGTERM

    When the AbortSignal fires, the handler sends SIGKILL immediately:

    child.kill("SIGKILL");

    For short-lived launchctl / systemctl / schtasks commands this is unlikely to matter, but doctor --repair is bounded at up to 60 s and could hold file-system or IPC resources (SQLite writes, lock files, named pipes). Sending SIGKILL without first attempting SIGTERM prevents the subprocess from performing any cleanup it may have registered (signal handlers, process.on("SIGTERM") in a Node child, etc.).

    Consider a two-stage approach — SIGTERM first, then SIGKILL after a short grace period (e.g. 2 s) — for commands where graceful shutdown matters. At minimum, a code comment explaining the deliberate SIGKILL choice would help future readers.
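    A minimal sketch of that two-stage escalation; the helper name and the 2 s grace value are illustrative, not the PR's code:

    ```typescript
    import { type ChildProcess, spawn } from "node:child_process";

    // SIGTERM first, SIGKILL after a short grace period.
    function killGracefully(child: ChildProcess, graceMs = 2000): void {
      child.kill("SIGTERM"); // let the subprocess run its cleanup handlers
      const escalation = setTimeout(() => {
        // Still running after the grace period: force-terminate.
        if (child.exitCode === null && child.signalCode === null) {
          child.kill("SIGKILL");
        }
      }, graceMs);
      escalation.unref(); // don't keep the event loop alive for this timer
      child.once("exit", () => clearTimeout(escalation));
    }

    // Example: an abortable child wired to the graceful kill.
    const child = spawn(process.execPath, ["cli.js", "doctor", "--repair"]);
    const ac = new AbortController();
    ac.signal.addEventListener("abort", () => killGracefully(child));
    ```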


Last reviewed commit: 5dac776


chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5dac7762cc


Comment on lines +39 to +41
```typescript
if (raw === "rescuewatchdog") {
  payload.kind = "rescueWatchdog";
  return true;
```

P2: Avoid marking canonical rescue payload kinds as migrated

When payload.kind is already "rescueWatchdog", this branch still returns true, so normalizeStoredCronJobs treats every load as a mutation and persists the store again. That causes unnecessary rewrites/churn (and inflates legacyPayloadKind issue counts) for any installation that has rescue watchdog jobs, even when the data is already normalized.
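A concise sketch of the implied fix — only report a migration when the stored kind actually changed (function name illustrative):

```typescript
// Only treat the payload as migrated when a real rewrite happened, so
// loads of already-canonical stores are not persisted again.
function migrateLegacyPayloadKind(payload: { kind: string }): boolean {
  if (
    payload.kind.toLowerCase() === "rescuewatchdog" &&
    payload.kind !== "rescueWatchdog"
  ) {
    payload.kind = "rescueWatchdog"; // legacy casing -> canonical
    return true; // mutation occurred; caller should persist the store
  }
  return false; // already canonical (or unrelated): no rewrite
}
```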



Labels

app: web-ui, gateway, size: XL