
Rescue: add watchdog core service and cron engine#46499

Closed
shichangs wants to merge 3 commits into openclaw:main from shichangs:codex/rescue-watchdog-cron

Conversation

@shichangs (Contributor)

Summary

  • Problem: OpenClaw had no built-in isolated rescue watchdog that could detect an unhealthy managed gateway profile and repair it without going through the failing primary session.
  • Why it matters: the larger rescue-watchdog feature is hard to review as one PR; this split lands the core runtime/service layer first so maintainers can review the actual watchdog mechanism without onboarding/UI noise.
  • What changed: adds daemon/service hardening needed by the watchdog, introduces a new isolated rescueWatchdog cron payload + runner, and wires the gateway cron service to execute that watchdog job against a monitored profile.
  • What did NOT change (scope boundary): no onboarding wizard flow, no CLI openclaw onboard --rescue-watchdog UX, and no docs changes in this PR.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Adds a new internal isolated cron payload kind, rescueWatchdog, for watchdog jobs that monitor a target profile and repair it when unhealthy.
  • Hardens LaunchAgent plist writes: trusted-home resolution, symlink rejection, atomic writes, and 0600 plist permissions; a minimal sketch of this write path follows this list.
  • Managed service restart helpers now accept AbortSignal, so watchdog restart attempts can be canceled cleanly instead of hanging until command timeouts.
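A minimal sketch of the hardened plist write described above. The helper name and checks are illustrative, not the PR's actual code; the patch reportedly uses O_NOFOLLOW-style opens, which this sketch approximates with lstat checks.

```typescript
import { randomBytes } from "node:crypto";
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical hardened write: refuse symlinked targets, write a temp
// file with 0600 permissions, then rename atomically into place.
function writePlistAtomically(plistPath: string, contents: string): void {
  const dir = path.dirname(plistPath);
  // lstat does not follow symlinks, so a symlinked LaunchAgents dir
  // reports as a non-directory here and is rejected.
  if (!fs.lstatSync(dir).isDirectory()) {
    throw new Error(`refusing to write into unsafe directory: ${dir}`);
  }
  let existing: fs.Stats | undefined;
  try {
    existing = fs.lstatSync(plistPath);
  } catch {
    // ENOENT is fine: the plist does not exist yet.
  }
  if (existing?.isSymbolicLink()) {
    throw new Error(`refusing to overwrite symlink: ${plistPath}`);
  }
  const tmp = path.join(dir, `.${randomBytes(8).toString("hex")}.tmp`);
  try {
    // "wx" fails if the temp file already exists; mode applies at creation.
    fs.writeFileSync(tmp, contents, { mode: 0o600, flag: "wx" });
    fs.renameSync(tmp, plistPath); // atomic on the same filesystem
  } catch (err) {
    fs.rmSync(tmp, { force: true }); // clean up the temp file on failure
    throw err;
  }
}
```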

Security Impact (required)

  • New permissions/capabilities? (Yes/No): Yes
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): Yes
  • Command/tool execution surface changed? (Yes/No): Yes
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation:
    The new capability is intentionally narrow: an isolated cron job can probe and repair a monitored local profile. Risk is bounded by restricting the payload shape, disallowing delivery targets for rescueWatchdog, resolving the monitored profile explicitly, and hardening the underlying managed-service restart and launchd plist write paths. The repair fallback uses exact argv via process.execPath and avoids shell-based PATH resolution.
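The exact-argv fallback mentioned above can be pictured like this sketch; the CLI entry path is a placeholder, not the PR's actual value.

```typescript
import { execFile } from "node:child_process";

// Invoking the repair command through process.execPath avoids shell
// interpolation and PATH lookup entirely: the argv is exact.
const controller = new AbortController();
const cliEntry = "/opt/openclaw/dist/cli.js"; // hypothetical entry point

execFile(
  process.execPath, // the exact Node binary running this process
  [cliEntry, "doctor", "--repair", "--non-interactive"],
  { timeout: 60_000, signal: controller.signal }, // bounded and abortable
  (error, stdout) => {
    if (error) console.error("repair failed:", error.message);
    else console.log(stdout.trim());
  },
);
```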

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 22 / pnpm workspace
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): local managed gateway service / cron service only

Steps

  1. Run the daemon/unit tests covering launchd, systemd, schtasks, and process-exec timeout behavior.
  2. Run the cron/rescue unit tests covering payload normalization, job validation, watchdog execution, and gateway cron wiring.
  3. Build the repo and validate config to confirm the combined core layer compiles and loads cleanly.

Expected

  • A rescueWatchdog cron job can be normalized, stored, validated, and executed as an isolated job (see the payload sketch after this list).
  • Managed service restart helpers are abortable and launchd plist writes fail closed on unsafe targets.
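A rough sketch of what that expected behavior implies; the field names and error messages are assumptions inferred from this PR description, not the actual schema.

```typescript
// Hypothetical payload shape for the new cron kind.
type RescueWatchdogPayload = {
  kind: "rescueWatchdog";
  monitoredProfile: string; // profile whose gateway the job probes
};

type CronJobDraft = {
  payload: RescueWatchdogPayload;
  sessionTarget?: string;
  delivery?: unknown;
};

// Enforce the documented constraints: isolated-only, no delivery.
function validateRescueWatchdogJob(job: CronJobDraft): void {
  if (job.sessionTarget !== undefined && job.sessionTarget !== "isolated") {
    throw new Error("rescueWatchdog jobs must use isolated session targets");
  }
  if (job.delivery !== undefined) {
    throw new Error("rescueWatchdog jobs must not configure delivery");
  }
}
```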

Actual

  • Verified by targeted tests and local build/config validation; behavior matches the expected outcomes above.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • launchd plist path resolution ignores a conflicting HOME override, rejects symlink targets, and writes plist files with 0600 permissions via temp-file + rename.
    • AbortSignal now propagates through launchd/systemd/schtasks restart helpers into the process exec layer.
    • rescueWatchdog cron payloads normalize correctly, are forced to isolated session targets, and reject delivery/failureDestination config.
    • the rescue watchdog runner probes the monitored profile, attempts a managed-service restart, falls back to doctor --repair --non-interactive, and reports success/error summaries correctly (a flow sketch follows this section).
  • Edge cases checked:
    • os.userInfo() failure falls back to os.homedir(), and unresolved trusted home fails closed.
    • monitored rescue-profile names are rejected.
    • restart/repair flows respect cron timeout budget and abort behavior.
    • launchd temp plist files are cleaned up if rename fails.
  • What you did not verify:
    • manual end-to-end watchdog repair on a real long-running host with an actually failing managed gateway.
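For reviewers, the verified escalation order can be pictured as the sketch below, with assumed helper names (probe, restartService, runDoctorRepair) rather than the PR's actual identifiers.

```typescript
type WatchdogResult =
  | { status: "ok"; summary: string }
  | { status: "error"; error: string };

// Probe -> managed-service restart -> doctor --repair, with a bounded
// abort budget threaded through every step.
async function runRescueWatchdog(deps: {
  probe: () => Promise<{ healthy: boolean }>;
  restartService: (signal: AbortSignal) => Promise<void>;
  runDoctorRepair: (signal: AbortSignal) => Promise<void>;
}): Promise<WatchdogResult> {
  const budget = AbortSignal.timeout(60_000); // cron timeout budget
  if ((await deps.probe()).healthy) {
    return { status: "ok", summary: "gateway healthy; no action taken" };
  }
  try {
    await deps.restartService(budget);
    if ((await deps.probe()).healthy) {
      return { status: "ok", summary: "recovered via service restart" };
    }
    await deps.runDoctorRepair(budget); // last-resort repair fallback
    return (await deps.probe()).healthy
      ? { status: "ok", summary: "recovered via doctor --repair" }
      : { status: "error", error: "gateway still unhealthy after repair" };
  } catch (err) {
    return { status: "error", error: String(err) };
  }
}
```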

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert this PR.
  • Files/config to restore: src/daemon/*, src/process/exec.ts, src/cron/*, src/rescue/watchdog-shared.ts, src/gateway/server-cron.ts
  • Known bad symptoms reviewers should watch for: launchd install unexpectedly failing on valid paths, or isolated cron jobs rejecting valid agentTurn jobs after the new payload kind was added.

Risks and Mitigations

  • Risk: LaunchAgent path hardening could reject previously tolerated but unsafe filesystem layouts.
    • Mitigation: the checks are limited to symlinked/non-directory targets in the LaunchAgents path and have focused test coverage.
  • Risk: adding a new cron payload kind could accidentally loosen validation or delivery behavior.
    • Mitigation: rescueWatchdog is explicitly restricted to isolated jobs, delivery is rejected, and schema/service/normalization paths have dedicated tests.
  • Risk: watchdog repair code could leave stuck restart attempts behind.
    • Mitigation: abort propagation is threaded through the restart backends and the watchdog enforces bounded restart/repair timeouts.

@shichangs (Contributor, Author)

Superseded by #46502.

I rebased the daemon + cron core stack onto the latest main and opened a fresh PR instead of force-pushing over this branch. That keeps the review entry non-conflicting and preserves incremental history.

shichangs closed this on Mar 14, 2026
greptile-apps (Bot) commented on Mar 14, 2026

Greptile Summary

This PR lands the core service layer for the rescue-watchdog feature: it introduces the rescueWatchdog cron payload kind (with normalized validation, session-target enforcement, and delivery rejection), adds an isolated job runner that probes a monitored profile's gateway and escalates through managed-service restart → doctor --repair when unhealthy, and hardens the daemon restart paths with AbortSignal propagation and secure LaunchAgent plist writes (trusted-home resolution, symlink rejection, atomic O_NOFOLLOW writes, 0600 plist mode).

Key findings:

  • resolveGatewayService() called before the initial probe (src/cron/rescue-watchdog.ts): On systems without managed-service support the watchdog returns an error on every scheduled run — even when the monitored gateway is healthy — because the service-availability check throws before the healthy early-return is reached. Moving that check to after initialProbe would fix this.
  • Undocumented removal of "current" session-target resolution (src/cron/normalize.ts): The block that resolved sessionTarget === "current" to session:<key> or "isolated" was deleted without mention in the PR's behavior-change section. The surrounding code comment still lists "current" as a valid explicit value, so callers relying on implicit resolution will be silently broken.
  • SIGKILL without prior SIGTERM (src/process/exec.ts): The abort handler immediately sends SIGKILL to the child process, bypassing any graceful shutdown the subprocess (e.g. doctor --repair) might have registered. For short-lived daemon commands this is fine; a comment explaining the deliberate choice (or a two-stage kill) would benefit future readers.

Confidence Score: 3/5

  • Safe to merge with the resolveGatewayService/probe ordering fixed; the "current" session-target removal should be clarified before merge.
  • The security hardening in launchd, the abort-signal plumbing across all daemon backends, and the cron payload infrastructure are all well-implemented and well-tested. However, two logic-level issues lower confidence: (1) the watchdog always returns an error for non-managed-service hosts because resolveGatewayService() is called before the health probe, and (2) the silent removal of "current" session-target resolution is an undocumented behavior change that could regress existing callers. Neither is a critical security issue, but both are correctness problems that should be addressed before the feature is relied upon in production.
  • src/cron/rescue-watchdog.ts (service-before-probe ordering) and src/cron/normalize.ts ("current" session-target removal).

Comments Outside Diff (3)

  1. src/cron/rescue-watchdog.ts, lines 919-927

    Service availability checked before initial probe

    resolveGatewayService() is called unconditionally — before the initial probe that checks whether the gateway is even unhealthy. If the gateway is healthy, the function returns early, but that early return is never reached when resolveGatewayService() throws first.

    This means on any system running the gateway without managed-service support (e.g., direct node invocation, dev setup, CI), the watchdog will always return status: "error" with "gateway service control unavailable" — even when the monitored gateway is perfectly healthy. The watchdog fires spurious errors on every single scheduled run.

    Moving the resolveGatewayService() call (and its error guard) to after the initial probe check would fix this without changing the repair flow:

    ```typescript
    const initialProbe = await probeProfileGateway({ cfg, port, auth });
    if (initialProbe.healthy) {
      return { status: "ok", summary: buildSummary(monitoredProfile, actions) };
    }

    let service;
    try {
      service = resolveGatewayService();
    } catch (error) {
      return {
        status: "error",
        error: `gateway service control unavailable: ...`,
      };
    }
    ```
    
  2. src/cron/normalize.ts, lines 481-490

    Undocumented removal of "current" session-target resolution

    The block that resolved sessionTarget === "current" to either session:<key> (if a session context was available) or "isolated" (as a safe fallback) was removed. The comment two lines above — "Users must explicitly specify "current" or "session:xxx" for custom session binding" — still explicitly lists "current" as a valid value users may supply.

    After this removal, any job creation or update that passes sessionTarget: "current" will no longer be resolved; the value propagates as-is through the normalization pipeline and likely reaches the execution layer in an unresolved state. Existing stored jobs that had "current" resolved at creation time are unaffected, but any user or integration that relies on passing "current" at job-create/update time will silently receive unexpected behavior.

    This behavior change is not listed under "User-visible / Behavior Changes" in the PR description. If the intent is to remove support for "current" entirely, both the schema/type union and the surrounding comment should reflect that; if it's intentional but deferred, it should be documented.
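    For reference, the removed behavior described above amounts to something like this sketch (reconstructed only from this comment; the original implementation may have differed):

    ```typescript
    function resolveSessionTarget(
      sessionTarget: string | undefined,
      currentSessionKey: string | undefined,
    ): string {
      if (sessionTarget === "current") {
        // Bind to the creating session when one is available; otherwise
        // fall back to an isolated run instead of letting "current"
        // propagate unresolved into the execution layer.
        return currentSessionKey ? `session:${currentSessionKey}` : "isolated";
      }
      return sessionTarget ?? "isolated";
    }
    ```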

  3. src/process/exec.ts, lines 222-234

    Abort sends SIGKILL without a prior SIGTERM

    When the AbortSignal fires, the handler sends SIGKILL immediately:

    child.kill("SIGKILL");

    For short-lived launchctl / systemctl / schtasks commands this is unlikely to matter, but doctor --repair is bounded at up to 60 s and could hold file-system or IPC resources (SQLite writes, lock files, named pipes). Sending SIGKILL without first attempting SIGTERM prevents the subprocess from performing any cleanup it may have registered (signal handlers, process.on("SIGTERM") in a Node child, etc.).

    Consider a two-stage approach — SIGTERM first, then SIGKILL after a short grace period (e.g. 2 s) — for commands where graceful shutdown matters. At minimum, a code comment explaining the deliberate SIGKILL choice would help future readers.
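    A minimal sketch of that two-stage escalation; the helper name and the 2 s grace value are illustrative, not the PR's code:

    ```typescript
    import { type ChildProcess, spawn } from "node:child_process";

    // SIGTERM first, SIGKILL after a short grace period.
    function killGracefully(child: ChildProcess, graceMs = 2000): void {
      child.kill("SIGTERM"); // let the subprocess run its cleanup handlers
      const escalation = setTimeout(() => {
        // Still running after the grace period: force-terminate.
        if (child.exitCode === null && child.signalCode === null) {
          child.kill("SIGKILL");
        }
      }, graceMs);
      escalation.unref(); // don't keep the event loop alive for this timer
      child.once("exit", () => clearTimeout(escalation));
    }

    // Example: an abortable child wired to the graceful kill.
    const child = spawn(process.execPath, ["cli.js", "doctor", "--repair"]);
    const ac = new AbortController();
    ac.signal.addEventListener("abort", () => killGracefully(child));
    ```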


Last reviewed commit: 5dac776


chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5dac7762cc


Comment on lines +39 to +41
```typescript
if (raw === "rescuewatchdog") {
  payload.kind = "rescueWatchdog";
  return true;
```

P2: Avoid marking canonical rescue payload kinds as migrated

When payload.kind is already "rescueWatchdog", this branch still returns true, so normalizeStoredCronJobs treats every load as a mutation and persists the store again. That causes unnecessary rewrites/churn (and inflates legacyPayloadKind issue counts) for any installation that has rescue watchdog jobs, even when the data is already normalized.
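A concise sketch of the implied fix — only report a migration when the stored kind actually changed (function name illustrative):

```typescript
// Only treat the payload as migrated when a real rewrite happened, so
// loads of already-canonical stores are not persisted again.
function migrateLegacyPayloadKind(payload: { kind: string }): boolean {
  if (
    payload.kind.toLowerCase() === "rescuewatchdog" &&
    payload.kind !== "rescueWatchdog"
  ) {
    payload.kind = "rescueWatchdog"; // legacy casing -> canonical
    return true; // mutation occurred; caller should persist the store
  }
  return false; // already canonical (or unrelated): no rewrite
}
```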



Labels

app: web-ui, gateway, size: XL