Skip to content

Rescue: add watchdog core service and cron engine#46502

Open
shichangs wants to merge 60 commits intoopenclaw:mainfrom
shichangs:codex/rescue-watchdog-core
Open

Rescue: add watchdog core service and cron engine#46502
shichangs wants to merge 60 commits intoopenclaw:mainfrom
shichangs:codex/rescue-watchdog-core

Conversation

@shichangs
Copy link
Copy Markdown
Contributor

Summary

  • Problem: OpenClaw had no built-in isolated rescue watchdog that could detect an unhealthy managed gateway profile and repair it without going through the failing primary session.
  • Why it matters: the larger rescue-watchdog feature is hard to review as one PR; this split lands the core runtime/service layer first so maintainers can review the actual watchdog mechanism without onboarding/UI noise.
  • What changed: adds daemon/service hardening needed by the watchdog, introduces a new isolated rescueWatchdog cron payload + runner, and wires the gateway cron service to execute that watchdog job against a monitored profile.
  • What did NOT change (scope boundary): no onboarding wizard flow, no CLI openclaw onboard --rescue-watchdog UX, and no docs changes in this PR.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Adds a new internal isolated cron payload kind, rescueWatchdog, for watchdog jobs that monitor a target profile and repair it when unhealthy.
  • Hardens LaunchAgent plist writes: trusted-home resolution, symlink rejection, atomic writes, and 0600 plist permissions.
  • Managed service restart helpers now accept AbortSignal, so watchdog restart attempts can be canceled cleanly instead of hanging until command timeouts.

Security Impact (required)

  • New permissions/capabilities? (Yes/No): Yes
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): Yes
  • Command/tool execution surface changed? (Yes/No): Yes
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation:
    The new capability is intentionally narrow: an isolated cron job can probe and repair a monitored local profile. Risk is bounded by restricting the payload shape, disallowing delivery targets for rescueWatchdog, resolving the monitored profile explicitly, and hardening the underlying managed-service restart and launchd plist write paths. The repair fallback uses exact argv via process.execPath and avoids shell-based PATH resolution.

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 22 / pnpm workspace
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): local managed gateway service / cron service only

Steps

  1. Run the daemon/unit tests covering launchd, systemd, schtasks, and process-exec timeout behavior.
  2. Run the cron/rescue unit tests covering payload normalization, job validation, watchdog execution, and gateway cron wiring.
  3. Build the repo and validate config to confirm the combined core layer compiles and loads cleanly.

Expected

  • A rescueWatchdog cron job can be normalized, stored, validated, and executed as an isolated job.
  • Managed service restart helpers are abortable and launchd plist writes fail closed on unsafe targets.

Actual

  • Verified by targeted tests and local build/config validation; behavior matches the expected outcomes above.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • launchd plist path resolution ignores a conflicting HOME override, rejects symlink targets, and writes plist files with 0600 permissions via temp-file + rename.
    • AbortSignal now propagates through launchd/systemd/schtasks restart helpers into the process exec layer.
    • rescueWatchdog cron payloads normalize correctly, are forced to isolated session targets, and reject delivery/failureDestination config.
    • the rescue watchdog runner probes the monitored profile, attempts managed service restart, falls back to doctor --repair --non-interactive, and reports success/error summaries correctly.
  • Edge cases checked:
    • os.userInfo() failure falls back to os.homedir(), and unresolved trusted home fails closed.
    • monitored rescue-profile names are rejected.
    • restart/repair flows respect cron timeout budget and abort behavior.
    • launchd temp plist files are cleaned up if rename fails.
  • What you did not verify:
    • manual end-to-end watchdog repair on a real long-running host with an actually failing managed gateway.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert this PR.
  • Files/config to restore: src/daemon/*, src/process/exec.ts, src/cron/*, src/rescue/watchdog-shared.ts, src/gateway/server-cron.ts
  • Known bad symptoms reviewers should watch for: launchd install unexpectedly failing on valid paths, or isolated cron jobs rejecting valid agentTurn jobs after the new payload kind was added.

Risks and Mitigations

  • Risk: LaunchAgent path hardening could reject previously tolerated but unsafe filesystem layouts.
    • Mitigation: the checks are limited to symlinked/non-directory targets in the LaunchAgents path and have focused test coverage.
  • Risk: adding a new cron payload kind could accidentally loosen validation or delivery behavior.
    • Mitigation: rescueWatchdog is explicitly restricted to isolated jobs, delivery is rejected, and schema/service/normalization paths have dedicated tests.
  • Risk: watchdog repair code could leave stuck restart attempts behind.
    • Mitigation: abort propagation is threaded through the restart backends and the watchdog enforces bounded restart/repair timeouts.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 14, 2026

Greptile Summary

This PR introduces the core service layer for a rescue watchdog feature: a new rescueWatchdog cron payload kind that runs as an isolated job to probe and repair an unhealthy managed gateway profile, along with security hardening of the LaunchAgent plist write path and AbortSignal support across all three managed-service restart backends (launchd, systemd, schtasks).

Key changes and findings:

  • src/cron/rescue-watchdog.ts: The watchdog execution loop is well-structured (probe → restart → wait → doctor fallback), and the timeout-budget accounting and abort handling are thorough in the service-restart step — but params.abortSignal is not forwarded to runCommandWithTimeout in the doctor repair path, which means an in-flight doctor subprocess will not be killed when the job is aborted and can run for up to DOCTOR_REPAIR_TIMEOUT_MS (60 s) uninterrupted.
  • src/cron/normalize.ts: The block that resolved sessionTarget: "current"session:<key> or "isolated" was silently removed. Without a schema-level rejection or store migration, callers still sending "current" will have that value stored verbatim, leading to silent misbehavior at execution time.
  • src/daemon/launchd.ts: Solid security hardening — trusted-home resolution via os.userInfo() (ignoring the HOME env var), symlink rejection for both the LaunchAgents directory and the plist target itself, atomic writes with O_NOFOLLOW, and tighter 0600 plist permissions are all correctly implemented and covered by tests.
  • src/process/exec.ts: Clean AbortSignal integration — the already-aborted case, listener registration, and cleanup on both error and close paths are all handled correctly.
  • src/cron/service/jobs.ts: rescueWatchdog is correctly threaded through all validation (delivery rejection, failure-destination rejection, assertSupportedJobSpec, merge, and build-from-patch paths), with test coverage for each.

Confidence Score: 3/5

  • Safe to merge after addressing the doctor-repair abort gap and verifying the intent behind the removal of sessionTarget "current" resolution.
  • The overall structure is sound and the security hardening in the launchd path is genuinely useful. Two concrete issues lower the score: (1) the AbortSignal is not forwarded to runCommandWithTimeout in the doctor repair branch, leaving a subprocess alive after cancellation; and (2) the silent removal of sessionTarget: "current" resolution has no schema guard or migration, risking silent mis-storage for any caller still using that value.
  • src/cron/rescue-watchdog.ts (abort signal gap in doctor repair) and src/cron/normalize.ts (removed "current" sessionTarget resolution).

Comments Outside Diff (1)

  1. src/cron/rescue-watchdog.ts, line 1037-1043 (link)

    AbortSignal not threaded into doctor repair subprocess

    The service restart step properly uses runBoundedStep which passes an AbortSignal into service.restart(...), so that step can be canceled cleanly. The doctor fallback, however, calls runCommandWithTimeout without forwarding params.abortSignal. As a result, if the watchdog job is aborted while the doctor subprocess is running, the subprocess will not be killed and the call will block for up to DOCTOR_REPAIR_TIMEOUT_MS (60 seconds) before the timeout fires.

    The existing signal support added to runCommandWithTimeout in this same PR (src/process/exec.ts) is exactly the right tool here, but it isn't wired in.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/cron/rescue-watchdog.ts
    Line: 1037-1043
    
    Comment:
    **AbortSignal not threaded into doctor repair subprocess**
    
    The service restart step properly uses `runBoundedStep` which passes an `AbortSignal` into `service.restart(...)`, so that step can be canceled cleanly. The doctor fallback, however, calls `runCommandWithTimeout` without forwarding `params.abortSignal`. As a result, if the watchdog job is aborted while the doctor subprocess is running, the subprocess will not be killed and the call will block for up to `DOCTOR_REPAIR_TIMEOUT_MS` (60 seconds) before the timeout fires.
    
    The existing `signal` support added to `runCommandWithTimeout` in this same PR (`src/process/exec.ts`) is exactly the right tool here, but it isn't wired in.
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/cron/rescue-watchdog.ts
Line: 1037-1043

Comment:
**AbortSignal not threaded into doctor repair subprocess**

The service restart step properly uses `runBoundedStep` which passes an `AbortSignal` into `service.restart(...)`, so that step can be canceled cleanly. The doctor fallback, however, calls `runCommandWithTimeout` without forwarding `params.abortSignal`. As a result, if the watchdog job is aborted while the doctor subprocess is running, the subprocess will not be killed and the call will block for up to `DOCTOR_REPAIR_TIMEOUT_MS` (60 seconds) before the timeout fires.

The existing `signal` support added to `runCommandWithTimeout` in this same PR (`src/process/exec.ts`) is exactly the right tool here, but it isn't wired in.

```suggestion
  const doctorResult = await runCommandWithTimeout(doctorArgv, {
    timeoutMs:
      typeof remainingJobBudgetMs === "number"
        ? Math.min(DOCTOR_REPAIR_TIMEOUT_MS, remainingJobBudgetMs)
        : DOCTOR_REPAIR_TIMEOUT_MS,
    env,
    signal: params.abortSignal,
  });
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/cron/normalize.ts
Line: 481

Comment:
**Removal of `sessionTarget: "current"` resolution may silently break existing callers**

The block that resolved `sessionTarget === "current"``session:${sessionKey}` (or `"isolated"` when no key was available) has been removed without a corresponding schema-level rejection. Any caller that currently passes `sessionTarget: "current"` to `normalizeCronJobInput` will now have `"current"` stored verbatim as the job's `sessionTarget`.

Because `"current"` is not a recognized execution target in `executeJobCore` or the gateway cron service, those jobs would either silently misbehave or be routed incorrectly at run time. The store-migration path also has no handling for this value.

If the intent is to stop supporting `"current"` as an input, the fix is to explicitly reject it at the schema/validation layer (or document the change) so callers get a clear error rather than silent mis-storage. If the translation was only ever applied to internal paths, a comment explaining the invariant and a test for the `"current"` input would help reviewers verify nothing broke.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: a518ef8

Comment thread src/cron/normalize.ts
@shichangs
Copy link
Copy Markdown
Contributor Author

@codex review

@shichangs
Copy link
Copy Markdown
Contributor Author

Addressed the remaining actionable Greptile point in shichangs@f3c79d7.

Changes in this push:

  • thread abortSignal into the doctor --repair --non-interactive fallback in src/cron/rescue-watchdog.ts
  • add a regression test that asserts the doctor subprocess receives the job abort signal
  • fix the repo-formatting drift in src/gateway/server/ws-connection/message-handler.ts that was tripping pnpm format:check

Local verification after the change:

  • pnpm exec vitest run src/cron/rescue-watchdog.test.ts src/cron/normalize.test.ts
  • pnpm check
  • pnpm build
  • pnpm openclaw config validate --json

@openclaw-barnacle openclaw-barnacle Bot added the agents Agent runtime and tooling label Mar 14, 2026
@shichangs shichangs requested a review from a team as a code owner March 14, 2026 18:50
@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation scripts Repository scripts labels Mar 14, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e3c47fc9af

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cron/rescue-watchdog.ts
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7f1a209602

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cron/rescue-watchdog.ts Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 668ea0c796

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cron/rescue-watchdog.ts Outdated
@shichangs
Copy link
Copy Markdown
Contributor Author

Revalidated the current PR head 8fde0cb0300210a48a9d7418a586440cbdf680a0 today.

Current status from this pass:

  • unresolved review thread count on Rescue: add watchdog core service and cron engine #46502 is 0
  • the GitHub API currently reports mergeable=CONFLICTING / mergeStateStatus=DIRTY, but the reported base SHA 8db6fcca777ac751597b1290a201d0df6161f9f2 is an ancestor of the current PR head in a fresh local clone, so I could not reproduce an actual content conflict from the current refs
  • targeted verification on the current PR head passed:
    • pnpm test -- src/cron/rescue-watchdog.test.ts src/rescue/watchdog-shared.test.ts src/daemon/launchd.test.ts src/daemon/program-args.test.ts src/process/exec.test.ts src/process/exec.no-output-timer.test.ts src/daemon/schtasks.stop.test.ts src/daemon/schtasks.test.ts src/cron/store-migration.test.ts src/cron/normalize.test.ts src/gateway/probe.test.ts src/gateway/client.test.ts src/daemon/constants.test.ts src/cli/update-cli/restart-helper.test.ts
    • pnpm exec oxlint --type-aware on the touched watchdog / daemon / gateway files
    • pnpm exec oxfmt --check on the touched watchdog / daemon / gateway files

Environment-only blockers I re-confirmed in that local clone:

  • pnpm openclaw --help and pnpm openclaw config validate --json still fail because the local dependency tree is missing @modelcontextprotocol/sdk
  • pnpm build still stops in pnpm canvas:a2ui:bundle before the rest of the build pipeline

I did not find any new actionable source issues on top of the current PR head.

@joshavant @tyler6204 could you take another look when you have a chance, especially at the stale/conflicting mergeability signal?

@shichangs
Copy link
Copy Markdown
Contributor Author

All current P1/P2 review threads on #46502 are resolved on the latest PR head 8fde0cb030.

I re-checked the final head locally against the current PR diff:

  • gh confirms the remote PR head matches local 8fde0cb0300210a48a9d7418a586440cbdf680a0
  • unresolved review thread count is 0
  • the last follow-up commit only trims unrelated model-compat drift from this branch, and I did not find a remaining actionable issue in that delta during self-review

Local validation in this worktree is still partially blocked by environment limits:

  • git fetch cannot reach github.com here because DNS resolution is failing
  • this worktree had no node_modules
  • pnpm install --offline --no-frozen-lockfile still fails because the local pnpm store is missing @opentelemetry/exporter-logs-otlp-proto@0.213.0

So I could not rerun pnpm build, pnpm check, pnpm test, or a local OpenClaw CLI smoke from this exact worktree after the final head move.

@joshavant @tyler6204 could you take another look when you have a chance?

@shichangs
Copy link
Copy Markdown
Contributor Author

Re-checked #46502 on the current PR head 8fde0cb0300210a48a9d7418a586440cbdf680a0.

Current status from this pass:

  • unresolved review thread count is 0
  • all current P1/P2 review items on the PR are already addressed or resolved
  • GitHub still reports mergeable=CONFLICTING / mergeStateStatus=DIRTY, but the reported base SHA 8db6fcca777ac751597b1290a201d0df6161f9f2 is an ancestor of the current head, so I could not find an actual content conflict from the current refs

No source change was needed in this pass.

@joshavant @tyler6204 could one of you take another look when convenient, especially at the stale mergeability signal?

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2d9bc46033

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cron/rescue-watchdog.ts Outdated
…-core

# Conflicts:
#	src/agents/model-compat.test.ts
#	src/daemon/program-args.test.ts
#	src/gateway/probe.test.ts
#	src/gateway/probe.ts
#	src/gateway/server/ws-connection/message-handler.ts
@openclaw-barnacle openclaw-barnacle Bot removed docs Improvements or additions to documentation scripts Repository scripts labels Mar 19, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0d8679e57

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/gateway/client.ts
@shichangs
Copy link
Copy Markdown
Contributor Author

All current P1/P2 review threads on #46502 are resolved again.

Latest follow-up commit:

What changed in this pass:

  • stop the rescue watchdog immediately when GatewayService.restart() only schedules an in-process launchd handoff
  • avoid probing the still-running gateway or falling through to doctor --repair before the queued restart can happen
  • add regression coverage for the scheduled-restart path

Local verification on this push:

  • pnpm test -- src/cron/rescue-watchdog.test.ts
  • pnpm exec oxfmt --check src/cron/rescue-watchdog.ts src/cron/rescue-watchdog.test.ts
  • pnpm exec oxlint --type-aware src/cron/rescue-watchdog.ts src/cron/rescue-watchdog.test.ts

Broader validation in this environment is still partially blocked by missing optional/shared dependencies:

  • pnpm check fails in the shared checkout because multiple optional extension packages are unavailable there
  • pnpm openclaw config validate --json rebuilt dist/* but the CLI still fails to start here because @modelcontextprotocol/sdk is missing from the shared node_modules

@joshavant @tyler6204 could you take another look when convenient?

@shichangs
Copy link
Copy Markdown
Contributor Author

@steipete all review threads are resolved now, including the last local probe follow-up in 66d9836. Ready for another pass when you have a minute.

@shichangs
Copy link
Copy Markdown
Contributor Author

Re-checked the current PR head 66d9836baf956f07e9ef39c535a1b3f3edf1a0e3.

Current status from this pass:

Targeted local verification on the same commit in the existing branch worktree:

  • pnpm exec vitest run --configLoader runner --config vitest.gateway.config.ts src/gateway/probe.test.ts
  • pnpm exec vitest run --configLoader runner --config vitest.unit.config.ts src/cron/rescue-watchdog.test.ts
  • pnpm exec oxlint --type-aware src/gateway/probe.ts src/gateway/probe.test.ts src/cron/rescue-watchdog.ts src/cron/rescue-watchdog.test.ts
  • pnpm exec oxfmt --check src/gateway/probe.ts src/gateway/probe.test.ts src/cron/rescue-watchdog.ts src/cron/rescue-watchdog.test.ts

The only remaining local blocker I hit here was CLI smoke from pnpm openclaw --help / pnpm openclaw config validate --json: this machine's source-run rebuild currently fails to import tsdown from the local dependency tree before the CLI starts.

@joshavant @tyler6204 could one of you take another pass when convenient?

@shichangs
Copy link
Copy Markdown
Contributor Author

Re-checked the current PR head 66d9836baf956f07e9ef39c535a1b3f3edf1a0e3.

Current status from this pass:

  • unresolved review thread count is 0
  • mergeable=MERGEABLE
  • a fresh self-review of the latest probeGateway() default-device-identity follow-up did not turn up a new source issue; the change only restores the default local probe path to device-bound while rescue-watchdog still opts in explicitly with disableDeviceIdentity: true

Local verification from the current branch worktree:

  • pnpm exec oxlint --type-aware src/gateway/probe.ts src/gateway/probe.test.ts src/cli/daemon-cli/restart-health.ts src/cli/daemon-cli/restart-health.test.ts passed
  • pnpm exec oxfmt --check src/gateway/probe.ts src/gateway/probe.test.ts src/cli/daemon-cli/restart-health.ts src/cli/daemon-cli/restart-health.test.ts passed
  • targeted pnpm test -- src/gateway/probe.test.ts src/cli/daemon-cli/restart-health.test.ts was blocked here by a sandbox write failure in Vitest's generated .vite-temp file path, not by a test assertion failure
  • pnpm openclaw --help is still blocked in this environment by the existing source-run rebuild issue importing tsdown

The current red CI also looks unrelated to this PR diff. The failing jobs are dying during pnpm install before tests/build start, with git SSH fetch errors against unchanged dependency sources:

  • git@github.com:tloncorp/api-beta.git: Permission denied (publickey)
  • git@github.com:whiskeysockets/libsignal-node.git: Permission denied (publickey)

Those repos are not introduced by this PR diff, and the latest head only touches watchdog / daemon / gateway paths.

@joshavant @tyler6204 could one of you take another pass when convenient, especially with the current CI install/auth failure in mind?

@shichangs
Copy link
Copy Markdown
Contributor Author

Re-checked #46502 on the current head 66d9836baf.

Current state:

  • all review threads are resolved, including the P1/P2 items raised so far
  • no additional code changes were needed in this pass after self-reviewing the latest probe/restart follow-ups

Local verification on this head:

  • pnpm test -- src/gateway/probe.test.ts src/cli/daemon-cli/restart-health.test.ts src/cron/rescue-watchdog.test.ts
  • pnpm exec oxlint --type-aware src/gateway/probe.ts src/gateway/probe.test.ts src/cli/daemon-cli/restart-health.ts src/cron/rescue-watchdog.ts src/cron/rescue-watchdog.test.ts
  • pnpm exec oxfmt --check src/gateway/probe.ts src/gateway/probe.test.ts src/cli/daemon-cli/restart-health.ts src/cron/rescue-watchdog.ts src/cron/rescue-watchdog.test.ts

Notes:

  • a fresh pnpm install in this worktree is still blocked by DNS resolution failures while fetching the git dependency for extensions/tlon
  • pnpm openclaw config validate --json entered the local build path but did not produce a final result in this environment, so I am not claiming a full CLI smoke from this run

@joshavant @tyler6204 could you take another pass when convenient?

@shichangs
Copy link
Copy Markdown
Contributor Author

Follow-up from local self-review: I pushed shichangs@1ba5b62 to fix a test typing mismatch on the current PR head.

What changed:

  • align the restartService mock in src/cron/rescue-watchdog.test.ts with GatewayServiceRestartResult
  • update the hanging/abort restart test cases so they keep the new restart result type instead of returning bare void

Local verification on this head:

  • pnpm test -- src/cron/rescue-watchdog.test.ts
  • pnpm exec oxfmt --check src/cron/rescue-watchdog.test.ts
  • pnpm exec oxlint --type-aware src/cron/rescue-watchdog.test.ts src/cron/rescue-watchdog.ts
  • pnpm exec tsc -p tsconfig.json --noEmit --pretty false 2>&1 | rg 'src/cron/rescue-watchdog.test.ts' (no matches)

Notes:

  • all current review threads remain resolved (0 unresolved)
  • pnpm check in the borrowed dependency environment still reports unrelated missing optional extension deps, so I did not treat that as a PR-local regression

@joshavant @tyler6204 could you take another pass when convenient?

@shichangs
Copy link
Copy Markdown
Contributor Author

Re-checked the current PR head 1ba5b624332dec6ed1a4f24d01aebfd369403393.

Current state:

  • all current P1/P2 review threads on Rescue: add watchdog core service and cron engine #46502 are resolved
  • the latest follow-up commit only tightens src/cron/rescue-watchdog.test.ts typing around GatewayServiceRestartResult
  • I did a fresh self-review on that delta and do not see a remaining correctness issue in the current branch

Local verification in this workspace is still blocked because node_modules is missing here, and pnpm install fails while fetching the git dependency for extensions/tlon due Could not resolve hostname github.com.

@joshavant @tyler6204 could you take another pass when convenient?

@shichangs
Copy link
Copy Markdown
Contributor Author

Revalidated the current PR head 1ba5b62433 locally. All current P1/P2 review threads on #46502 remain resolved.

Current validation on this head:

  • pnpm test -- src/cron/rescue-watchdog.test.ts src/gateway/probe.test.ts src/cli/daemon-cli/restart-health.test.ts
  • pnpm exec oxlint --type-aware src/cron/rescue-watchdog.ts src/cron/rescue-watchdog.test.ts src/gateway/probe.ts src/gateway/probe.test.ts src/gateway/client.ts src/gateway/client.test.ts src/cli/daemon-cli/restart-health.ts src/cli/daemon-cli/restart-health.test.ts

Environment notes from this run:

  • pnpm install --offline is still blocked here because extensions/tlon needs a git fetch from github.com, which currently fails on DNS resolution in this workspace.
  • pnpm check still fails in this borrowed-dependency setup on missing optional extension deps outside this PR's touch surface.
  • Running pnpm build and pnpm openclaw config validate --json concurrently against the borrowed node_modules did not produce a clean completion signal, so I am not treating those two as validated in this environment.

@joshavant @tyler6204 could you take a look when you have a chance?

@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 28, 2026

Codex review: needs changes before merge.

Summary
The PR adds LaunchAgent and managed-service hardening, abortable restart paths, a new isolated rescueWatchdog cron payload and runner, gateway cron wiring, gateway probe/client adjustments, and focused tests.

Reproducibility: yes. for the concrete review findings: source comparison shows the PR head omits current main launchd enable/recovery behavior, and a PR-head changelog search shows no rescueWatchdog entry. The broader feature itself has unit-level evidence in the PR discussion but no live end-to-end repair reproduction.

Next step before merge
A narrow automated repair can restore current launchd restart semantics and add the missing changelog entry before normal maintainer/security review resumes.

Security
Cleared: No concrete supply-chain, dependency-source, workflow-permission, or secret-handling issue was found, though the diff remains security-sensitive because it changes local service control and repair command execution.

Review findings

  • [P2] Re-enable LaunchAgent before direct kickstart — src/daemon/launchd.ts:647
  • [P2] Preserve kickstart-failure recovery — src/daemon/launchd.ts:653-654
  • [P3] Add the required changelog entry — src/cron/types.ts:81-84
Review details

Best possible solution:

Rebase the branch onto current main, preserve launchd restart semantics including enable-before-kickstart and kickstart-failure recovery, add the active-release changelog entry, then continue maintainer/security review of the rescueWatchdog contract.

Do we have a high-confidence way to reproduce the issue?

Yes for the concrete review findings: source comparison shows the PR head omits current main launchd enable/recovery behavior, and a PR-head changelog search shows no rescueWatchdog entry. The broader feature itself has unit-level evidence in the PR discussion but no live end-to-end repair reproduction.

Is this the best way to solve the issue?

No, not as-is. The isolated cron/service approach is plausible, but the branch must first preserve current launchd restart behavior and satisfy changelog policy before it is a maintainable merge candidate.

Full review comments:

  • [P2] Re-enable LaunchAgent before direct kickstart — src/daemon/launchd.ts:647
    Current main clears persisted launchctl disabled state before kickstart -k, so an explicit gateway restart can recover a LaunchAgent that was previously disabled. This branch goes straight to kickstart on the direct path, which can leave the service disabled and unmanaged; call launchctl enable before kickstart as main does.
    Confidence: 0.9
  • [P2] Preserve kickstart-failure recovery — src/daemon/launchd.ts:653-654
    Current main checks whether a failed kickstart -k left the LaunchAgent unloaded and attempts to bootstrap it before surfacing the original failure. This branch throws immediately on non-not-loaded kickstart failures, so the restart path can leave the service unloaded after the failure that current main repairs.
    Confidence: 0.84
  • [P3] Add the required changelog entry — src/cron/types.ts:81-84
    This adds a user-facing rescueWatchdog cron payload and service-repair capability, and the PR is marked as a feature/security-hardening change, but the branch has no matching CHANGELOG.md entry. Add a single-line entry under the active release section.
    Confidence: 0.83

Overall correctness: patch is incorrect
Overall confidence: 0.86

Acceptance criteria:

  • pnpm test -- src/daemon/launchd.test.ts src/daemon/launchd-restart-handoff.test.ts src/cron/rescue-watchdog.test.ts src/rescue/watchdog-shared.test.ts src/cron/normalize.test.ts src/cron/service.jobs.test.ts src/cron/store-migration.test.ts src/gateway/probe.test.ts src/process/exec.test.ts src/daemon/schtasks.test.ts src/daemon/schtasks.stop.test.ts
  • pnpm exec oxfmt --check --threads=1 src/daemon/launchd.ts src/daemon/launchd.test.ts src/daemon/launchd-restart-handoff.ts src/cron/rescue-watchdog.ts src/rescue/watchdog-shared.ts src/cron/types.ts src/cron/service/timer.ts src/cron/service/jobs.ts src/gateway/server-cron.ts src/gateway/probe.ts src/process/exec.ts CHANGELOG.md
  • pnpm exec oxlint --type-aware src/daemon/launchd.ts src/daemon/launchd.test.ts src/cron/rescue-watchdog.ts src/cron/rescue-watchdog.test.ts src/gateway/probe.ts src/gateway/probe.test.ts
  • pnpm check:changed in Testbox after rebase

What I checked:

Likely related people:

  • steipete: Recent history shows Peter Steinberger carrying launchd recovery/refactor work and cron contract/runtime maintenance near the PR surface. (role: recent maintainer; confidence: high; commits: 48653c203155, aa497e9c52d7, fd968bfb2d16; files: src/daemon/launchd.ts, src/cron/types.ts, src/cron/service/timer.ts)
  • Nimrod Gutman: Recent launchd lifecycle commits scoped enable/stop behavior in the same service-control area that the PR changes. (role: adjacent launchd owner; confidence: medium; commits: affffddf045d, c0ddcf663016, 23d9a100c4aa; files: src/daemon/launchd.ts)
  • Sally O'Malley: The launchd hardening commit immediately before this PR's base touched local backend pairing and launchd restart behavior relevant to the current blocker. (role: adjacent launchd maintainer; confidence: medium; commits: 8db6fcca777a; files: src/daemon/launchd.ts, src/daemon/launchd.test.ts, src/gateway/probe.ts)
  • Tyler Yust: Earlier merged cron work introduced and hardened isolated job delivery/service behavior adjacent to the new rescueWatchdog payload path. (role: cron contract contributor; confidence: medium; commits: 511c656cbcb6, 3f82daefd801, 8fae55e8e057; files: src/cron/types.ts, src/cron/service/timer.ts, src/gateway/protocol/schema/cron.ts)

Remaining risk / open question:

  • The branch is large and security-sensitive because it changes local service control, plist writes, child-process cancellation, gateway probing, and doctor repair execution.
  • No live end-to-end watchdog repair on a real long-running unhealthy host is documented; the PR body explicitly notes this gap.
  • The branch is stale relative to current main launchd recovery behavior and needs a careful rebase to avoid regressing recently shipped restart fixes.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 443f7035a2e5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling app: web-ui App: web-ui cli CLI command changes gateway Gateway runtime size: XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant