feat: add external rescue watchdog#40791
Conversation
Greptile SummaryThis PR adds an optional external rescue watchdog ( Key findings:
Confidence Score: 3/5
Last reviewed commit: 817a95c |
| if bash -lc "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then | ||
| runner_status="ok" | ||
| else | ||
| runner_status="failed:$?" | ||
| fi |
There was a problem hiding this comment.
Login shell may reset environment for runner
bash -lc "$RUNNER_CMD" spawns a login shell, which sources /etc/profile, ~/.bash_profile, and similar profile scripts. In some environments these profile scripts reset PATH, unset exported variables, or override other env vars that the watchdog carefully exports (e.g. OPENCLAW_RESCUE_INCIDENT_DIR, OPENCLAW_RESCUE_WORKSPACE_DIR). If those variables disappear, the runner will fail with OPENCLAW_RESCUE_INCIDENT_DIR is required.
For the typical operator use-case where RUNNER_CMD is simply a script path, using bash -c (non-login) or direct execution would be more predictable:
| if bash -lc "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then | |
| runner_status="ok" | |
| else | |
| runner_status="failed:$?" | |
| fi | |
| if bash -c "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then | |
| runner_status="ok" | |
| else | |
| runner_status="failed:$?" | |
| fi |
If login-shell PATH resolution is intentionally needed (e.g. so the runner can find claude installed via a shell profile), consider documenting this expectation, or have the runner resolve its own PATH instead.
Prompt To Fix With AI
This is a comment left during a code review.
Path: scripts/rescue-watchdog.sh
Line: 240-244
Comment:
**Login shell may reset environment for runner**
`bash -lc "$RUNNER_CMD"` spawns a login shell, which sources `/etc/profile`, `~/.bash_profile`, and similar profile scripts. In some environments these profile scripts reset `PATH`, unset exported variables, or override other env vars that the watchdog carefully exports (e.g. `OPENCLAW_RESCUE_INCIDENT_DIR`, `OPENCLAW_RESCUE_WORKSPACE_DIR`). If those variables disappear, the runner will fail with `OPENCLAW_RESCUE_INCIDENT_DIR is required`.
For the typical operator use-case where `RUNNER_CMD` is simply a script path, using `bash -c` (non-login) or direct execution would be more predictable:
```suggestion
if bash -c "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
runner_status="ok"
else
runner_status="failed:$?"
fi
```
If login-shell PATH resolution is intentionally needed (e.g. so the runner can find `claude` installed via a shell profile), consider documenting this expectation, or have the runner resolve its own PATH instead.
How can I resolve this? If you propose a fix, please make it concise.| cat >"$PROMPT_FILE" <<EOF | ||
| An OpenClaw rescue watchdog detected a runtime failure. | ||
|
|
||
| Goals: | ||
| - Restore reply capability with the smallest safe fix. | ||
| - Prefer narrow changes over refactors. | ||
| - Do not commit, push, or modify files outside this workspace. | ||
| - Use the captured incident artifacts before exploring broadly. | ||
|
|
||
| Workspace: | ||
| - Repo: $WORKSPACE_DIR | ||
| - Incident dir: $INCIDENT_DIR | ||
|
|
||
| Artifacts: | ||
| - Summary: $SUMMARY_FILE | ||
| - Health snapshot: $HEALTH_FILE | ||
| - CLI status: $STATUS_FILE | ||
| - Gateway log tail: $LOG_FILE | ||
|
|
||
| Required steps: | ||
| 1. Read the incident artifacts first. | ||
| 2. Inspect only the code paths suggested by the incident. | ||
| 3. If you find a safe fix, apply it in this workspace. | ||
| 4. Run focused verification. Prefer pnpm build; if that is too expensive, explain why and run a narrower check. | ||
| 5. If possible, run openclaw health --json --timeout 10000 after the fix. | ||
| 6. Print a concise operator summary with: | ||
| - root cause | ||
| - files changed | ||
| - verification run | ||
| - remaining risk | ||
|
|
||
| Incident summary: | ||
| $summary_text | ||
| EOF |
There was a problem hiding this comment.
Heredoc delimiter could be prematurely closed
The heredoc uses EOF as its delimiter, and $summary_text is expanded inside it. If an openclaw error message ever produces a line that is exactly the string EOF (e.g. a log line like ... error at EOF), bash will close the heredoc at that line, silently truncating the prompt written to $PROMPT_FILE.
Using a less likely delimiter prevents this:
| cat >"$PROMPT_FILE" <<EOF | |
| An OpenClaw rescue watchdog detected a runtime failure. | |
| Goals: | |
| - Restore reply capability with the smallest safe fix. | |
| - Prefer narrow changes over refactors. | |
| - Do not commit, push, or modify files outside this workspace. | |
| - Use the captured incident artifacts before exploring broadly. | |
| Workspace: | |
| - Repo: $WORKSPACE_DIR | |
| - Incident dir: $INCIDENT_DIR | |
| Artifacts: | |
| - Summary: $SUMMARY_FILE | |
| - Health snapshot: $HEALTH_FILE | |
| - CLI status: $STATUS_FILE | |
| - Gateway log tail: $LOG_FILE | |
| Required steps: | |
| 1. Read the incident artifacts first. | |
| 2. Inspect only the code paths suggested by the incident. | |
| 3. If you find a safe fix, apply it in this workspace. | |
| 4. Run focused verification. Prefer pnpm build; if that is too expensive, explain why and run a narrower check. | |
| 5. If possible, run openclaw health --json --timeout 10000 after the fix. | |
| 6. Print a concise operator summary with: | |
| - root cause | |
| - files changed | |
| - verification run | |
| - remaining risk | |
| Incident summary: | |
| $summary_text | |
| EOF | |
| cat >"$PROMPT_FILE" <<'CLAUDE_PROMPT_EOF' | |
| An OpenClaw rescue watchdog detected a runtime failure. | |
| Goals: | |
| - Restore reply capability with the smallest safe fix. | |
| - Prefer narrow changes over refactors. | |
| - Do not commit, push, or modify files outside this workspace. | |
| - Use the captured incident artifacts before exploring broadly. | |
| Workspace: | |
| - Repo: WORKSPACE_DIR_PLACEHOLDER | |
| - Incident dir: INCIDENT_DIR_PLACEHOLDER |
Note: switching to a quoted (<<'...') heredoc disables variable expansion. A safer approach that keeps expansion is to use a unique, non-guessable delimiter:
cat >"$PROMPT_FILE" <<RESCUE_WATCHDOG_PROMPT_END
...
$summary_text
RESCUE_WATCHDOG_PROMPT_ENDThis makes accidental closure essentially impossible while keeping all $variable substitutions working.
Prompt To Fix With AI
This is a comment left during a code review.
Path: scripts/claude-rescue-runner.sh
Line: 34-67
Comment:
**Heredoc delimiter could be prematurely closed**
The heredoc uses `EOF` as its delimiter, and `$summary_text` is expanded inside it. If an `openclaw` error message ever produces a line that is exactly the string `EOF` (e.g. a log line like `... error at EOF`), bash will close the heredoc at that line, silently truncating the prompt written to `$PROMPT_FILE`.
Using a less likely delimiter prevents this:
```suggestion
cat >"$PROMPT_FILE" <<'CLAUDE_PROMPT_EOF'
An OpenClaw rescue watchdog detected a runtime failure.
Goals:
- Restore reply capability with the smallest safe fix.
- Prefer narrow changes over refactors.
- Do not commit, push, or modify files outside this workspace.
- Use the captured incident artifacts before exploring broadly.
Workspace:
- Repo: WORKSPACE_DIR_PLACEHOLDER
- Incident dir: INCIDENT_DIR_PLACEHOLDER
```
Note: switching to a quoted (`<<'...'`) heredoc disables variable expansion. A safer approach that keeps expansion is to use a unique, non-guessable delimiter:
```bash
cat >"$PROMPT_FILE" <<RESCUE_WATCHDOG_PROMPT_END
...
$summary_text
RESCUE_WATCHDOG_PROMPT_END
```
This makes accidental closure essentially impossible while keeping all `$variable` substitutions working.
How can I resolve this? If you propose a fix, please make it concise.|
Addressed the Greptile feedback in d8b47c5.
Re-checked both scripts with |
|
Added a follow-up fix for the failing merge check in This adds explicit exported helper types so the generated declaration surface no longer depends on inferred Local verification on this branch:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8e1544c663
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if [ -z "$incident_reason" ]; then | ||
| echo "openclaw rescue watchdog: healthy" | ||
| exit 0 |
There was a problem hiding this comment.
Clear suppression state after a healthy cycle
When incident_reason is empty, this branch exits immediately without updating state.env, so LAST_FINGERPRINT and LAST_TRIGGERED_AT from a prior outage remain active. In a fail → recover → fail sequence within OPENCLAW_RESCUE_COOLDOWN_SEC, the second outage is incorrectly treated as a duplicate and skips incident capture/runner execution/notification even though it is a new event after recovery.
Useful? React with 👍 / 👎.
|
Updated the branch to current Local verification on this branch:
The watchdog changes are unchanged; this follow-up only touches test typing/inference on top of the current base. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 22b231e30c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| incident_id="$(date -u +"%Y%m%dT%H%M%SZ")" | ||
| incident_dir="$INCIDENT_ROOT/$incident_id" | ||
| mkdir -p "$incident_dir" |
There was a problem hiding this comment.
Generate unique incident IDs per run
When two watchdog invocations detect an incident within the same second (for example, an overlapping cron run plus a manual run), both derive the same incident_id and write into the same directory because the ID is second-granularity and mkdir -p reuses existing paths. Under set -e, this can cause one run to fail during file writes/copies and leave a mixed or partial incident bundle instead of two independent captures.
Useful? React with 👍 / 👎.
|
Synced this branch with the latest upstream main in commit fc6cfc7 and resolved the merge conflicts in the previously added test-only type fixes.\n\nLocal verification on 2026-03-14:\n- pnpm install --frozen-lockfile\n- pnpm tsgo\n- pnpm exec vitest run --config vitest.unit.config.ts src/plugins/install.test.ts src/hooks/install.test.ts src/gateway/server-runtime-config.test.ts src/gateway/server.talk-config.test.ts src/telegram/network-config.test.ts src/infra/outbound/targets.test.ts src/infra/provider-usage.auth.normalizes-keys.test.ts\n- pnpm check\n\nThis also aligns the npm install assertion helper with the current install command shape after the recent upstream changes. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fc6cfc7c24
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| fi | ||
| printf '%s\n' "$runner_status" >"$incident_dir/runner.status" | ||
|
|
||
| write_state "$fingerprint" "$now_epoch" "$incident_id" |
There was a problem hiding this comment.
Persist suppression state before invoking runner
The watchdog writes state.env only after RUNNER_CMD returns, which leaves the latest fingerprint/cooldown invisible during long rescue runs. If another invocation starts in that window (for example, cron overlap or a manual rerun), it will treat the same outage as new and run a second incident capture/runner cycle, producing duplicate remediation attempts and notifications against the same failure. Save suppression state before launching the runner (or guard with a lock) so overlapping executions are deduplicated immediately.
Useful? React with 👍 / 👎.
|
Split the inherited CI fixes out into separate PRs so this feature PR can stay scoped:
Those failures were inherited from |
|
This pull request has been automatically marked as stale due to inactivity. |
|
Codex review: found issues before merge. Summary Reproducibility: yes. for the patch defects by source inspection. The fail-recover-fail cooldown path, same-second incident directory collision, and long-runner overlap follow directly from the proposed shell control flow; I did not execute the scripts because this review was read-only. Next step before merge Security Review findings
Review detailsBest possible solution: Pick one canonical rescue model first; if this external watchdog remains desired, harden state, concurrency, and sensitive-bundle handling, add focused scripted coverage, and split unrelated test typing changes. Do we have a high-confidence way to reproduce the issue? Yes for the patch defects by source inspection. The fail-recover-fail cooldown path, same-second incident directory collision, and long-runner overlap follow directly from the proposed shell control flow; I did not execute the scripts because this review was read-only. Is this the best way to solve the issue? No. The external watchdog may be useful, but this PR is not the best mergeable form until maintainers choose the rescue architecture and security posture across the overlapping rescue PRs and the concrete script defects are fixed. Full review comments:
Overall correctness: patch is incorrect Security concerns:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 88b21427f8ad. |
Summary
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
scripts/rescue-watchdog.shto detectopenclaw health --jsonfailures and channellastErrorsnapshots, capture incident bundles, de-duplicate repeats, optionally notify, and optionally invoke a rescue runner.scripts/claude-rescue-runner.shas a minimal Claude Code handoff runner for captured incidents.scripts/systemd/openclaw-rescue-watchdog.{service,timer}templates.docs/automation/rescue-watchdog.mdand linked it from auth monitoring docs.Security Impact (required)
Yes)No)Yes)Yes)Yes)Yes, explain risk + mitigation:The watchdog and runner are optional, external scripts. They execute only when explicitly installed/configured, default to no notification target and no runner, and keep the repair path outside the gateway process. The Claude runner is conservative by default (
acceptEdits) and does not commit or push changes.Repro + Verification
Environment
CLAUDE_BIN=echoOPENCLAW_BIN=falseSteps
bash -n scripts/rescue-watchdog.shandbash -n scripts/claude-rescue-runner.sh.OPENCLAW_BIN=false OPENCLAW_RESCUE_STATE_DIR="$PWD/.artifacts/rescue-watchdog-test" ./scripts/rescue-watchdog.sh.OPENCLAW_RESCUE_INCIDENT_DIR=... OPENCLAW_RESCUE_WORKSPACE_DIR="$PWD" CLAUDE_BIN=echo ./scripts/claude-rescue-runner.shto validate prompt/output generation.Expected
claude-prompt.txtandclaude-output.txt.Actual
Evidence
Human Verification (required)
What you personally verified (not just CI), and how:
healthcommand failure path, repeated identical incident suppression, runner not configured path.Review Conversations
Compatibility / Migration
Yes)Yes)No)Set the optional watchdog/runner environment variables and enable the provided systemd timer if you want the rescue flow.
Failure Recovery (if this breaks)
scripts/systemd/openclaw-rescue-watchdog.{service,timer}from the user systemd dir and unsetOPENCLAW_RESCUE_*/CLAUDE_RESCUE_*env vars.claudeis unavailable on PATH.Risks and Mitigations
Mitigation: the docs explicitly describe the flow as optional and external.
Mitigation: the sample runner defaults to
acceptEdits, does not commit/push, and is fully configurable via env vars.