feat: add external rescue watchdog #40791

Open
NickHood1984 wants to merge 6 commits into openclaw:main from NickHood1984:codex/rescue-watchdog

@NickHood1984

Summary

  • Problem: OpenClaw can report health and channel errors, but there was no external rescue loop that stays alive when the gateway itself is unhealthy.
  • Why it matters: in-process recovery dies with the same process, so failures that stop replies had to be diagnosed and handed off manually.
  • What changed: added an external rescue watchdog script, a Claude Code runner, systemd user service/timer templates, and setup docs.
  • What did NOT change (scope boundary): no core gateway runtime behavior, no automatic commit/push flow, and no bundled launchd plist.

Change Type (select all)

  • Feature
  • Docs
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Related #

User-visible / Behavior Changes

  • Added scripts/rescue-watchdog.sh to detect openclaw health --json failures and channel lastError snapshots, capture incident bundles, de-duplicate repeats, optionally notify, and optionally invoke a rescue runner.
  • Added scripts/claude-rescue-runner.sh as a minimal Claude Code handoff runner for captured incidents.
  • Added scripts/systemd/openclaw-rescue-watchdog.{service,timer} templates.
  • Added docs/automation/rescue-watchdog.md and linked it from auth monitoring docs.

Security Impact (required)

  • New permissions/capabilities? (Yes)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (Yes)
  • Command/tool execution surface changed? (Yes)
  • Data access scope changed? (Yes)
  • If any Yes, explain risk + mitigation:
    The watchdog and runner are optional, external scripts. They execute only when explicitly installed/configured, default to no notification target and no runner, and keep the repair path outside the gateway process. The Claude runner is conservative by default (acceptEdits) and does not commit or push changes.

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: local source checkout
  • Model/provider: Claude Code CLI runner path validated with a stubbed CLAUDE_BIN=echo
  • Integration/channel (if any): none configured for notification during verification
  • Relevant config (redacted): synthetic incident via OPENCLAW_BIN=false

Steps

  1. Run bash -n scripts/rescue-watchdog.sh and bash -n scripts/claude-rescue-runner.sh.
  2. Trigger a synthetic incident with OPENCLAW_BIN=false OPENCLAW_RESCUE_STATE_DIR="$PWD/.artifacts/rescue-watchdog-test" ./scripts/rescue-watchdog.sh.
  3. Re-run the same command to verify duplicate suppression.
  4. Run OPENCLAW_RESCUE_INCIDENT_DIR=... OPENCLAW_RESCUE_WORKSPACE_DIR="$PWD" CLAUDE_BIN=echo ./scripts/claude-rescue-runner.sh to validate prompt/output generation.

Expected

  • The watchdog writes an incident bundle on first failure.
  • Repeated identical incidents within cooldown are skipped.
  • The Claude runner writes claude-prompt.txt and claude-output.txt.
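
The duplicate-suppression expectation can be illustrated with a minimal sketch. It assumes a fingerprint derived from the incident reason and a cooldown compared against the last trigger time; `is_duplicate`, `fingerprint_of`, and the checksum-based fingerprint are hypothetical names chosen for illustration, not taken from `scripts/rescue-watchdog.sh`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeeds (duplicate) when the fingerprint matches the
# last recorded one and the cooldown has not yet elapsed.
is_duplicate() {
  local fingerprint="$1" now="$2" last_fingerprint="$3" last_at="$4" cooldown="$5"
  [ "$fingerprint" = "$last_fingerprint" ] && [ $(( now - last_at )) -lt "$cooldown" ]
}

# Assumption: the fingerprint is just a checksum of the incident reason.
fingerprint_of() {
  printf '%s' "$1" | cksum | cut -d' ' -f1
}

fp="$(fingerprint_of "health command failed")"

# First failure: no prior state, so the incident is captured.
if is_duplicate "$fp" 1000 "" 0 900; then first="skipped"; else first="captured"; fi
# Same failure 60 seconds later, inside a 900 second cooldown: skipped.
if is_duplicate "$fp" 1060 "$fp" 1000 900; then second="skipped"; else second="captured"; fi
# Same failure after the cooldown has elapsed: captured again.
if is_duplicate "$fp" 2000 "$fp" 1000 900; then third="skipped"; else third="captured"; fi

echo "first=$first second=$second third=$third"
```

Passing the timestamps in as arguments keeps the check deterministic, which makes the cooldown boundary easy to exercise without waiting.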

Actual

  • Matched expected behavior.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: shell syntax checks, synthetic incident capture, duplicate suppression, runner prompt/output generation.
  • Edge cases checked: health command failure path, repeated identical incident suppression, runner not configured path.
  • What you did not verify: a live end-to-end repair against a real broken gateway or a real notification channel.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (Yes)
  • Migration needed? (No)
  • If yes, exact upgrade steps:
    Set the optional watchdog/runner environment variables and enable the provided systemd timer if you want the rescue flow.

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: do not install/enable the timer, or remove the copied user systemd unit files.
  • Files/config to restore: remove scripts/systemd/openclaw-rescue-watchdog.{service,timer} from the user systemd dir and unset OPENCLAW_RESCUE_* / CLAUDE_RESCUE_* env vars.
  • Known bad symptoms reviewers should watch for: unexpected repeated incident notifications due to local environment differences, or runner failures because claude is unavailable on PATH.

Risks and Mitigations

  • Risk: operators may assume the rescue path is active by default.
    Mitigation: the docs explicitly describe the flow as optional and external.
  • Risk: the runner may be too permissive for some environments.
    Mitigation: the sample runner defaults to acceptEdits, does not commit/push, and is fully configurable via env vars.

@openclaw-barnacle openclaw-barnacle Bot added the docs, scripts, and size: M labels Mar 9, 2026
@greptile-apps

greptile-apps Bot commented Mar 9, 2026

Greptile Summary

This PR adds an optional external rescue watchdog (scripts/rescue-watchdog.sh) that polls openclaw health, captures incident bundles on failure, deduplicates repeat events, optionally sends notifications, and can hand the incident off to a Claude Code runner (scripts/claude-rescue-runner.sh). Systemd user timer templates and setup docs are also included. The change is additive and backward compatible — nothing runs unless explicitly installed.

Key findings:

  • scripts/rescue-watchdog.sh line 240: the runner is launched via bash -lc "$RUNNER_CMD" (login shell). On many systems profile scripts (.bash_profile, etc.) reset PATH or unset environment variables, including the OPENCLAW_RESCUE_* variables the watchdog exports. If those variables are cleared, claude-rescue-runner.sh exits with OPENCLAW_RESCUE_INCIDENT_DIR is required. Using bash -c instead of bash -lc would be safer unless login-shell PATH resolution is explicitly needed.
  • scripts/claude-rescue-runner.sh lines 34–67: the heredoc uses EOF as its delimiter while expanding $summary_text inside it. Although summary_text is currently always single-line, any future openclaw error message that produces a bare EOF line would silently truncate the Claude prompt. Using a less common delimiter (e.g. RESCUE_WATCHDOG_PROMPT_END) eliminates this edge case at no cost.

Confidence Score: 3/5

  • Safe to merge with minor fixes; the login-shell runner invocation could silently break the rescue path in certain environments.
  • The change is fully opt-in and additive — no existing behavior is altered. The two flagged issues are low-to-moderate: the bash -lc runner invocation is a real risk in environments with profile-script side effects, and the heredoc delimiter is a defensive concern rather than a confirmed bug. Neither blocks the happy path described in the PR, but the bash -lc issue could cause the runner to fail silently in CI or containerized setups where profile scripts reset the environment.
  • Pay close attention to scripts/rescue-watchdog.sh (runner invocation at line 240) and scripts/claude-rescue-runner.sh (heredoc delimiter at lines 34–67).

Last reviewed commit: 817a95c

Comment thread scripts/rescue-watchdog.sh Outdated
Comment on lines +240 to +244
if bash -lc "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
  runner_status="ok"
else
  runner_status="failed:$?"
fi

Login shell may reset environment for runner

bash -lc "$RUNNER_CMD" spawns a login shell, which sources /etc/profile, ~/.bash_profile, and similar profile scripts. In some environments these profile scripts reset PATH, unset exported variables, or override other env vars that the watchdog carefully exports (e.g. OPENCLAW_RESCUE_INCIDENT_DIR, OPENCLAW_RESCUE_WORKSPACE_DIR). If those variables disappear, the runner will fail with OPENCLAW_RESCUE_INCIDENT_DIR is required.

For the typical operator use-case where RUNNER_CMD is simply a script path, using bash -c (non-login) or direct execution would be more predictable:

Suggested change
- if bash -lc "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
-   runner_status="ok"
- else
-   runner_status="failed:$?"
- fi
+ if bash -c "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
+   runner_status="ok"
+ else
+   runner_status="failed:$?"
+ fi

If login-shell PATH resolution is intentionally needed (e.g. so the runner can find claude installed via a shell profile), consider documenting this expectation, or have the runner resolve its own PATH instead.
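
A minimal sketch of the non-login invocation, assuming the runner command only needs the variables the watchdog has already exported. The `RUNNER_CMD` value here is a hypothetical stand-in that prints what it received; it is not the real runner.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for a variable the watchdog exports before launching the runner.
export OPENCLAW_RESCUE_INCIDENT_DIR="/tmp/demo-incident"

# Hypothetical runner command: it only prints the variable it was handed.
RUNNER_CMD='printf "%s" "${OPENCLAW_RESCUE_INCIDENT_DIR:-missing}"'

# Non-login shell: login profile scripts are not sourced, so the exported
# value reaches the child unchanged.
if seen="$(bash -c "$RUNNER_CMD")"; then
  runner_status="ok"
else
  runner_status="failed:$?"
fi

echo "runner saw: $seen (status: $runner_status)"
```

If the runner does depend on a PATH entry that only a login profile adds, one option is to set PATH explicitly in the systemd unit or at the top of the runner rather than relying on `-l`.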

Comment thread scripts/claude-rescue-runner.sh Outdated
Comment on lines +34 to +67
cat >"$PROMPT_FILE" <<EOF
An OpenClaw rescue watchdog detected a runtime failure.

Goals:
- Restore reply capability with the smallest safe fix.
- Prefer narrow changes over refactors.
- Do not commit, push, or modify files outside this workspace.
- Use the captured incident artifacts before exploring broadly.

Workspace:
- Repo: $WORKSPACE_DIR
- Incident dir: $INCIDENT_DIR

Artifacts:
- Summary: $SUMMARY_FILE
- Health snapshot: $HEALTH_FILE
- CLI status: $STATUS_FILE
- Gateway log tail: $LOG_FILE

Required steps:
1. Read the incident artifacts first.
2. Inspect only the code paths suggested by the incident.
3. If you find a safe fix, apply it in this workspace.
4. Run focused verification. Prefer pnpm build; if that is too expensive, explain why and run a narrower check.
5. If possible, run openclaw health --json --timeout 10000 after the fix.
6. Print a concise operator summary with:
- root cause
- files changed
- verification run
- remaining risk

Incident summary:
$summary_text
EOF

Heredoc delimiter could be prematurely closed

The heredoc uses EOF as its delimiter, and $summary_text is expanded inside it. If an openclaw error message ever produces a line that is exactly the string EOF (e.g. a log line like ... error at EOF), bash will close the heredoc at that line, silently truncating the prompt written to $PROMPT_FILE.

Using a less likely delimiter prevents this:

Suggested change
cat >"$PROMPT_FILE" <<'CLAUDE_PROMPT_EOF'
An OpenClaw rescue watchdog detected a runtime failure.
Goals:
- Restore reply capability with the smallest safe fix.
- Prefer narrow changes over refactors.
- Do not commit, push, or modify files outside this workspace.
- Use the captured incident artifacts before exploring broadly.
Workspace:
- Repo: WORKSPACE_DIR_PLACEHOLDER
- Incident dir: INCIDENT_DIR_PLACEHOLDER

Note: switching to a quoted (<<'...') heredoc disables variable expansion. A safer approach that keeps expansion is to use a unique, non-guessable delimiter:

cat >"$PROMPT_FILE" <<RESCUE_WATCHDOG_PROMPT_END
...
$summary_text
RESCUE_WATCHDOG_PROMPT_END

This makes accidental closure essentially impossible while keeping all $variable substitutions working.
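
Before relying on either delimiter, the truncation scenario can be checked locally. Bash matches the heredoc terminator against the lines of the script text itself, and parameter expansion happens afterwards, so an expanded value containing a bare `EOF` line should not close the heredoc. The sketch below uses a made-up `summary_text` value and a temporary file in place of `$PROMPT_FILE`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# A multi-line value whose middle line is exactly "EOF".
summary_text=$'line one\nEOF\nline three'

out="$(mktemp)"
cat >"$out" <<EOF
Incident summary:
$summary_text
EOF

# Count the lines that reached the file: the header plus three summary lines.
line_count="$(wc -l <"$out" | tr -d ' ')"
last_line="$(tail -n 1 "$out")"
rm -f "$out"

echo "lines written: $line_count, last line: $last_line"
```

A distinctive delimiter such as `RESCUE_WATCHDOG_PROMPT_END` is still a reasonable defensive choice, mainly to guard against someone later pasting a literal `EOF` line into the prompt template.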

@NickHood1984

Addressed the Greptile feedback in d8b47c5.

  • switched the runner invocation from bash -lc to bash -c so the exported OPENCLAW_RESCUE_* environment survives runner launch more predictably
  • changed the Claude prompt heredoc delimiter from EOF to RESCUE_WATCHDOG_PROMPT_END to avoid accidental prompt truncation

Re-checked both scripts with bash -n.

@openclaw-barnacle openclaw-barnacle Bot added the gateway label Mar 13, 2026
@NickHood1984

Added a follow-up fix for the failing merge check in src/gateway/server-methods/chat.abort.test-helpers.ts.

This adds explicit exported helper types so the generated declaration surface no longer depends on inferred @vitest/spy internals (TS2883).

Local verification on this branch:

  • pnpm check

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e1544c663

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +175 to +177
if [ -z "$incident_reason" ]; then
  echo "openclaw rescue watchdog: healthy"
  exit 0

[P2] Clear suppression state after a healthy cycle

When incident_reason is empty, this branch exits immediately without updating state.env, so LAST_FINGERPRINT and LAST_TRIGGERED_AT from a prior outage remain active. In a fail → recover → fail sequence within OPENCLAW_RESCUE_COOLDOWN_SEC, the second outage is incorrectly treated as a duplicate and skips incident capture/runner execution/notification even though it is a new event after recovery.
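
One way to address this, sketched under simplifying assumptions: the healthy branch removes the saved state before exiting, so the next failure is always treated as new. The `handle_probe` function, the `reason|timestamp` state format, and the temporary state directory are hypothetical and only stand in for the script's real `state.env` handling.

```shell
#!/usr/bin/env bash
set -euo pipefail

STATE_DIR="$(mktemp -d)"
STATE_FILE="$STATE_DIR/state.env"
COOLDOWN_SEC=900

# Hypothetical probe handler: $1 is the incident reason (empty when healthy),
# $2 is the current epoch time. Prints what the watchdog would do.
handle_probe() {
  local incident_reason="$1" now="$2" last_reason="" last_at=0
  if [ -z "$incident_reason" ]; then
    rm -f "$STATE_FILE"   # healthy: forget the previous outage
    echo "healthy"
    return 0
  fi
  if [ -f "$STATE_FILE" ]; then
    IFS='|' read -r last_reason last_at <"$STATE_FILE"
  fi
  if [ "$incident_reason" = "$last_reason" ] && [ $(( now - last_at )) -lt "$COOLDOWN_SEC" ]; then
    echo "suppressed"
  else
    printf '%s|%s\n' "$incident_reason" "$now" >"$STATE_FILE"
    echo "captured"
  fi
}

# fail -> recover -> fail, all inside the cooldown window.
r1="$(handle_probe "health failed" 1000)"
r2="$(handle_probe "" 1060)"
r3="$(handle_probe "health failed" 1120)"
echo "$r1 $r2 $r3"
rm -rf "$STATE_DIR"
```

Without the `rm -f` in the healthy branch, the third probe would be suppressed, which is the behavior described above.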

@NickHood1984

Updated the branch to current main and fixed the resulting test-only type regressions that were failing the merge check.

Local verification on this branch:

  • pnpm check

The watchdog changes are unchanged; this follow-up only touches test typing/inference on top of the current base.

@openclaw-barnacle openclaw-barnacle Bot added the channel: telegram label Mar 13, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22b231e30c

Comment on lines +196 to +198
incident_id="$(date -u +"%Y%m%dT%H%M%SZ")"
incident_dir="$INCIDENT_ROOT/$incident_id"
mkdir -p "$incident_dir"

[P2] Generate unique incident IDs per run

When two watchdog invocations detect an incident within the same second (for example, an overlapping cron run plus a manual run), both derive the same incident_id and write into the same directory because the ID is second-granularity and mkdir -p reuses existing paths. Under set -e, this can cause one run to fail during file writes/copies and leave a mixed or partial incident bundle instead of two independent captures.
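
A possible fix, sketched with a temporary directory standing in for `$INCIDENT_ROOT`: keep the timestamp as a readable prefix and let `mktemp -d` append a random suffix. The `new_incident_dir` helper is a hypothetical name.

```shell
#!/usr/bin/env bash
set -euo pipefail

INCIDENT_ROOT="$(mktemp -d)"

# Hypothetical helper: timestamp prefix for readability, random suffix from
# mktemp for uniqueness. mktemp -d creates a new directory and fails rather
# than reusing an existing path.
new_incident_dir() {
  local stamp
  stamp="$(date -u +"%Y%m%dT%H%M%SZ")"
  mktemp -d "$INCIDENT_ROOT/$stamp-XXXXXX"
}

# Two captures, typically within the same second.
dir_a="$(new_incident_dir)"
dir_b="$(new_incident_dir)"
created_count="$(find "$INCIDENT_ROOT" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')"

echo "first:  $(basename "$dir_a")"
echo "second: $(basename "$dir_b")"
rm -rf "$INCIDENT_ROOT"
```

The incident ID can then be taken from `basename "$incident_dir"`, so artifacts and state still share one identifier.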

@openclaw-barnacle openclaw-barnacle Bot removed the gateway label Mar 14, 2026
@NickHood1984

Synced this branch with the latest upstream main in commit fc6cfc7 and resolved the merge conflicts in the previously added test-only type fixes.

Local verification on 2026-03-14:

  • pnpm install --frozen-lockfile
  • pnpm tsgo
  • pnpm exec vitest run --config vitest.unit.config.ts src/plugins/install.test.ts src/hooks/install.test.ts src/gateway/server-runtime-config.test.ts src/gateway/server.talk-config.test.ts src/telegram/network-config.test.ts src/infra/outbound/targets.test.ts src/infra/provider-usage.auth.normalizes-keys.test.ts
  • pnpm check

This also aligns the npm install assertion helper with the current install command shape after the recent upstream changes.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc6cfc7c24

fi
printf '%s\n' "$runner_status" >"$incident_dir/runner.status"

write_state "$fingerprint" "$now_epoch" "$incident_id"

[P2] Persist suppression state before invoking runner

The watchdog writes state.env only after RUNNER_CMD returns, which leaves the latest fingerprint/cooldown invisible during long rescue runs. If another invocation starts in that window (for example, cron overlap or a manual rerun), it will treat the same outage as new and run a second incident capture/runner cycle, producing duplicate remediation attempts and notifications against the same failure. Save suppression state before launching the runner (or guard with a lock) so overlapping executions are deduplicated immediately.
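
Both options can be combined in a small sketch: take a lock with `mkdir` (which is atomic and available on macOS, unlike `flock`), then write the suppression state before the runner starts. The `run_rescue` wrapper, the lock path, and the one-line state format are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

STATE_DIR="$(mktemp -d)"
LOCK_DIR="$STATE_DIR/watchdog.lock"   # hypothetical lock path
STATE_FILE="$STATE_DIR/state.env"

# Hypothetical wrapper around the capture/runner phase. mkdir either creates
# the lock directory or fails, so only one invocation proceeds at a time.
run_rescue() {
  local fingerprint="$1" runner_cmd="$2" result="ok"
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "skipped: another rescue is in progress"
    return 0
  fi
  # Record the fingerprint first so overlapping runs can see it immediately.
  printf '%s\n' "$fingerprint" >"$STATE_FILE"
  bash -c "$runner_cmd" || result="failed:$?"
  rmdir "$LOCK_DIR"
  echo "runner $result"
}

first="$(run_rescue "abc123" true)"    # normal run
saved="$(cat "$STATE_FILE")"
mkdir "$LOCK_DIR"                       # simulate a rescue still in flight
second="$(run_rescue "abc123" true)"
rmdir "$LOCK_DIR"

echo "$first / $second / state=$saved"
rm -rf "$STATE_DIR"
```

A real script would also need to handle a stale lock left behind by a killed run, for example with a `trap` that removes the lock on exit or an age check on the lock directory.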

@NickHood1984

Split the inherited CI fixes out into separate PRs so this feature PR can stay scoped:

Those failures were inherited from main, not introduced by the rescue watchdog changes.

@openclaw-barnacle

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@clawsweeper

clawsweeper Bot commented Apr 27, 2026

Codex review: found issues before merge.

Summary
This PR adds optional external rescue watchdog and Claude runner scripts, systemd user timer templates, rescue watchdog docs, and test-only TypeScript annotation changes.

Reproducibility: yes, for the patch defects, by source inspection. The fail-recover-fail cooldown path, same-second incident directory collision, and long-runner overlap follow directly from the proposed shell control flow; I did not execute the scripts because this review was read-only.

Next step before merge
Maintainers need to choose the rescue architecture and security posture across this PR, #46502, and #44441 before an automated repair branch is safe.

Security
Needs attention: The diff adds opt-in command execution and diagnostic capture without enough state-file and incident-data hardening.

Review findings

  • [P2] Clear suppression state after healthy probes — scripts/rescue-watchdog.sh:175-177
  • [P2] Use unique incident directories per run — scripts/rescue-watchdog.sh:196-198
  • [P2] Persist suppression state before launching the runner — scripts/rescue-watchdog.sh:250
Review details

Best possible solution:

Pick one canonical rescue model first; if this external watchdog remains desired, harden state, concurrency, and sensitive-bundle handling, add focused scripted coverage, and split unrelated test typing changes.

Do we have a high-confidence way to reproduce the issue?

Yes for the patch defects by source inspection. The fail-recover-fail cooldown path, same-second incident directory collision, and long-runner overlap follow directly from the proposed shell control flow; I did not execute the scripts because this review was read-only.

Is this the best way to solve the issue?

No. The external watchdog may be useful, but this PR is not the best mergeable form until maintainers choose the rescue architecture and security posture across the overlapping rescue PRs and the concrete script defects are fixed.

Full review comments:

  • [P2] Clear suppression state after healthy probes — scripts/rescue-watchdog.sh:175-177
    When incident_reason is empty, the script exits without updating state.env, so a fail -> recover -> fail sequence inside OPENCLAW_RESCUE_COOLDOWN_SEC can suppress the fresh outage and skip incident capture, runner execution, and notification.
    Confidence: 0.86
  • [P2] Use unique incident directories per run — scripts/rescue-watchdog.sh:196-198
    incident_id has only second-level granularity and mkdir -p reuses an existing directory. Two overlapping timer/manual runs in the same second can write into the same bundle and leave mixed or partial incident artifacts.
    Confidence: 0.88
  • [P2] Persist suppression state before launching the runner — scripts/rescue-watchdog.sh:250
    state.env is saved only after RUNNER_CMD returns. During a long rescue run, another watchdog invocation sees no updated fingerprint and can launch a duplicate runner or notification cycle for the same outage.
    Confidence: 0.89

Overall correctness: patch is incorrect
Overall confidence: 0.89

Security concerns:

  • [medium] Harden sourced watchdog state — scripts/rescue-watchdog.sh:39
    load_state sources state.env from an environment-configurable state directory without regular-file, ownership, permission, or symlink checks. Replacing that file would turn the next watchdog run into shell execution under the operator account.
    Confidence: 0.84
  • [medium] Document sensitive incident bundle handling — docs/automation/rescue-watchdog.md:38
    The docs describe bundles containing health/status snapshots, gateway log tails, git state, diffs, and runner output, and the runner adds that directory to Claude context. Operators need explicit permissions, retention, and sensitive-data guidance before enabling this flow.
    Confidence: 0.82
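
The state-hardening concern can be approached by parsing `state.env` instead of sourcing it. The sketch below assumes the file holds plain unquoted `KEY=VALUE` lines; `load_state_safely` is a hypothetical replacement for `load_state`, and ownership or permission checks are left out because `stat` flags differ between GNU and BSD.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical loader: reads KEY=VALUE lines instead of sourcing the file and
# refuses symlinks or anything that is not a regular file.
load_state_safely() {
  local file="$1" key value
  LAST_FINGERPRINT=""
  LAST_TRIGGERED_AT=0
  if [ -L "$file" ] || [ ! -f "$file" ]; then
    return 1
  fi
  while IFS='=' read -r key value; do
    case "$key" in
      LAST_FINGERPRINT) LAST_FINGERPRINT="$value" ;;
      LAST_TRIGGERED_AT)
        case "$value" in
          ''|*[!0-9]*) ;;                    # ignore non-numeric timestamps
          *) LAST_TRIGGERED_AT="$value" ;;
        esac ;;
    esac                                     # unknown keys are ignored
  done <"$file"
}

demo_dir="$(mktemp -d)"
# The third line would run a command if the file were sourced.
printf 'LAST_FINGERPRINT=abc123\nLAST_TRIGGERED_AT=1000\nEVIL=$(touch pwned)\n' >"$demo_dir/state.env"
ln -s "$demo_dir/state.env" "$demo_dir/link.env"

load_state_safely "$demo_dir/state.env"
fp="$LAST_FINGERPRINT"
at="$LAST_TRIGGERED_AT"
if load_state_safely "$demo_dir/link.env"; then link_result="loaded"; else link_result="rejected"; fi

echo "fingerprint=$fp at=$at symlink=$link_result"
rm -rf "$demo_dir"
```

Because values are only ever assigned, never evaluated, a tampered file can at worst produce a wrong fingerprint or timestamp rather than shell execution under the operator account.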

What I checked:

  • Current main lacks rescue watchdog surface: A targeted current-main search found no OPENCLAW_RESCUE, rescue-watchdog, claude-rescue, or rescueWatchdog implementation under docs, scripts, src, packages, extensions, CHANGELOG.md, or package.json. (88b21427f8ad)
  • Health command remains gateway-dependent: Current main documents and implements health by calling the running gateway; channel issues are reported but explicitly not fatal to the health command. (src/commands/health.ts:545, 88b21427f8ad)
  • Current recovery is in-process: The existing channel health monitor runs inside the gateway process and restarts unhealthy channels through channelManager, so it does not provide the external rescue loop proposed by this PR. (src/gateway/channel-health-monitor.ts:76, 88b21427f8ad)
  • PR defect: healthy cycles do not clear cooldown state: The PR exits immediately on a healthy probe without loading or clearing prior suppression state, so a fail-recover-fail sequence can be skipped as a duplicate inside the cooldown window. (scripts/rescue-watchdog.sh:175, fc6cfc7c24b1)
  • PR defect: incident IDs collide within one second: The PR uses a UTC timestamp with second-level granularity and mkdir -p, allowing overlapping runs in the same second to write into the same incident directory. (scripts/rescue-watchdog.sh:196, fc6cfc7c24b1)
  • PR defect: cooldown state is persisted after runner completion: The PR writes state.env only after RUNNER_CMD returns, so overlapping invocations during a long rescue run can launch duplicate runners and notifications for the same outage. (scripts/rescue-watchdog.sh:250, fc6cfc7c24b1)

Likely related people:

  • steipete: Recent and repeated changes across gateway health/status, channel startup, channel-health policy, and docs health surfaces adjacent to the proposed rescue behavior. (role: recent maintainer; confidence: high; commits: db06fcd990ff, 4dd2768c4b1a, d8d0380297f4; files: src/commands/health.ts, src/gateway/channel-health-monitor.ts, src/gateway/server-channels.ts)
  • vincentkoc: Recent work preserved runtime-backed health state, channel readiness probes, channel runtime surfaces, and gateway health docs that define the current behavior this PR monitors externally. (role: adjacent owner; confidence: high; commits: be6263da4f51, d74f897c1c6b, ab5fcfcc0128; files: src/commands/health.ts, src/gateway/channel-health-monitor.ts, src/gateway/server-channels.ts)
  • rstar327: Added configurable channel-health monitor timing and restart limits, which is central adjacent behavior for deciding how external rescue should coexist with in-process recovery. (role: feature-history contributor; confidence: medium; commits: ba6064cc2256; files: src/gateway/channel-health-monitor.ts)
  • derankin: Introduced stale Slack socket detection and health-monitor auto-restart behavior, an earlier version of the channel recovery model this PR tries to supplement. (role: introduced adjacent behavior; confidence: medium; commits: a28a4b1b619a; files: src/gateway/channel-health-monitor.ts)

Remaining risk / open question:

Codex review notes: model gpt-5.5, reasoning high; reviewed against 88b21427f8ad.

@openclaw-barnacle openclaw-barnacle Bot removed the stale label Apr 28, 2026

Labels

channel: telegram, docs, scripts, size: M
