feat: add external rescue watchdog by NickHood1984 · Pull Request #40791 · openclaw/openclaw

NickHood1984 · 2026-03-09T08:50:09Z

Summary

Problem: OpenClaw can report health and channel errors, but there was no external rescue loop that stays alive when the gateway itself is unhealthy.
Why it matters: in-process recovery dies with the same process, so failures that stop replies had to be diagnosed and handed off manually.
What changed: added an external rescue watchdog script, a Claude Code runner, systemd user service/timer templates, and setup docs.
What did NOT change (scope boundary): no core gateway runtime behavior, no automatic commit/push flow, and no bundled launchd plist.

Change Type (select all)

Feature
Docs
Chore/infra

Scope (select all touched areas)

Gateway / orchestration
UI / DX
CI/CD / infra

Linked Issue/PR

Related #

User-visible / Behavior Changes

Added scripts/rescue-watchdog.sh to detect openclaw health --json failures and channel lastError snapshots, capture incident bundles, de-duplicate repeats, optionally notify, and optionally invoke a rescue runner.
Added scripts/claude-rescue-runner.sh as a minimal Claude Code handoff runner for captured incidents.
Added scripts/systemd/openclaw-rescue-watchdog.{service,timer} templates.
Added docs/automation/rescue-watchdog.md and linked it from auth monitoring docs.

Security Impact (required)

New permissions/capabilities? (Yes)
Secrets/tokens handling changed? (No)
New/changed network calls? (Yes)
Command/tool execution surface changed? (Yes)
Data access scope changed? (Yes)
If any Yes, explain risk + mitigation:
The watchdog and runner are optional, external scripts. They execute only when explicitly installed/configured, default to no notification target and no runner, and keep the repair path outside the gateway process. The Claude runner is conservative by default (acceptEdits) and does not commit or push changes.

Repro + Verification

Environment

OS: macOS
Runtime/container: local source checkout
Model/provider: Claude Code CLI runner path validated with a stubbed CLAUDE_BIN=echo
Integration/channel (if any): none configured for notification during verification
Relevant config (redacted): synthetic incident via OPENCLAW_BIN=false

Steps

Run bash -n scripts/rescue-watchdog.sh and bash -n scripts/claude-rescue-runner.sh.
Trigger a synthetic incident with OPENCLAW_BIN=false OPENCLAW_RESCUE_STATE_DIR="$PWD/.artifacts/rescue-watchdog-test" ./scripts/rescue-watchdog.sh.
Re-run the same command to verify duplicate suppression.
Run OPENCLAW_RESCUE_INCIDENT_DIR=... OPENCLAW_RESCUE_WORKSPACE_DIR="$PWD" CLAUDE_BIN=echo ./scripts/claude-rescue-runner.sh to validate prompt/output generation.

Expected

The watchdog writes an incident bundle on first failure.
Repeated identical incidents within cooldown are skipped.
The Claude runner writes claude-prompt.txt and claude-output.txt.
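
The duplicate suppression expected above can be sketched roughly as follows. This is an illustration only, not the code in scripts/rescue-watchdog.sh: the helper name should_trigger and the state-file layout are assumptions, and only the LAST_FINGERPRINT and LAST_TRIGGERED_AT names come from the review thread.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: decide whether an incident is new or a repeat.
# Args: state file, fingerprint, current epoch seconds, cooldown seconds.
# Returns 0 (trigger and record) or 1 (suppress as a duplicate).
should_trigger() {
  local state_file="$1" fingerprint="$2" now="$3" cooldown="$4"
  local last_fp="" last_at=0 key value
  if [ -f "$state_file" ]; then
    # Read KEY=VALUE pairs as data; unknown keys are ignored.
    while IFS='=' read -r key value; do
      case "$key" in
        LAST_FINGERPRINT) last_fp="$value" ;;
        LAST_TRIGGERED_AT) last_at="$value" ;;
      esac
    done <"$state_file"
  fi
  # Same fingerprint inside the cooldown window counts as a duplicate.
  if [ "$fingerprint" = "$last_fp" ] && [ $((now - last_at)) -lt "$cooldown" ]; then
    return 1
  fi
  printf 'LAST_FINGERPRINT=%s\nLAST_TRIGGERED_AT=%s\n' "$fingerprint" "$now" >"$state_file"
  return 0
}
```

Because the script uses set -e, the helper is meant to be called inside a condition, for example: if should_trigger "$state_file" "$fingerprint" "$(date +%s)" 600; then ...; fi.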

Actual

Matched expected behavior.

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

Verified scenarios: shell syntax checks, synthetic incident capture, duplicate suppression, runner prompt/output generation.
Edge cases checked: health command failure path, repeated identical incident suppression, runner not configured path.
What you did not verify: a live end-to-end repair against a real broken gateway or a real notification channel.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? (Yes)
Config/env changes? (Yes)
Migration needed? (No)
If yes, exact upgrade steps:
Set the optional watchdog/runner environment variables and enable the provided systemd timer if you want the rescue flow.

Failure Recovery (if this breaks)

How to disable/revert this change quickly: do not install/enable the timer, or remove the copied user systemd unit files.
Files/config to restore: remove scripts/systemd/openclaw-rescue-watchdog.{service,timer} from the user systemd dir and unset OPENCLAW_RESCUE_* / CLAUDE_RESCUE_* env vars.
Known bad symptoms reviewers should watch for: unexpected repeated incident notifications due to local environment differences, or runner failures because claude is unavailable on PATH.

Risks and Mitigations

Risk: operators may assume the rescue path is active by default.
Mitigation: the docs explicitly describe the flow as optional and external.
Risk: the runner may be too permissive for some environments.
Mitigation: the sample runner defaults to acceptEdits, does not commit/push, and is fully configurable via env vars.

greptile-apps · 2026-03-09T08:52:49Z

Greptile Summary

This PR adds an optional external rescue watchdog (scripts/rescue-watchdog.sh) that polls openclaw health, captures incident bundles on failure, deduplicates repeat events, optionally sends notifications, and can hand the incident off to a Claude Code runner (scripts/claude-rescue-runner.sh). Systemd user timer templates and setup docs are also included. The change is additive and backward compatible — nothing runs unless explicitly installed.

Key findings:

scripts/rescue-watchdog.sh line 240: the runner is launched via bash -lc "$RUNNER_CMD" (login shell). On many systems profile scripts (.bash_profile, etc.) reset PATH or unset environment variables, including the OPENCLAW_RESCUE_* variables the watchdog exports. If those variables are cleared, claude-rescue-runner.sh exits with OPENCLAW_RESCUE_INCIDENT_DIR is required. Using bash -c instead of bash -lc would be safer unless login-shell PATH resolution is explicitly needed.
scripts/claude-rescue-runner.sh lines 34–67: the heredoc uses EOF as its delimiter while expanding $summary_text inside it. Although summary_text is currently always single-line, any future openclaw error message that produces a bare EOF line would silently truncate the Claude prompt. Using a less common delimiter (e.g. RESCUE_WATCHDOG_PROMPT_END) eliminates this edge case at no cost.

Confidence Score: 3/5

Safe to merge with minor fixes; the login-shell runner invocation could silently break the rescue path in certain environments.
The change is fully opt-in and additive — no existing behavior is altered. The two flagged issues are low-to-moderate: the bash -lc runner invocation is a real risk in environments with profile-script side effects, and the heredoc delimiter is a defensive concern rather than a confirmed bug. Neither blocks the happy path described in the PR, but the bash -lc issue could cause the runner to fail silently in CI or containerized setups where profile scripts reset the environment.
Pay close attention to scripts/rescue-watchdog.sh (runner invocation at line 240) and scripts/claude-rescue-runner.sh (heredoc delimiter at lines 34–67).

Last reviewed commit: 817a95c

greptile-apps · 2026-03-09T08:52:52Z

+  if bash -lc "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
+    runner_status="ok"
+  else
+    runner_status="failed:$?"
+  fi


Login shell may reset environment for runner

bash -lc "$RUNNER_CMD" spawns a login shell, which sources /etc/profile, ~/.bash_profile, and similar profile scripts. In some environments these profile scripts reset PATH, unset exported variables, or override other env vars that the watchdog carefully exports (e.g. OPENCLAW_RESCUE_INCIDENT_DIR, OPENCLAW_RESCUE_WORKSPACE_DIR). If those variables disappear, the runner will fail with OPENCLAW_RESCUE_INCIDENT_DIR is required.

For the typical operator use-case where RUNNER_CMD is simply a script path, using bash -c (non-login) or direct execution would be more predictable:

Suggested change

-  if bash -lc "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
-    runner_status="ok"
-  else
-    runner_status="failed:$?"
-  fi
+  if bash -c "$RUNNER_CMD" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
+    runner_status="ok"
+  else
+    runner_status="failed:$?"
+  fi

If login-shell PATH resolution is intentionally needed (e.g. so the runner can find claude installed via a shell profile), consider documenting this expectation, or have the runner resolve its own PATH instead.
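
To illustrate the suggestion: exported variables are inherited by a non-login bash -c child, and a runner can fail fast on a missing variable with the ${VAR:?message} expansion. This is a minimal sketch; run_runner and demo_runner are hypothetical stand-ins, not code from the PR, and the error text mirrors the message quoted above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper: launch the runner in a non-login shell so login
# profile scripts are not sourced and exported variables pass through.
run_runner() {
  local runner_cmd="$1" incident_dir="$2"
  export OPENCLAW_RESCUE_INCIDENT_DIR="$incident_dir"
  if bash -c "$runner_cmd" >"$incident_dir/runner.stdout" 2>"$incident_dir/runner.stderr"; then
    echo "ok"
  else
    # $? still holds the exit status of the failed bash -c command here.
    echo "failed:$?"
  fi
}

# Stand-in for the runner: abort with a message if the variable is unset.
demo_runner='
: "${OPENCLAW_RESCUE_INCIDENT_DIR:?OPENCLAW_RESCUE_INCIDENT_DIR is required}"
echo "incident dir: $OPENCLAW_RESCUE_INCIDENT_DIR"
'
```

If claude is only reachable through a login profile, one option in this sketch would be for the runner to extend PATH itself rather than rely on bash -lc.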

greptile-apps · 2026-03-09T08:52:53Z

+cat >"$PROMPT_FILE" <<EOF
+An OpenClaw rescue watchdog detected a runtime failure.
+
+Goals:
+- Restore reply capability with the smallest safe fix.
+- Prefer narrow changes over refactors.
+- Do not commit, push, or modify files outside this workspace.
+- Use the captured incident artifacts before exploring broadly.
+
+Workspace:
+- Repo: $WORKSPACE_DIR
+- Incident dir: $INCIDENT_DIR
+
+Artifacts:
+- Summary: $SUMMARY_FILE
+- Health snapshot: $HEALTH_FILE
+- CLI status: $STATUS_FILE
+- Gateway log tail: $LOG_FILE
+
+Required steps:
+1. Read the incident artifacts first.
+2. Inspect only the code paths suggested by the incident.
+3. If you find a safe fix, apply it in this workspace.
+4. Run focused verification. Prefer pnpm build; if that is too expensive, explain why and run a narrower check.
+5. If possible, run openclaw health --json --timeout 10000 after the fix.
+6. Print a concise operator summary with:
+   - root cause
+   - files changed
+   - verification run
+   - remaining risk
+
+Incident summary:
+$summary_text
+EOF


Heredoc delimiter could be prematurely closed

The heredoc uses EOF as its delimiter, and $summary_text is expanded inside it. If an openclaw error message ever produces a line that is exactly the string EOF (e.g. a log line like ... error at EOF), bash will close the heredoc at that line, silently truncating the prompt written to $PROMPT_FILE.

Using a less likely delimiter prevents this:

Suggested change

-cat >"$PROMPT_FILE" <<EOF
-An OpenClaw rescue watchdog detected a runtime failure.
-
-Goals:
-- Restore reply capability with the smallest safe fix.
-- Prefer narrow changes over refactors.
-- Do not commit, push, or modify files outside this workspace.
-- Use the captured incident artifacts before exploring broadly.
-
-Workspace:
-- Repo: $WORKSPACE_DIR
-- Incident dir: $INCIDENT_DIR
-
-Artifacts:
-- Summary: $SUMMARY_FILE
-- Health snapshot: $HEALTH_FILE
-- CLI status: $STATUS_FILE
-- Gateway log tail: $LOG_FILE
-
-Required steps:
-1. Read the incident artifacts first.
-2. Inspect only the code paths suggested by the incident.
-3. If you find a safe fix, apply it in this workspace.
-4. Run focused verification. Prefer pnpm build; if that is too expensive, explain why and run a narrower check.
-5. If possible, run openclaw health --json --timeout 10000 after the fix.
-6. Print a concise operator summary with:
-   - root cause
-   - files changed
-   - verification run
-   - remaining risk
-
-Incident summary:
-$summary_text
-EOF
+cat >"$PROMPT_FILE" <<'CLAUDE_PROMPT_EOF'
+An OpenClaw rescue watchdog detected a runtime failure.
+
+Goals:
+- Restore reply capability with the smallest safe fix.
+- Prefer narrow changes over refactors.
+- Do not commit, push, or modify files outside this workspace.
+- Use the captured incident artifacts before exploring broadly.
+
+Workspace:
+- Repo: WORKSPACE_DIR_PLACEHOLDER
+- Incident dir: INCIDENT_DIR_PLACEHOLDER

Note: switching to a quoted (<<'...') heredoc disables variable expansion. A safer approach that keeps expansion is to use a unique, non-guessable delimiter:

cat >"$PROMPT_FILE" <<RESCUE_WATCHDOG_PROMPT_END
...
$summary_text
RESCUE_WATCHDOG_PROMPT_END

This makes accidental closure essentially impossible while keeping all $variable substitutions working.
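
A small sketch of the unquoted, distinctively named delimiter, for anyone who wants to check the behavior locally. The helper write_prompt and the shortened prompt body are hypothetical. As far as I understand bash's parsing, the terminator is matched against the script's own lines before expansion, so a value that merely expands to the delimiter text is written out as data; the distinctive delimiter mainly guards against literal lines in the template itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper mirroring the runner's prompt generation with a
# shortened body. The delimiter is unquoted, so $summary_text expands.
write_prompt() {
  local prompt_file="$1" summary_text="$2"
  cat >"$prompt_file" <<RESCUE_WATCHDOG_PROMPT_END
Incident summary:
$summary_text
End of prompt.
RESCUE_WATCHDOG_PROMPT_END
}
```

Calling write_prompt with a multi-line summary and then inspecting the file with cat is enough to confirm that every line of the summary reaches the prompt.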

NickHood1984 · 2026-03-13T17:48:29Z

Addressed the Greptile feedback in d8b47c5.

switched the runner invocation from bash -lc to bash -c so the exported OPENCLAW_RESCUE_* environment survives runner launch more predictably
changed the Claude prompt heredoc delimiter from EOF to RESCUE_WATCHDOG_PROMPT_END to avoid accidental prompt truncation

Re-checked both scripts with bash -n.

NickHood1984 · 2026-03-13T18:08:25Z

Added a follow-up fix for the failing merge check in src/gateway/server-methods/chat.abort.test-helpers.ts.

This adds explicit exported helper types so the generated declaration surface no longer depends on inferred @vitest/spy internals (TS2883).

Local verification on this branch:

pnpm check

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e1544c663

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-13T18:14:12Z

+if [ -z "$incident_reason" ]; then
+  echo "openclaw rescue watchdog: healthy"
+  exit 0


Clear suppression state after a healthy cycle

When incident_reason is empty, this branch exits immediately without updating state.env, so LAST_FINGERPRINT and LAST_TRIGGERED_AT from a prior outage remain active. In a fail → recover → fail sequence within OPENCLAW_RESCUE_COOLDOWN_SEC, the second outage is incorrectly treated as a duplicate and skips incident capture/runner execution/notification even though it is a new event after recovery.
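
One possible shape for the fix, as a sketch under assumptions: finish_if_healthy is a hypothetical helper, and removing state.env on a healthy probe is only one way to reset suppression (rewriting it with empty values would work too).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical healthy-path handler: when no incident was detected, drop
# the suppression state so the next failure counts as a new incident,
# even if it happens inside the cooldown window.
# Returns 0 if the probe was healthy, 1 if an incident is pending.
finish_if_healthy() {
  local incident_reason="$1" state_file="$2"
  if [ -z "$incident_reason" ]; then
    rm -f -- "$state_file"
    echo "openclaw rescue watchdog: healthy"
    return 0
  fi
  return 1
}
```

In the watchdog this would replace the early exit, for example: if finish_if_healthy "$incident_reason" "$state_file"; then exit 0; fi.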

NickHood1984 · 2026-03-13T18:31:42Z

Updated the branch to current main and fixed the resulting test-only type regressions that were failing the merge check.

Local verification on this branch:

pnpm check

The watchdog changes are unchanged; this follow-up only touches test typing/inference on top of the current base.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22b231e30c

chatgpt-codex-connector · 2026-03-13T18:38:50Z

+incident_id="$(date -u +"%Y%m%dT%H%M%SZ")"
+incident_dir="$INCIDENT_ROOT/$incident_id"
+mkdir -p "$incident_dir"


Generate unique incident IDs per run

When two watchdog invocations detect an incident within the same second (for example, an overlapping cron run plus a manual run), both derive the same incident_id and write into the same directory because the ID is second-granularity and mkdir -p reuses existing paths. Under set -e, this can cause one run to fail during file writes/copies and leave a mixed or partial incident bundle instead of two independent captures.
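
A sketch of one way to get per-run directories, assuming mktemp is available (it ships with both GNU coreutils and macOS). The helper name new_incident_dir is hypothetical; the timestamp format is the one from the diff above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: create a unique incident directory per run and
# print its path.
new_incident_dir() {
  local incident_root="$1" stamp
  stamp="$(date -u +"%Y%m%dT%H%M%SZ")"
  mkdir -p "$incident_root"
  # mktemp -d never reuses an existing path: the XXXXXX suffix is
  # replaced with random characters and creation fails on collision.
  mktemp -d "$incident_root/$stamp-XXXXXX"
}
```

The timestamp prefix keeps bundles sortable by time, and the random suffix separates runs that start in the same second.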

NickHood1984 · 2026-03-14T03:29:19Z

Synced this branch with the latest upstream main in commit fc6cfc7 and resolved the merge conflicts in the previously added test-only type fixes.

Local verification on 2026-03-14:

pnpm install --frozen-lockfile
pnpm tsgo
pnpm exec vitest run --config vitest.unit.config.ts src/plugins/install.test.ts src/hooks/install.test.ts src/gateway/server-runtime-config.test.ts src/gateway/server.talk-config.test.ts src/telegram/network-config.test.ts src/infra/outbound/targets.test.ts src/infra/provider-usage.auth.normalizes-keys.test.ts
pnpm check

This also aligns the npm install assertion helper with the current install command shape after the recent upstream changes.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc6cfc7c24

chatgpt-codex-connector · 2026-03-14T03:36:02Z

+fi
+printf '%s\n' "$runner_status" >"$incident_dir/runner.status"
+
+write_state "$fingerprint" "$now_epoch" "$incident_id"


Persist suppression state before invoking runner

The watchdog writes state.env only after RUNNER_CMD returns, which leaves the latest fingerprint/cooldown invisible during long rescue runs. If another invocation starts in that window (for example, cron overlap or a manual rerun), it will treat the same outage as new and run a second incident capture/runner cycle, producing duplicate remediation attempts and notifications against the same failure. Save suppression state before launching the runner (or guard with a lock) so overlapping executions are deduplicated immediately.
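
Both options mentioned above (saving state first, or guarding with a lock) can be combined, roughly as in this sketch. Everything here is an assumption for illustration: run_rescue_cycle, the watchdog.lock directory name, and the state-file layout are hypothetical. A mkdir-based lock is used because flock is not available by default on macOS.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical rescue cycle: take an atomic lock, persist suppression
# state, and only then start the (possibly long-running) runner.
run_rescue_cycle() {
  local state_dir="$1" fingerprint="$2" now_epoch="$3" runner_cmd="$4"
  local lock_dir="$state_dir/watchdog.lock" runner_status
  mkdir -p "$state_dir"
  # mkdir without -p is atomic: only one concurrent caller succeeds.
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "skipped: another run holds the lock"
    return 0
  fi
  # Persist state before the runner so overlapping runs see it at once.
  printf 'LAST_FINGERPRINT=%s\nLAST_TRIGGERED_AT=%s\n' \
    "$fingerprint" "$now_epoch" >"$state_dir/state.env"
  if bash -c "$runner_cmd"; then
    runner_status="ok"
  else
    runner_status="failed:$?"
  fi
  rmdir "$lock_dir"
  echo "runner: $runner_status"
}
```

A run that is killed mid-cycle would leave the lock directory behind, so a real implementation would also need a stale-lock policy (for example a trap on EXIT or an age check).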

NickHood1984 · 2026-03-14T08:11:06Z

Split the inherited CI fixes out into separate PRs so this feature PR can stay scoped:

extensions/diffs: fix: load pierre themes without json module imports #45869
Windows path/runtime tests: fix: align windows path tests with runtime behavior #45870

Those failures were inherited from main, not introduced by the rescue watchdog changes.

openclaw-barnacle · 2026-04-25T04:25:53Z

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

clawsweeper · 2026-04-27T10:44:35Z

Codex review: found issues before merge.

Summary
This PR adds optional external rescue watchdog and Claude runner scripts, systemd user timer templates, rescue watchdog docs, and test-only TypeScript annotation changes.

Reproducibility: yes, for the patch defects, by source inspection. The fail-recover-fail cooldown path, same-second incident directory collision, and long-runner overlap follow directly from the proposed shell control flow; I did not execute the scripts because this review was read-only.

Next step before merge
Maintainers need to choose the rescue architecture and security posture across this PR, #46502, and #44441 before an automated repair branch is safe.

Security
Needs attention: The diff adds opt-in command execution and diagnostic capture without enough state-file and incident-data hardening.

Review findings

[P2] Clear suppression state after healthy probes — scripts/rescue-watchdog.sh:175-177
[P2] Use unique incident directories per run — scripts/rescue-watchdog.sh:196-198
[P2] Persist suppression state before launching the runner — scripts/rescue-watchdog.sh:250

Review details

Best possible solution:

Pick one canonical rescue model first; if this external watchdog remains desired, harden state, concurrency, and sensitive-bundle handling, add focused scripted coverage, and split unrelated test typing changes.

Do we have a high-confidence way to reproduce the issue?

Yes for the patch defects by source inspection. The fail-recover-fail cooldown path, same-second incident directory collision, and long-runner overlap follow directly from the proposed shell control flow; I did not execute the scripts because this review was read-only.

Is this the best way to solve the issue?

No. The external watchdog may be useful, but this PR is not the best mergeable form until maintainers choose the rescue architecture and security posture across the overlapping rescue PRs and the concrete script defects are fixed.

Full review comments:

[P2] Clear suppression state after healthy probes — scripts/rescue-watchdog.sh:175-177
When incident_reason is empty, the script exits without updating state.env, so a fail -> recover -> fail sequence inside OPENCLAW_RESCUE_COOLDOWN_SEC can suppress the fresh outage and skip incident capture, runner execution, and notification.
Confidence: 0.86
[P2] Use unique incident directories per run — scripts/rescue-watchdog.sh:196-198
incident_id has only second-level granularity and mkdir -p reuses an existing directory. Two overlapping timer/manual runs in the same second can write into the same bundle and leave mixed or partial incident artifacts.
Confidence: 0.88
[P2] Persist suppression state before launching the runner — scripts/rescue-watchdog.sh:250
state.env is saved only after RUNNER_CMD returns. During a long rescue run, another watchdog invocation sees no updated fingerprint and can launch a duplicate runner or notification cycle for the same outage.
Confidence: 0.89

Overall correctness: patch is incorrect
Overall confidence: 0.89

Security concerns:

[medium] Harden sourced watchdog state — scripts/rescue-watchdog.sh:39
load_state sources state.env from an environment-configurable state directory without regular-file, ownership, permission, or symlink checks. Replacing that file would turn the next watchdog run into shell execution under the operator account.
Confidence: 0.84
[medium] Document sensitive incident bundle handling — docs/automation/rescue-watchdog.md:38
The docs describe bundles containing health/status snapshots, gateway log tails, git state, diffs, and runner output, and the runner adds that directory to Claude context. Operators need explicit permissions, retention, and sensitive-data guidance before enabling this flow.
Confidence: 0.82
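
For the sourced-state concern above, one hedged alternative is to validate the file and parse it as data instead of sourcing it. The sketch below is not the PR's load_state; load_state_safely is a hypothetical replacement, and the two variable names are the ones quoted earlier in this thread.

```shell
#!/usr/bin/env bash
set -euo pipefail

LAST_FINGERPRINT=""
LAST_TRIGGERED_AT=0

# Hypothetical replacement for sourcing state.env: refuse symlinks,
# non-regular files, and files owned by another user, then read
# KEY=VALUE lines as plain data so nothing in the file is executed.
load_state_safely() {
  local state_file="$1" key value
  if [ ! -e "$state_file" ] && [ ! -L "$state_file" ]; then
    return 0  # no state yet
  fi
  if [ -L "$state_file" ] || [ ! -f "$state_file" ] || [ ! -O "$state_file" ]; then
    echo "refusing to load $state_file: not a regular file owned by this user" >&2
    return 1
  fi
  while IFS='=' read -r key value; do
    case "$key" in
      LAST_FINGERPRINT) LAST_FINGERPRINT="$value" ;;
      LAST_TRIGGERED_AT)
        case "$value" in
          ''|*[!0-9]*) ;;  # ignore non-numeric timestamps
          *) LAST_TRIGGERED_AT="$value" ;;
        esac ;;
    esac
  done <"$state_file"
}
```

Creating the state directory with mode 700 and the incident bundles with a restrictive umask would complement this, since the bundles can contain log tails and diffs.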

What I checked:

Current main lacks rescue watchdog surface: A targeted current-main search found no OPENCLAW_RESCUE, rescue-watchdog, claude-rescue, or rescueWatchdog implementation under docs, scripts, src, packages, extensions, CHANGELOG.md, or package.json. (88b21427f8ad)
Health command remains gateway-dependent: Current main documents and implements health by calling the running gateway; channel issues are reported but explicitly not fatal to the health command. (src/commands/health.ts:545, 88b21427f8ad)
Current recovery is in-process: The existing channel health monitor runs inside the gateway process and restarts unhealthy channels through channelManager, so it does not provide the external rescue loop proposed by this PR. (src/gateway/channel-health-monitor.ts:76, 88b21427f8ad)
PR defect: healthy cycles do not clear cooldown state: The PR exits immediately on a healthy probe without loading or clearing prior suppression state, so a fail-recover-fail sequence can be skipped as a duplicate inside the cooldown window. (scripts/rescue-watchdog.sh:175, fc6cfc7c24b1)
PR defect: incident IDs collide within one second: The PR uses a UTC timestamp with second-level granularity and mkdir -p, allowing overlapping runs in the same second to write into the same incident directory. (scripts/rescue-watchdog.sh:196, fc6cfc7c24b1)
PR defect: cooldown state is persisted after runner completion: The PR writes state.env only after RUNNER_CMD returns, so overlapping invocations during a long rescue run can launch duplicate runners and notifications for the same outage. (scripts/rescue-watchdog.sh:250, fc6cfc7c24b1)

Likely related people:

steipete: Recent and repeated changes across gateway health/status, channel startup, channel-health policy, and docs health surfaces adjacent to the proposed rescue behavior. (role: recent maintainer; confidence: high; commits: db06fcd990ff, 4dd2768c4b1a, d8d0380297f4; files: src/commands/health.ts, src/gateway/channel-health-monitor.ts, src/gateway/server-channels.ts)
vincentkoc: Recent work preserved runtime-backed health state, channel readiness probes, channel runtime surfaces, and gateway health docs that define the current behavior this PR monitors externally. (role: adjacent owner; confidence: high; commits: be6263da4f51, d74f897c1c6b, ab5fcfcc0128; files: src/commands/health.ts, src/gateway/channel-health-monitor.ts, src/gateway/server-channels.ts)
rstar327: Added configurable channel-health monitor timing and restart limits, which is central adjacent behavior for deciding how external rescue should coexist with in-process recovery. (role: feature-history contributor; confidence: medium; commits: ba6064cc2256; files: src/gateway/channel-health-monitor.ts)
derankin: Introduced stale Slack socket detection and health-monitor auto-restart behavior, an earlier version of the channel recovery model this PR tries to supplement. (role: introduced adjacent behavior; confidence: medium; commits: a28a4b1b619a; files: src/gateway/channel-health-monitor.ts)

Remaining risk / open question:

The rescue architecture is unsettled between this external shell watchdog, the open core rescue service PR #46502 ("Rescue: add watchdog core service and cron engine"), and the open docs-only runbook PR #44441 ("docs: add optional macOS gateway watchdog recovery runbook (AI-assisted)").
The proposed state file and incident bundle handling introduce command-execution and sensitive-data exposure concerns.
The branch still carries test-only TypeScript annotation changes that are unrelated to the watchdog and have been partly split out into #45869 ("fix: load pierre themes without json module imports") and #45870 ("fix: align windows path tests with runtime behavior").

Codex review notes: model gpt-5.5, reasoning high; reviewed against 88b21427f8ad.

feat: add external rescue watchdog

817a95c

openclaw-barnacle Bot added the docs, scripts, and size: M labels Mar 9, 2026

greptile-apps Bot reviewed Mar 9, 2026

fix: address rescue watchdog review comments

d8b47c5

fix: annotate chat abort test helpers

8e1544c

openclaw-barnacle Bot added the gateway label Mar 13, 2026

chatgpt-codex-connector Bot reviewed Mar 13, 2026

NIckhood-123 added 2 commits March 14, 2026 02:26

Merge remote-tracking branch 'upstream/main' into codex/rescue-watchdog

5429c7d

test: fix type coverage after main merge

22b231e

openclaw-barnacle Bot added the channel: telegram label Mar 13, 2026

chatgpt-codex-connector Bot reviewed Mar 13, 2026

Merge remote-tracking branch 'upstream/main' into codex/rescue-watchdog

fc6cfc7

openclaw-barnacle Bot removed the gateway label Mar 14, 2026

chatgpt-codex-connector Bot reviewed Mar 14, 2026

This was referenced Mar 14, 2026

fix: load pierre themes without json module imports #45869

Merged

fix: align windows path tests with runtime behavior #45870

Open

openclaw-barnacle Bot added the stale label Apr 25, 2026

This was referenced Apr 27, 2026

feat: add external rescue watchdog BingqingLyu/openclaw#233

Open

fix: align windows path tests with runtime behavior BingqingLyu/openclaw#674

Open

openclaw-barnacle Bot removed the stale label Apr 28, 2026

This was referenced Apr 28, 2026

Rescue: add watchdog core service and cron engine #46502

Open

docs: add optional macOS gateway watchdog recovery runbook (AI-assisted) #44441

Open
