docs: add optional macOS gateway watchdog recovery runbook (AI-assisted) #44441

Open
gitshrimp wants to merge 2 commits into openclaw:main from gitshrimp:docs/mac-watchdog-recovery

@gitshrimp

Summary

  • document an optional operator-managed macOS watchdog pattern for Gateway recovery
  • explain when launchd alone is enough vs when a second health-based watchdog may help
  • link the macOS runbook from the updating guide

Why

OpenClaw already relies on launchd/systemd for normal crash restart behavior. This PR does not change product behavior. It adds docs for advanced operators who want a separate, conservative recovery layer for cases where the Gateway process is still present but no longer healthy.

Scope / guardrails

  • docs only
  • macOS only
  • explicitly framed as optional and operator-managed
  • keeps recovery order conservative: openclaw doctor -> openclaw doctor --fix -> openclaw gateway restart
  • recommends local-only/loopback deployments and sparse probe intervals

Testing

  • lightly tested
  • validated commands/labels against current docs and local CLI behavior
  • no runtime code changes

AI assistance

This PR was AI-assisted. I reviewed and edited the final text before opening.

@openclaw-barnacle (bot) added the docs (Improvements or additions to documentation), app: macos, and size: XS labels on Mar 12, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Mar 12, 2026

Greptile Summary

This PR adds optional documentation for an operator-managed macOS launchd watchdog that monitors Gateway health and attempts recovery when the process is alive but unhealthy. The scope is correctly limited to docs, clearly framed as optional, and the recovery order (doctor -> doctor --fix -> gateway restart) is conservative and sensible.

Key findings:

  • The example watchdog script invokes openclaw as a bare command. launchd runs jobs with a very restricted default PATH (/usr/bin:/bin:/usr/sbin:/sbin), so npm global binaries (installed to /usr/local/bin, nvm shims, etc.) will not be found. Every openclaw call will silently exit with "command not found" (exit code 127), and because each remediation line uses || true, the failures are swallowed. Operators will believe the watchdog is active and protecting them when it is doing nothing. The example should use a full absolute path or inject the correct PATH via EnvironmentVariables in the plist.
  • The cross-reference added in updating.md is clean and the anchor slug correctly matches the new section heading.
  • The launchctl bootstrap/bootout commands, the plist structure, and the StartInterval: 60 interval are all correct.

Confidence Score: 3/5

  • Docs-only PR; safe to merge after addressing the PATH issue in the watchdog script example, otherwise operators will set up a watchdog that silently never fires.
  • The change is entirely documentation, so there is no risk to runtime behavior. However, the watchdog script example has a real usability defect (bare openclaw command will not be found by launchd) that would cause the feature to silently not work for most users without warning. This needs to be fixed before the docs are published to avoid giving operators a false sense of security.
  • docs/platforms/mac/bundled-gateway.md — the watchdog script's PATH handling needs to be addressed.
Last reviewed commit: fe7ef93

Comment thread docs/platforms/mac/bundled-gateway.md Outdated
Comment on lines +93 to +102
#!/bin/zsh
set -euo pipefail

if openclaw health --json >/dev/null 2>&1; then
  exit 0
fi

openclaw doctor >/dev/null 2>&1 || true
openclaw doctor --fix >/dev/null 2>&1 || true
openclaw gateway restart >/dev/null 2>&1 || true

openclaw may not be on launchd's PATH

When launchd executes this script, its default PATH is very restricted (typically /usr/bin:/bin:/usr/sbin:/sbin). npm global binaries are usually installed in locations like /usr/local/bin, ~/.npm-global/bin, or inside an nvm/nodenv shim directory — none of which are in launchd's default PATH.

As written, every openclaw call will produce a "command not found" error (exit code 127). Because the remediation lines all end in || true, those failures are silently swallowed. The watchdog will appear to run successfully (launchd shows it as active) but will do absolutely nothing, leaving operators with a false sense of security.

The fix is to either resolve the binary path explicitly in the script, or inject it via the plist's EnvironmentVariables:

Option A – resolve path at the top of the script:

Suggested change
- #!/bin/zsh
- set -euo pipefail
-
- if openclaw health --json >/dev/null 2>&1; then
-   exit 0
- fi
-
- openclaw doctor >/dev/null 2>&1 || true
- openclaw doctor --fix >/dev/null 2>&1 || true
- openclaw gateway restart >/dev/null 2>&1 || true
+ #!/bin/zsh
+ set -euo pipefail
+
+ # Resolve openclaw binary - adjust if you installed via nvm, nodenv, pnpm, etc.
+ OPENCLAW_BIN="${OPENCLAW_BIN:-/usr/local/bin/openclaw}"
+
+ if "$OPENCLAW_BIN" health --json >/dev/null 2>&1; then
+   exit 0
+ fi
+
+ "$OPENCLAW_BIN" doctor >/dev/null 2>&1 || true
+ "$OPENCLAW_BIN" doctor --fix >/dev/null 2>&1 || true
+ "$OPENCLAW_BIN" gateway restart >/dev/null 2>&1 || true

Option B – extend PATH inside the plist (add before StartInterval):

<key>EnvironmentVariables</key>
<dict>
  <key>PATH</key>
  <string>/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
</dict>

At minimum the docs should warn readers to verify (e.g. which openclaw) and substitute the correct absolute path for their install.
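
One way to keep a wrong path from failing silently is to check the binary once, up front, before any line that ends in `|| true`. The following is a minimal sketch, not part of the PR: the function name `resolve_openclaw_bin` and the log wording are hypothetical, and the default path simply mirrors the suggestion in this comment.

```shell
# Hypothetical helper (not from the PR): resolve the openclaw binary or fail loudly.
# OPENCLAW_BIN and its default mirror the suggested script; adjust for your install.
resolve_openclaw_bin() {
  local bin="${OPENCLAW_BIN:-/usr/local/bin/openclaw}"
  if [ ! -x "$bin" ]; then
    # Report on stderr so the failure is visible wherever the job's stderr is captured.
    printf 'watchdog: openclaw not executable at %s\n' "$bin" >&2
    return 127
  fi
  # Print the verified path for the caller to capture.
  printf '%s\n' "$bin"
}
```

A watchdog script could call `OPENCLAW_BIN="$(resolve_openclaw_bin)" || exit 1` as its first step, so that a missing binary produces a non-zero job exit instead of a run that looks successful.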

@gitshrimp gitshrimp force-pushed the docs/mac-watchdog-recovery branch from fe7ef93 to 500930c Compare March 12, 2026 23:23
@gitshrimp
Author

Follow-up after local validation:

  • I tightened the wording so this is explicitly scoped to local macOS LaunchAgent setups.
  • I verified that Docker/container deployments already have a documented HEALTHCHECK + restart-policy path in docs/install/docker.md and docker-compose.yml.
  • I verified that Linux/systemd deployments already document Restart=-based recovery.

So this PR is intentionally not presenting the watchdog as a cross-platform default; it documents an optional operator-managed runbook for a narrower macOS local-service case.

@gitshrimp
Author

gitshrimp commented Mar 12, 2026

Good catch — launchd's default PATH was the real footgun here, and the original example could indeed fail quietly while still looking "installed".

I updated the docs to make that failure mode much less likely:

  • the script now uses OPENCLAW_BIN with an absolute default path instead of a bare openclaw command
  • the docs explicitly tell the operator to verify the path first with which openclaw
  • the LaunchAgent example now also sets EnvironmentVariables.PATH as a second layer of protection

So the example no longer depends on launchd inheriting a friendly shell PATH, which was the weak spot in the first draft.

Thanks — this was a real issue, not just wording.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eabc7cf636

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# (for example nvm, pnpm, Homebrew on Apple Silicon, or a custom prefix).
OPENCLAW_BIN="${OPENCLAW_BIN:-/opt/homebrew/bin/openclaw}"

if "$OPENCLAW_BIN" health --json >/dev/null 2>&1; then

[P2] Use a failing health predicate in watchdog example

The watchdog gate uses openclaw health --json, but that command treats successful RPC reachability as success and does not fail on degraded channel state, so a gateway that is still responsive but unhealthy will exit 0 and skip the recovery block. In practice this means the documented “alive but unhealthy” watchdog path won’t run for common degraded-but-responsive cases, only for hard failures like unreachable/timeouts.

Useful? React with 👍 / 👎.

@gitshrimp gitshrimp force-pushed the docs/mac-watchdog-recovery branch from eabc7cf to fc7424c Compare March 19, 2026 22:35

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc7424cccb

Comment on lines +133 to +136
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>

[P2] Preserve profile env in the watchdog LaunchAgent example

This LaunchAgent only forwards PATH, so the watchdog's openclaw health / doctor / gateway restart calls will always use the default config and service. That breaks the non-default-profile setups this page already acknowledges (ai.openclaw.<profile> earlier in the file): the watchdog will probe or restart the wrong gateway, and src/cli/daemon-cli/status.print.ts:129-138 already treats profile/state-dir mismatches as a real service mismatch. Please show OPENCLAW_PROFILE (and, if relevant, OPENCLAW_CONFIG_PATH) propagation here so the example targets the intended service.
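
A watchdog script can also defend against this mismatch itself by refusing to act when its environment does not name the profile it is meant to guard. The sketch below is an illustration under assumptions, not the PR's text: `OPENCLAW_PROFILE` is the variable named in this comment, while `WATCHDOG_EXPECT_PROFILE` and the function `check_watchdog_env` are hypothetical names introduced here.

```shell
# Hypothetical guard (not from the PR): stop before remediation if the
# environment does not match the profile this watchdog was installed for.
check_watchdog_env() {
  # WATCHDOG_EXPECT_PROFILE is an invented setting; empty means the default profile.
  local expected="${WATCHDOG_EXPECT_PROFILE:-}"
  local actual="${OPENCLAW_PROFILE:-}"
  if [ "$expected" != "$actual" ]; then
    printf 'watchdog: OPENCLAW_PROFILE is "%s", expected "%s"; refusing to act\n' \
      "$actual" "$expected" >&2
    return 1
  fi
}
```

Both variables would have to be set in the LaunchAgent's `EnvironmentVariables` for the guard to pass on a profile-specific install, which makes a forgotten variable show up as an error rather than as a restart of the wrong Gateway.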

@gitshrimp
Author

Quick follow-up after the latest push:

  • rebased onto current main
  • resolved the docs/install/updating.md conflict against the newer update docs layout
  • kept the launchd PATH hardening from the previous fix (OPENCLAW_BIN + explicit EnvironmentVariables.PATH)
  • current docs checks are passing on fc7424cccb

So at this point the PR should be back to a clean docs-only state and ready for re-review. If there's still anything unclear or too operator-footgun-y in the runbook, I can tighten it further.

@gitshrimp
Author

@codex review

@gitshrimp
Author

Docs checks are green and the PR is now cleanly rebased on current main.

If a maintainer has a minute for a quick docs review, this should be ready to merge.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc7424cccb

Comment on lines +107 to +108
if "$OPENCLAW_BIN" health --json >/dev/null 2>&1; then
exit 0

[P2] Do not gate the watchdog on health --json exit status

This check never catches the "process is alive but unhealthy" case the new section is trying to cover. openclaw health --json only exits non-zero when the gateway RPC itself is unreachable or times out; src/commands/health.ts explicitly treats channel probe failures as non-fatal after a successful RPC. In practice, if the gateway is still answering RPCs but one or more channels are wedged/stale, this if returns success and the watchdog exits before running doctor --fix or gateway restart, so operators get only a reachability watchdog rather than the advertised unhealthy-process recovery layer.
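
The difference between gating on exit status and gating on the reported state can be sketched with a stub. Everything below is illustrative: `fake_health` stands in for a probe that exits 0 whenever the RPC answers, and its JSON shape (including the `"ok": false` field) is invented for this example and is not the real output of `openclaw health --json`.

```shell
# Stub for a probe that is reachable but reports a degraded channel.
# The JSON shape is invented for illustration; it is NOT openclaw's real schema.
fake_health() {
  printf '{"reachable": true, "channels": [{"name": "example", "ok": false}]}\n'
  return 0   # the RPC answered, so the exit status is 0
}

# Gate 1: exit status only. Treats the degraded stub as healthy.
gate_on_exit_status() {
  "$@" >/dev/null 2>&1
}

# Gate 2: exit status plus payload. Fails if the output reports "ok": false.
gate_on_payload() {
  local out
  out="$("$@" 2>/dev/null)" || return 1
  case "$out" in
    *'"ok": false'*) return 1 ;;
  esac
  return 0
}
```

With the stub, `gate_on_exit_status fake_health` succeeds while `gate_on_payload fake_health` fails, which is the gap this comment describes. A real watchdog would need to key off a field the CLI actually documents, and would be better served by a JSON parser than by string matching.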

@vincentkoc added the triage: low-signal-docs label (Candidate: docs-only change looks low signal; maintainer review needed.) on Apr 26, 2026
@clawsweeper
Contributor

clawsweeper Bot commented Apr 28, 2026

Codex review: found issues before merge.

Summary
This PR adds an optional macOS LaunchAgent watchdog recovery runbook and links it from the update rollback docs.

Reproducibility: Do we have a high-confidence way to reproduce the issue? Yes, by static inspection of the PR diff against current main: healthCommand keeps exit success after Gateway RPC reachability, and managed Gateway services preserve profile/config/state variables that the sample LaunchAgent omits.

Next step before merge
The remaining blockers are concrete, but choosing whether to narrow this docs runbook or wait for the broader rescue-watchdog work needs maintainer/product judgment.

Security
Cleared: The PR is documentation-only and does not modify repository code execution, CI, dependencies, lockfiles, workflows, package metadata, or committed executable scripts.

Review findings

  • [P2] Align the watchdog gate with the promised failure mode — docs/platforms/mac/bundled-gateway.md:107-109
  • [P2] Carry the managed service environment into the watchdog — docs/platforms/mac/bundled-gateway.md:133-137
Review details

Best possible solution:

Revise the runbook to match supported contracts: either document a reachability-only LaunchAgent with profile/config/state propagation, or defer this guidance to the canonical rescue-watchdog work.

Do we have a high-confidence way to reproduce the issue?

Yes, by static inspection of the PR diff against current main: healthCommand keeps exit success after Gateway RPC reachability, and managed Gateway services preserve profile/config/state variables that the sample LaunchAgent omits.

Is this the best way to solve the issue?

No, not as currently written; the PATH hardening is useful, but the narrow maintainable path is to align the predicate and environment example with current contracts or wait for the broader rescue-watchdog design.

Full review comments:

  • [P2] Align the watchdog gate with the promised failure mode — docs/platforms/mac/bundled-gateway.md:107-109
    The new section says this covers process-alive-but-unhealthy cases, but openclaw health --json succeeds whenever the Gateway RPC is reachable and reports channel issues as nonfatal. This makes the watchdog skip remediation for degraded-but-responsive gateways; either frame the example as reachability-only or use a predicate that fails for the intended state.
    Confidence: 0.9
  • [P2] Carry the managed service environment into the watchdog — docs/platforms/mac/bundled-gateway.md:133-137
    The LaunchAgent example only forwards PATH, while managed Gateway services preserve OPENCLAW_PROFILE, OPENCLAW_STATE_DIR, and OPENCLAW_CONFIG_PATH. For ai.openclaw.<profile> services, this can make the watchdog probe or restart the default Gateway instead of the intended one.
    Confidence: 0.84

Overall correctness: patch is incorrect
Overall confidence: 0.88

Acceptance criteria:

  • pnpm docs:list
  • pnpm check:docs

What I checked:

  • PR discussion addressed the initial PATH footgun: Greptile flagged bare openclaw under launchd; the author followed up by adding OPENCLAW_BIN and an explicit EnvironmentVariables.PATH in PR head fc7424cccb79f13a687bea18082eeae411837d0a, so the remaining blockers are not that original PATH issue. Public docs: docs/platforms/mac/bundled-gateway.md. (docs/platforms/mac/bundled-gateway.md:93, fc7424cccb79)
  • Current main does not already include the proposed runbook: The macOS gateway doc still ends the gateway smoke check at the existing Related section; targeted search found no system-level recovery watchdog, unhealthy but still running, OPENCLAW_BIN, rescue-watchdog, or rescueWatchdog runbook in current main. Public docs: docs/platforms/mac/bundled-gateway.md. (docs/platforms/mac/bundled-gateway.md:75, 9f2f75ff0293)
  • Health exit status is reachability-based: healthCommand calls the Gateway health RPC, then sets fatal = false with the explicit comment that channel issues are reported but not fatal; a responsive-but-degraded gateway can therefore make openclaw health --json exit successfully. (src/commands/health.ts:562, 9f2f75ff0293)
  • Health docs define nonzero exit narrowly: The user-facing health docs say openclaw health exits nonzero if the gateway is unreachable or the probe fails or times out, which does not cover every process-alive-but-unhealthy state described by the PR text. Public docs: docs/gateway/health.md. (docs/gateway/health.md:58, 9f2f75ff0293)
  • Managed Gateway services preserve profile and state/config environment: buildServiceEnvironment forwards OPENCLAW_PROFILE, and the common service environment forwards OPENCLAW_STATE_DIR and OPENCLAW_CONFIG_PATH; the PR's LaunchAgent example only adds PATH, so it can target the default service for profile-specific installs. (src/daemon/service-env.ts:409, 9f2f75ff0293)
  • Profile/config mismatch is an existing service problem: Status output treats CLI and service config divergence as a likely profile/state-dir mismatch and tells users to reinstall from the intended profile or OPENCLAW_STATE_DIR, supporting the need for the watchdog example to propagate those variables. (src/cli/daemon-cli/status.print.ts:141, 9f2f75ff0293)

Likely related people:

  • Peter Steinberger: Recent blame/log evidence ties him to the current macOS gateway docs, health command, service environment, status mismatch messaging, and update docs that define the contracts this PR must match. (role: recent maintainer and adjacent owner; confidence: medium; commits: edb7e00721fd, 3147efbed40b; files: docs/platforms/mac/bundled-gateway.md, docs/install/updating.md, docs/gateway/health.md)

Remaining risk / open question:

  • The documented watchdog can imply process-alive unhealthy recovery while its predicate only catches Gateway RPC unreachable/timeout cases.
  • Profile-specific users can probe, repair, or restart the wrong Gateway if the sample LaunchAgent does not propagate profile/config/state environment.
  • The docs overlap with open rescue-watchdog PRs "feat: add external rescue watchdog" (#40791) and "Rescue: add watchdog core service and cron engine" (#46502), so maintainers should avoid publishing parallel recovery guidance that later conflicts with the supported design.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 9f2f75ff0293.

@clawsweeper clawsweeper Bot mentioned this pull request Apr 29, 2026