docs: add optional macOS gateway watchdog recovery runbook (AI-assisted)#44441
gitshrimp wants to merge 2 commits into openclaw:main
Greptile Summary
This PR adds optional documentation for an operator-managed macOS launchd watchdog that monitors Gateway health and attempts recovery when the process is alive but unhealthy. The scope is correctly limited to docs, clearly framed as optional, and the recovery order (
Key findings:
Confidence Score: 3/5
Prompt To Fix All With AI
This is a comment left during a code review.
Path: docs/platforms/mac/bundled-gateway.md
Line: 93-102
Comment:
**`openclaw` may not be on launchd's PATH**
When launchd executes this script, its default `PATH` is very restricted (typically `/usr/bin:/bin:/usr/sbin:/sbin`). npm global binaries are usually installed in locations like `/usr/local/bin`, `~/.npm-global/bin`, or inside an nvm/nodenv shim directory — none of which are in launchd's default PATH.
As written, every `openclaw` call will produce a "command not found" error (exit code 127). Because the remediation lines all end in `|| true`, those failures are silently swallowed. The watchdog will appear to run successfully (launchd shows it as active) but will do absolutely nothing, leaving operators with a false sense of security.
The fix is to either resolve the binary path explicitly in the script, or inject it via the plist's `EnvironmentVariables`:
**Option A – resolve path at the top of the script:**
```suggestion
#!/bin/zsh
set -euo pipefail
# Resolve openclaw binary — adjust if you installed via nvm, nodenv, pnpm, etc.
OPENCLAW_BIN="${OPENCLAW_BIN:-/usr/local/bin/openclaw}"
if "$OPENCLAW_BIN" health --json >/dev/null 2>&1; then
  exit 0
fi
"$OPENCLAW_BIN" doctor >/dev/null 2>&1 || true
"$OPENCLAW_BIN" doctor --fix >/dev/null 2>&1 || true
"$OPENCLAW_BIN" gateway restart >/dev/null 2>&1 || true
```
**Option B – extend PATH inside the plist** (add before `StartInterval`):
```xml
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
</dict>
```
At minimum the docs should warn readers to verify (e.g. `which openclaw`) and substitute the correct absolute path for their install.
How can I resolve this? If you propose a fix, please make it concise.
Last reviewed commit: fe7ef93
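The silent-failure mode described in the comment above can be made visible by checking the binary before using it. The following is a minimal sketch, not the PR's script: `resolve_bin` is a hypothetical helper name, and the path passed to it is whatever the operator verified with `which openclaw`.

```shell
# Hypothetical helper: print the path if it is an executable file,
# otherwise complain on stderr and return 127 (the "command not found" code).
resolve_bin() {
  candidate="$1"
  if [ -x "$candidate" ]; then
    printf '%s\n' "$candidate"
    return 0
  fi
  echo "watchdog: $candidate is missing or not executable" >&2
  return 127
}
```

A watchdog could call `OPENCLAW_BIN="$(resolve_bin "${OPENCLAW_BIN:-/usr/local/bin/openclaw}")" || exit 127` before the health check, so a wrong path shows up as a failed launchd run instead of being swallowed by `|| true`.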
    #!/bin/zsh
    set -euo pipefail

    if openclaw health --json >/dev/null 2>&1; then
      exit 0
    fi

    openclaw doctor >/dev/null 2>&1 || true
    openclaw doctor --fix >/dev/null 2>&1 || true
    openclaw gateway restart >/dev/null 2>&1 || true
Force-pushed from fe7ef93 to 500930c.
Follow-up after local validation:
So this PR is intentionally not presenting the watchdog as a cross-platform default; it documents an optional operator-managed runbook for a narrower macOS local-service case.

Good catch: launchd's default PATH was the real footgun here, and the original example could indeed fail quietly while still looking "installed". I updated the docs to make that failure mode much less likely:
So the example no longer depends on launchd inheriting a friendly shell PATH, which was the weak spot in the first draft. Thanks, this was a real issue, not just wording.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eabc7cf636
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    # (for example nvm, pnpm, Homebrew on Apple Silicon, or a custom prefix).
    OPENCLAW_BIN="${OPENCLAW_BIN:-/opt/homebrew/bin/openclaw}"

    if "$OPENCLAW_BIN" health --json >/dev/null 2>&1; then
Use a failing health predicate in watchdog example
The watchdog gate uses openclaw health --json, but that command treats successful RPC reachability as success and does not fail on degraded channel state, so a gateway that is still responsive but unhealthy will exit 0 and skip the recovery block. In practice this means the documented “alive but unhealthy” watchdog path won’t run for common degraded-but-responsive cases, only for hard failures like unreachable/timeouts.
Force-pushed from eabc7cf to fc7424c.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fc7424cccb
    <key>EnvironmentVariables</key>
    <dict>
      <key>PATH</key>
      <string>/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
Preserve profile env in the watchdog LaunchAgent example
This LaunchAgent only forwards PATH, so the watchdog's openclaw health / doctor / gateway restart calls will always use the default config and service. That breaks the non-default-profile setups this page already acknowledges (ai.openclaw.<profile> earlier in the file): the watchdog will probe or restart the wrong gateway, and src/cli/daemon-cli/status.print.ts:129-138 already treats profile/state-dir mismatches as a real service mismatch. Please show OPENCLAW_PROFILE (and, if relevant, OPENCLAW_CONFIG_PATH) propagation here so the example targets the intended service.
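One way to picture the propagation this comment asks for is a small wrapper that sets the profile variable for a single command. This is a hedged sketch: `run_with_profile` is a hypothetical helper, and it assumes, as the comment states, that the CLI selects its profile from `OPENCLAW_PROFILE`. The same variable could instead be set in the plist's `EnvironmentVariables` dictionary alongside `PATH`.

```shell
# Hypothetical helper: run a command with OPENCLAW_PROFILE set for that
# command only, leaving the caller's environment untouched.
# Usage: run_with_profile PROFILE COMMAND [ARGS...]
run_with_profile() {
  profile="$1"
  shift
  env OPENCLAW_PROFILE="$profile" "$@"
}
```

For example, `run_with_profile work "$OPENCLAW_BIN" health --json` would probe the gateway for a profile named `work` (an illustrative name) rather than the default one.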
Quick follow-up after the latest push:
So at this point the PR should be back to a clean docs-only state and ready for re-review. If there's still anything unclear or too operator-footgun-y in the runbook, I can tighten it further.

@codex review

Docs checks are green and the PR is now cleanly rebased on current
If a maintainer has a minute for a quick docs review, this should be ready to merge.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fc7424cccb
    if "$OPENCLAW_BIN" health --json >/dev/null 2>&1; then
      exit 0
Do not gate the watchdog on health --json exit status
This check never catches the "process is alive but unhealthy" case the new section is trying to cover. openclaw health --json only exits non-zero when the gateway RPC itself is unreachable or times out; src/commands/health.ts explicitly treats channel probe failures as non-fatal after a successful RPC. In practice, if the gateway is still answering RPCs but one or more channels are wedged/stale, this if returns success and the watchdog exits before running doctor --fix or gateway restart, so operators get only a reachability watchdog rather than the advertised unhealthy-process recovery layer.
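If the exit status only reflects reachability, a stricter gate has to look at the JSON body itself. The sketch below shows the shape of such a predicate under a loud assumption: the top-level `"ok": true` field is hypothetical and is not taken from the real `openclaw health --json` schema, which would need to be checked before adapting this. `health_json_ok` is likewise a made-up helper name.

```shell
# Hypothetical predicate: read health JSON on stdin and succeed only if it
# contains "ok": true. The field name is an assumption, not the real schema.
# A pattern match is crude; a JSON parser such as jq would be more robust.
health_json_ok() {
  grep -Eq '"ok"[[:space:]]*:[[:space:]]*true'
}
```

The watchdog gate would then read `if "$OPENCLAW_BIN" health --json 2>/dev/null | health_json_ok; then exit 0; fi`, so a reachable gateway that reports a degraded state still falls through to the recovery steps.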
Codex review: found issues before merge.
Summary
Reproducibility: Do we have a high-confidence way to reproduce the issue? Yes, by static inspection of the PR diff against current main:
Next step before merge
Security
Review findings
Review details
Best possible solution: Revise the runbook to match supported contracts: either document a reachability-only LaunchAgent with profile/config/state propagation, or defer this guidance to the canonical rescue-watchdog work.
Do we have a high-confidence way to reproduce the issue? Yes, by static inspection of the PR diff against current main:
Is this the best way to solve the issue? No, not as currently written; the PATH hardening is useful, but the narrow maintainable path is to align the predicate and environment example with current contracts or wait for the broader rescue-watchdog design.
Full review comments:
Overall correctness: patch is incorrect
Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 9f2f75ff0293.
Summary
Why
OpenClaw already relies on launchd/systemd for normal crash restart behavior. This PR does not change product behavior. It adds docs for advanced operators who want a separate, conservative recovery layer for cases where the Gateway process is still present but no longer healthy.
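The conservative layering this PR describes (`openclaw doctor`, then `openclaw doctor --fix`, then `openclaw gateway restart`) can be sketched as a loop that re-checks health before each escalation and stops as soon as the check passes. This is an illustrative sketch rather than the runbook's script: `run_recovery` is a hypothetical helper, and the check and steps are passed in as command strings.

```shell
# Hypothetical helper: run_recovery CHECK STEP...
# Re-runs CHECK before every STEP and stops escalating once it passes.
# Failed steps are logged to stderr instead of being silently swallowed.
run_recovery() {
  check="$1"
  shift
  for step in "$@"; do
    if sh -c "$check" >/dev/null 2>&1; then
      return 0
    fi
    echo "watchdog: running: $step" >&2
    sh -c "$step" >/dev/null 2>&1 || echo "watchdog: step failed: $step" >&2
  done
  # Final verdict: succeed only if the check passes after the last step.
  sh -c "$check" >/dev/null 2>&1
}
```

With this shape, a gateway that recovers after `doctor --fix` is never restarted, and a run that ends unhealthy exits non-zero so the failure is visible in launchd's job status.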
Scope / guardrails
openclaw doctor -> openclaw doctor --fix -> openclaw gateway restart
Testing
AI assistance
This PR was AI-assisted. I reviewed and edited the final text before opening.