Problem
When the gateway enters a crash loop (repeated failures within minutes), the only recovery today is the service manager's restart policy (KeepAlive: true in launchd, or the equivalent restart setting in systemd), which restarts the same process. If the root cause is a bad config, corrupted state, or a version-specific bug, the gateway crash-loops indefinitely with no escalation.
The channel health monitor (gateway.channelHealthCheckMinutes) handles channel-level issues but doesn't address gateway-level crash loops.
Evidence
- Observed crash loops in which the gateway failed repeatedly because of: a bad config that passed JSON validation but caused runtime errors, a corrupted auth-state.json after interrupted writes, and version-specific bugs after updates.
- In each case, manual intervention was required.
Workaround
I built oc-crash-guard, a ~200-line bash script that runs every 5 minutes via launchd and implements a graduated recovery ladder:
- Layer 0: Auth-profile stale block auto-heal (every healthy check)
- Layer 1: Config revert to last-known-good (after 3 consecutive health failures)
- Layer 2: Gateway reinstall via npm (if config revert didn't help after 2 more checks)
- Layer 3: Version rollback to last-good version (if reinstall didn't help)
State is tracked in a JSON file with consecutive failure count and last recovery action. A maintenance mode flag suppresses checks during intentional updates (auto-expires after 15 minutes).
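For illustration, the state tracking and maintenance window described above could look roughly like the following. This is a Python sketch, not the actual bash script: the field names (consecutive_failures, last_action, maintenance_since) and all function names are hypothetical, and only the 15-minute expiry comes from the description above.

```python
import json
from pathlib import Path

MAINTENANCE_TTL_SECONDS = 15 * 60  # from the description: flag auto-expires after 15 minutes


def load_state(path: Path) -> dict:
    """Read the state file, falling back to a clean state if it is missing or corrupt."""
    try:
        return json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {"consecutive_failures": 0, "last_action": None, "maintenance_since": None}


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file and rename it, so an interrupted write cannot truncate the state."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)


def in_maintenance(state: dict, now: float) -> bool:
    """True while the maintenance flag is set and has not yet expired."""
    since = state.get("maintenance_since")
    return since is not None and now - since < MAINTENANCE_TTL_SECONDS


def record_check(state: dict, healthy: bool, now: float) -> dict:
    """Return the updated state after one health check (does not mutate the input)."""
    if in_maintenance(state, now):
        return state  # checks are suppressed during intentional updates
    new_state = dict(state, maintenance_since=None)  # drop an expired flag, if any
    if healthy:
        new_state["consecutive_failures"] = 0
        new_state["last_action"] = None
    else:
        new_state["consecutive_failures"] = state.get("consecutive_failures", 0) + 1
    return new_state
```

Writing through a temp file and renaming is a deliberate choice here, since interrupted writes are one of the failure causes listed under Evidence.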
Proposed Solution
Add a native graduated recovery system:
- Track consecutive health failures in gateway state
- Escalate through recovery tiers:
  - Tier 1: Restart gateway (existing KeepAlive behavior)
  - Tier 2: Revert config to last-good backup
  - Tier 3: Clear potentially corrupted state files (auth-state.json, session cache)
  - Tier 4: Alert user via configured channel that manual intervention is needed
- Maintenance mode: Skip recovery during intentional updates
- Configurable thresholds for escalation
{
  "gateway": {
    "crashRecovery": {
      "enabled": true,
      "escalationThresholds": [3, 5, 8],
      "maintenanceTimeoutMinutes": 15,
      "alertOnEscalation": true
    }
  }
}
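One possible reading of escalationThresholds, sketched in Python: each threshold is the consecutive-failure count at which the next tier kicks in, so [3, 5, 8] would unlock Tiers 2, 3, and 4 respectively, with a plain restart below the first threshold. That mapping is my assumption, not something the config above defines, and the tier identifiers and tier_for function are hypothetical names.

```python
from bisect import bisect_right

# Tier names paraphrase the proposal's list; their pairing with thresholds is an assumption.
TIERS = ["restart", "revert_config", "clear_state", "alert_user"]


def tier_for(failures: int, thresholds: list[int]) -> str:
    """Pick the recovery tier for a given consecutive-failure count.

    Assumes reaching thresholds[i] unlocks TIERS[i + 1]; below the first
    threshold only the existing restart behavior applies.
    """
    if len(thresholds) != len(TIERS) - 1:
        raise ValueError("expected exactly one threshold per escalation step")
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in ascending order")
    # bisect_right counts how many thresholds have been reached or passed.
    return TIERS[bisect_right(thresholds, failures)]
```

With the example config, tier_for(4, [3, 5, 8]) selects the config revert, and any count of 8 or more stays at the alert tier rather than cycling back.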
Impact
Medium. Prevents extended unattended downtime. Especially valuable for headless/remote deployments (Raspberry Pi, VPS) where SSH access may not be immediate.
Environment
- OpenClaw 2026.4.10 (npm, macOS)