[Feature] Graduated crash recovery ladder for gateway

## Problem

When the gateway enters a crash loop (repeated failures within minutes), the current recovery is limited to `KeepAlive: true` in launchd/systemd, which restarts the same process. If the root cause is a bad config, corrupted state, or a version-specific bug, the gateway crash-loops indefinitely with no escalation.

The channel health monitor (`gateway.channelHealthCheckMinutes`) handles channel-level issues but doesn't address gateway-level crash loops.

## Evidence

- Crash loops where the gateway failed repeatedly due to: bad config that passed JSON validation but caused runtime errors, corrupted auth-state.json after interrupted writes, and version-specific bugs after updates.
- In each case, manual intervention was required.

## Workaround

I built `oc-crash-guard`, a ~200-line bash script that runs every 5 minutes via launchd and implements a graduated recovery ladder:

- **Layer 0:** Auth-profile stale block auto-heal (every healthy check)
- **Layer 1:** Config revert to last-known-good (after 3 consecutive health failures)
- **Layer 2:** Gateway reinstall via npm (if config revert didn't help after 2 more checks)
- **Layer 3:** Version rollback to last-good version (if reinstall didn't help)

State is tracked in a JSON file with consecutive failure count and last recovery action. A maintenance mode flag suppresses checks during intentional updates (auto-expires after 15 minutes).

## Proposed Solution

Add a native graduated recovery system:

1. **Track consecutive health failures** in gateway state
2. **Escalate through recovery tiers:**
   - Tier 1: Restart gateway (existing KeepAlive behavior)
   - Tier 2: Revert config to last-good backup
   - Tier 3: Clear potentially corrupted state files (auth-state.json, session cache)
   - Tier 4: Alert user via configured channel that manual intervention is needed
3. **Maintenance mode:** Skip recovery during intentional updates
4. **Configurable thresholds** for escalation

```json
{
  "gateway": {
    "crashRecovery": {
      "enabled": true,
      "escalationThresholds": [3, 5, 8],
      "maintenanceTimeoutMinutes": 15,
      "alertOnEscalation": true
    }
  }
}
```

## Impact

Medium. Prevents extended unattended downtime. Especially valuable for headless/remote deployments (Raspberry Pi, VPS) where SSH access may not be immediate.

## Environment

- OpenClaw 2026.4.10 (npm, macOS)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Feature] Graduated crash recovery ladder for gateway #65815

Problem

Evidence

Workaround

Proposed Solution

Impact

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[Feature] Graduated crash recovery ladder for gateway #65815

Description

Problem

Evidence

Workaround

Proposed Solution

Impact

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions