Skip to content

[Feature] Graduated crash recovery ladder for gateway #65815

@smonett

Description

@smonett

Problem

When the gateway enters a crash loop (repeated failures within minutes), the current recovery is limited to KeepAlive: true in launchd/systemd, which restarts the same process. If the root cause is a bad config, corrupted state, or a version-specific bug, the gateway crash-loops indefinitely with no escalation.

The channel health monitor (gateway.channelHealthCheckMinutes) handles channel-level issues but doesn't address gateway-level crash loops.

Evidence

  • Crash loops where the gateway failed repeatedly due to: bad config that passed JSON validation but caused runtime errors, corrupted auth-state.json after interrupted writes, and version-specific bugs after updates.
  • In each case, manual intervention was required.

Workaround

I built oc-crash-guard, a ~200-line bash script that runs every 5 minutes via launchd and implements a graduated recovery ladder:

  • Layer 0: Auth-profile stale block auto-heal (every healthy check)
  • Layer 1: Config revert to last-known-good (after 3 consecutive health failures)
  • Layer 2: Gateway reinstall via npm (if config revert didn't help after 2 more checks)
  • Layer 3: Version rollback to last-good version (if reinstall didn't help)

State is tracked in a JSON file with consecutive failure count and last recovery action. A maintenance mode flag suppresses checks during intentional updates (auto-expires after 15 minutes).

Proposed Solution

Add a native graduated recovery system:

  1. Track consecutive health failures in gateway state
  2. Escalate through recovery tiers:
    • Tier 1: Restart gateway (existing KeepAlive behavior)
    • Tier 2: Revert config to last-good backup
    • Tier 3: Clear potentially corrupted state files (auth-state.json, session cache)
    • Tier 4: Alert user via configured channel that manual intervention is needed
  3. Maintenance mode: Skip recovery during intentional updates
  4. Configurable thresholds for escalation
{
  "gateway": {
    "crashRecovery": {
      "enabled": true,
      "escalationThresholds": [3, 5, 8],
      "maintenanceTimeoutMinutes": 15,
      "alertOnEscalation": true
    }
  }
}

Impact

Medium. Prevents extended unattended downtime. Especially valuable for headless/remote deployments (Raspberry Pi, VPS) where SSH access may not be immediate.

Environment

  • OpenClaw 2026.4.10 (npm, macOS)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions