Problem
When the OpenClaw gateway crashes or enters a restart loop due to a bad config change, there is currently no built-in recovery mechanism. The user has to:
- Notice the gateway is down
- Manually diagnose the cause
- Restore a previous config by hand
- Restart manually
This is especially painful on headless servers where the user is interacting via mobile (Telegram/WhatsApp).
Proposed solution
Three small, composable features:
1. Restart countdown notifications
Before any gateway restart, send a channel notification with a countdown:
🔄 Gateway restart — T-60s
🔄 Gateway restart — T-30s
🔄 Gateway restart — T-10s
🔄 Gateway restart — T-0 🚀
✅ Gateway up — agents: Research · CRM · Site Seller · System · General
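A minimal sketch of the countdown, assuming a shell wrapper around the restart. `notify` is a hypothetical stand-in for whatever sends a message to the active channel, `SLEEP_CMD` is an assumed override for dry runs, and the message text is simplified compared to the examples above.

```shell
# notify is a hypothetical stand-in for the real channel sender (Telegram, WhatsApp, ...).
notify() { printf '%s\n' "$1"; }

# countdown MARK...
# Announces each mark (seconds remaining before the restart), waiting the gap between marks.
# SLEEP_CMD is an assumed override so the function can be dry-run without waiting.
countdown() {
  local prev="" mark
  for mark in "$@"; do
    if [ -n "$prev" ]; then
      "${SLEEP_CMD:-sleep}" "$((prev - mark))"
    fi
    if [ "$mark" -eq 0 ]; then
      notify "Gateway restart: T-0"
    else
      notify "Gateway restart: T-${mark}s"
    fi
    prev="$mark"
  done
}
```

Calling `countdown 60 30 10 0` would reproduce the T-60s / T-30s / T-10s / T-0 sequence; the "Gateway up" message would be sent separately once the gateway reports healthy.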
2. Automatic config backup
Before every restart, snapshot openclaw.json to a rotating backup directory (keep the last N snapshots). Name each snapshot good-config-<timestamp>.json. On crash-loop detection, automatically roll back to the last known good config.
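The snapshot-and-rotate step could be sketched as follows. The function names, the argument order, the default of 5 kept snapshots, and the timestamp format are all assumptions for illustration; only the good-config-<timestamp>.json naming comes from the proposal.

```shell
# snapshot_config CONFIG BACKUP_DIR [KEEP] [STAMP]
# Copies CONFIG to BACKUP_DIR/good-config-<timestamp>.json, then prunes all but
# the newest KEEP snapshots. STAMP is an assumed override; it defaults to UTC now.
snapshot_config() {
  local config="$1" dir="$2" keep="${3:-5}" stamp
  stamp="${4:-$(date -u +%Y%m%dT%H%M%SZ)}"
  mkdir -p "$dir"
  cp -p "$config" "$dir/good-config-$stamp.json"
  # Timestamped names sort chronologically, so reverse name order is newest first;
  # everything after the first KEEP lines is an old snapshot to delete.
  ls -1 "$dir"/good-config-*.json | sort -r | tail -n "+$((keep + 1))" |
    while IFS= read -r old; do rm -f -- "$old"; done
}

# latest_snapshot BACKUP_DIR
# Prints the newest snapshot path, i.e. the rollback candidate (empty if none exist).
latest_snapshot() {
  ls -1 "$1"/good-config-*.json 2>/dev/null | sort -r | head -n 1
}
```

Sorting by file name rather than by mtime keeps the rotation stable even if a snapshot is touched or copied later.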
3. Crash-loop watchdog
A short-lived watchdog (runs ~90s post-restart, then exits). If ≥3 restarts occur within the window:
- Save the bad config for diagnostics
- Restore last known good config automatically
- Restart the gateway
- Notify the user what happened and which config was restored
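The detection and rollback steps above can be sketched like this. It assumes a start log with one epoch timestamp per gateway start (for example, appended by a start hook via `date +%s`); that log, the function names, the bad-config-<epoch>.json name, and the return codes are hypothetical. Restarting and notifying are left to the caller.

```shell
# count_recent_restarts LOG NOW WINDOW
# LOG holds one epoch timestamp per gateway start (an assumed format).
# Prints how many starts fall within the last WINDOW seconds before NOW.
count_recent_restarts() {
  [ -f "$1" ] || { echo 0; return 0; }
  awk -v now="$2" -v win="$3" \
    '$1 >= now - win && $1 <= now { n++ } END { print n + 0 }' "$1"
}

# recover_if_looping LOG CONFIG BACKUP_DIR [THRESHOLD] [WINDOW]
# Returns 0 after a rollback (printing the restored snapshot path),
# 1 if no crash loop was detected, 2 if there is no snapshot to restore.
recover_if_looping() {
  local log="$1" config="$2" dir="$3" threshold="${4:-3}" window="${5:-90}" now good
  now="$(date +%s)"
  [ "$(count_recent_restarts "$log" "$now" "$window")" -ge "$threshold" ] || return 1
  good="$(ls -1 "$dir"/good-config-*.json 2>/dev/null | sort -r | head -n 1)"
  [ -n "$good" ] || return 2
  cp -p "$config" "$dir/bad-config-$now.json"   # keep the bad config for diagnostics
  cp -p "$good" "$config"                        # restore the last known good config
  printf '%s\n' "$good"                          # caller restarts and reports this path
}
```

A watchdog run would poll `recover_if_looping` for roughly 90 seconds after a restart and exit either after the first rollback or when the window closes.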
Reference implementation
We built this as shell scripts that work well in production on a headless Hetzner VPS (Ubuntu 24.04). Key insight: the watchdog needs to be short-lived (not a daemon), and the restart needs to be scheduled with a small delay (via at) so the exec session can return before the kill happens. Otherwise you kill the session that is running the restart command.
Happy to share the scripts as a starting point for a proper implementation.
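For illustration, the delayed restart via at might look like the sketch below. It assumes the gateway runs as a systemd unit restarted with `systemctl restart`; the unit name is hypothetical, and `AT_CMD` is an assumed override for dry runs. Note that at schedules in whole minutes, so the smallest relative delay is one minute.

```shell
# schedule_restart UNIT [DELAY_MINUTES]
# Hands the restart command to at(1) so it runs outside the calling exec session,
# which lets that session return before the gateway is killed.
# UNIT is a hypothetical systemd unit name; AT_CMD is an assumed dry-run override.
schedule_restart() {
  local unit="$1" delay="${2:-1}"
  printf 'systemctl restart %s\n' "$unit" | "${AT_CMD:-at}" now + "$delay" minutes
}
```

For example, `schedule_restart openclaw-gateway 1` would queue the restart one minute out, assuming a unit by that name exists and the at daemon is running.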
Why it matters
For always-on, mobile-first setups (the core OpenClaw use case) this is table stakes. The gateway should be self-healing, not something the user has to babysit from their phone.