Problem
When openclaw.json is modified with invalid values, the gateway crashes on reload/restart and enters a crash loop. There is no rollback mechanism, no pre-apply validation that catches unresolvable env vars, and no fallback to a known-good state.
This is the #1 cause of gateway outages for users who have AI agents modifying their config — which is most OpenClaw users.
Real Incident (2026-02-16)
- A sub-agent replaced API keys in openclaw.json with ${ENV_VAR} references (e.g. ${OPENAI_API_KEY})
- The gateway restarted, but the env vars were not in the process environment
- All API calls failed because the keys were sent as literal strings such as "${OPENAI_API_KEY}"
- The gateway entered a crash loop requiring manual intervention to restore config
- In a separate incident, 20 Antfarm agents were added to agents.list, hijacking Telegram message routing
The meta-problem: AI agents modifying config files are the most common cause of crash loops. Agents don't always understand the runtime environment, and a single bad write can take down the entire gateway with no automated recovery.
Current State
OpenClaw already has:
- Config validation via Zod schema (validation.ts)
- Backup rotation (backup-rotation.ts, keeps 5 .bak files)
- Env var substitution (env-substitution.ts)
What's missing:
- Pre-write validation that checks env var references actually resolve
- A "last-known-good" config that's only updated after successful gateway startup
- Crash-loop detection with automatic rollback
- A dry-run/diff mode for config changes
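The first missing piece, checking that env var references actually resolve, could look roughly like the sketch below. It is illustrative only: findUnresolvedEnvRefs and collectEnvRefs are hypothetical names, not functions in env-substitution.ts, and the ${VAR} pattern is assumed to match the syntax shown in the incident above.

```typescript
// Hypothetical helpers; not part of OpenClaw's env-substitution.ts.
type Env = Record<string, string | undefined>;

/** Recursively collect every ${VAR} name referenced in string values. */
function collectEnvRefs(value: unknown, found: Set<string> = new Set<string>()): Set<string> {
  if (typeof value === "string") {
    // Fresh regex per string so the global flag's lastIndex state never leaks.
    const pattern = /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(value)) !== null) {
      found.add(match[1]);
    }
  } else if (Array.isArray(value)) {
    for (const item of value) collectEnvRefs(item, found);
  } else if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    for (const key of Object.keys(record)) collectEnvRefs(record[key], found);
  }
  return found;
}

/** Return the referenced variables that are unset or empty in `env`, sorted by name. */
function findUnresolvedEnvRefs(config: unknown, env: Env): string[] {
  return Array.from(collectEnvRefs(config))
    .filter((name) => !env[name])
    .sort();
}
```

A writer would call findUnresolvedEnvRefs(parsedConfig, process.env) before saving and refuse the write if the returned list is non-empty. The check has to run against the environment of the gateway process, not the agent's shell, since that mismatch is what caused the incident.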
Proposed Solution
Phase 1: Last-Known-Good + Crash-Loop Revert (highest impact, simplest)
- On successful gateway startup (healthy for >10s), save config as openclaw.json.last-known-good
- On crash loop (>3 crashes in 60s), auto-revert to last-known-good
- Log the failed config for debugging
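The Phase 1 thresholds can be sketched as a small sliding-window detector. This is a minimal illustration under stated assumptions: CrashLoopDetector and isStableStartup are hypothetical names, and the defaults (more than 3 crashes in 60s, healthy for more than 10s) are taken from the bullets above.

```typescript
// Hypothetical sketch; names and structure are assumptions, not OpenClaw code.
class CrashLoopDetector {
  private crashTimes: number[] = [];
  private readonly maxCrashes: number;
  private readonly windowMs: number;

  constructor(maxCrashes: number = 3, windowMs: number = 60000) {
    this.maxCrashes = maxCrashes;
    this.windowMs = windowMs;
  }

  /**
   * Record a crash at `nowMs` (milliseconds, any monotonic origin).
   * Returns true when more than `maxCrashes` crashes fall inside the window,
   * i.e. when the supervisor should revert to the last-known-good config.
   */
  recordCrash(nowMs: number): boolean {
    this.crashTimes.push(nowMs);
    // Keep only crashes that are still inside the sliding window.
    this.crashTimes = this.crashTimes.filter((t) => nowMs - t <= this.windowMs);
    return this.crashTimes.length > this.maxCrashes;
  }

  /** Forget crash history, e.g. after a revert or a stable startup. */
  reset(): void {
    this.crashTimes = [];
  }
}

/** True once the gateway has stayed healthy long enough to promote its config. */
function isStableStartup(healthyForMs: number, minHealthyMs: number = 10000): boolean {
  return healthyForMs > minHealthyMs;
}
```

When recordCrash returns true, the supervisor would copy the failing openclaw.json aside for debugging, restore openclaw.json.last-known-good over it, and call reset(). The last-known-good file should be overwritten only after isStableStartup passes, so a config that crashes on boot can never replace it.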
Phase 2: Validation + Dry-Run
- openclaw config diff: show pending changes
- openclaw config apply --dry-run: validate without applying
- Pre-write check that all ${VAR} references resolve to actual env values
- API key format validation (correct prefixes, not just non-empty)
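As a rough idea of what openclaw config diff could compute, the sketch below flattens two parsed configs into dotted paths and reports which paths were added, removed, or changed. All names here (flattenConfig, diffConfig, ConfigChange) are hypothetical, and the config keys in the usage note are made up for illustration, apart from agents.list, which appears in the incident above.

```typescript
// Hypothetical sketch of a structural config diff; not an existing OpenClaw API.
type ConfigChange = { path: string; kind: "added" | "removed" | "changed" };

/** Flatten nested objects and arrays into "a.b.0" style paths with JSON-encoded leaves. */
function flattenConfig(
  value: unknown,
  prefix: string = "",
  out: Map<string, string> = new Map<string, string>(),
): Map<string, string> {
  const isContainer = value !== null && typeof value === "object";
  const keys = isContainer ? Object.keys(value as object) : [];
  if (!isContainer || keys.length === 0) {
    // Primitives and empty containers are leaves.
    out.set(prefix, JSON.stringify(value));
    return out;
  }
  for (const key of keys) {
    const child = (value as Record<string, unknown>)[key];
    flattenConfig(child, prefix === "" ? key : prefix + "." + key, out);
  }
  return out;
}

/** Compare two parsed configs and list changed paths in sorted order. */
function diffConfig(before: unknown, after: unknown): ConfigChange[] {
  const oldLeaves = flattenConfig(before);
  const newLeaves = flattenConfig(after);
  const allPaths = Array.from(
    new Set(Array.from(oldLeaves.keys()).concat(Array.from(newLeaves.keys()))),
  ).sort();
  const changes: ConfigChange[] = [];
  for (const path of allPaths) {
    if (!oldLeaves.has(path)) changes.push({ path, kind: "added" });
    else if (!newLeaves.has(path)) changes.push({ path, kind: "removed" });
    else if (oldLeaves.get(path) !== newLeaves.get(path)) changes.push({ path, kind: "changed" });
  }
  return changes;
}
```

The result deliberately carries paths and change kinds but no values, so a diff can be logged or shown to an agent without leaking API keys. For example, appending an entry to agents.list shows up as an added path such as agents.list.1, which would have made the routing hijack visible before the change was applied.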
Phase 3: Atomic Transactions
- openclaw config apply --rollback-on-failure: auto-revert if startup fails within N seconds
- Transactional config changes: fully applied or fully rolled back
- Env var resolution checking at write time
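The "fully applied or fully rolled back" behavior might be structured as in the sketch below. It is a sketch under assumptions, not a proposed implementation: ConfigStore, RestartFn, and applyWithRollback are hypothetical, and the restart callback is assumed to resolve to true only if the gateway becomes healthy within the N-second window.

```typescript
// Hypothetical sketch; the interfaces are assumptions, not OpenClaw APIs.
interface ConfigStore {
  read(): string;
  write(contents: string): void;
}

/** Restarts the gateway and resolves true if it became healthy within the window. */
type RestartFn = () => Promise<boolean>;

type ApplyResult = "applied" | "rolled-back";

async function applyWithRollback(
  store: ConfigStore,
  nextContents: string,
  restart: RestartFn,
): Promise<ApplyResult> {
  // Snapshot the current config before touching anything.
  const previous = store.read();
  store.write(nextContents);

  let healthy = false;
  try {
    healthy = await restart();
  } catch {
    // A restart that throws is treated the same as an unhealthy startup.
    healthy = false;
  }
  if (healthy) return "applied";

  // Startup failed: restore the snapshot and bring the gateway back up on it.
  store.write(previous);
  try {
    await restart();
  } catch {
    // Nothing further to try in this sketch; the caller sees "rolled-back".
  }
  return "rolled-back";
}
```

Injecting the store and the restart function keeps the transaction logic testable without a running gateway. A real version would also want the write itself to be atomic (write to a temporary file, then rename) so a crash mid-write cannot leave a truncated openclaw.json.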
Why This Matters
Every OpenClaw user running AI agents (sub-agents, Antfarm workflows, cron-based automation) is one bad config write away from a crash loop that requires SSH access to fix. The gateway should be self-healing, not fragile.
The existing .bak rotation helps with accidental overwrites but does NOT help with crash loops — all 5 backups can be bad configs if the agent keeps writing.