feat: atomic config management with validation and crash-loop rollback

## Problem

When `openclaw.json` is modified with invalid values, the gateway crashes on reload/restart and enters a crash loop. There is no rollback mechanism, no pre-apply validation that catches unresolvable env vars, and no fallback to a known-good state.

**This is the #1 cause of gateway outages** for users who have AI agents modifying their config — which is most OpenClaw users.

### Real Incident (2026-02-16)

1. A sub-agent replaced API keys in `openclaw.json` with `${ENV_VAR}` references (e.g. `${OPENAI_API_KEY}`)
2. The gateway restarted, but the env vars were not in the process environment
3. All API calls failed — keys were literal strings like `"${OPENAI_API_KEY}"`
4. The gateway entered a crash loop requiring manual intervention to restore config

In a separate incident, 20 Antfarm agents were added to `agents.list`, hijacking Telegram message routing.

The meta-problem: **AI agents modifying config files are the most common cause of crash loops.** Agents don't always understand the runtime environment, and a single bad write can take down the entire gateway with no automated recovery.

## Current State

OpenClaw already has:
- Config validation via Zod schema (`validation.ts`)
- Backup rotation (`backup-rotation.ts` — keeps 5 `.bak` files)
- Env var substitution (`env-substitution.ts`)

What's **missing**:
- Pre-write validation that checks env var references actually resolve
- A "last-known-good" config that's only updated after successful gateway startup
- Crash-loop detection with automatic rollback
- A dry-run/diff mode for config changes

## Proposed Solution

### Phase 1: Last-Known-Good + Crash-Loop Revert (highest impact, simplest)
- On successful gateway startup (healthy for >10s), save config as `openclaw.json.last-known-good`
- On crash loop (>3 crashes in 60s), auto-revert to last-known-good
- Log the failed config for debugging
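
The crash-loop rule above could be sketched as a sliding-window counter. This is a minimal illustration, not existing OpenClaw code: `CrashLoopDetector`, `recordCrash`, and the option names are hypothetical, and the defaults simply mirror the thresholds proposed here (more than 3 crashes within 60 seconds).

```typescript
// Hypothetical sketch of sliding-window crash-loop detection.
// All names are illustrative; nothing here is an existing OpenClaw API.

interface CrashLoopOptions {
  maxCrashes: number; // revert once the crash count in the window exceeds this
  windowMs: number;   // sliding-window length in milliseconds
}

class CrashLoopDetector {
  private crashTimes: number[] = [];
  private readonly opts: CrashLoopOptions;

  constructor(opts: CrashLoopOptions = { maxCrashes: 3, windowMs: 60_000 }) {
    this.opts = opts;
  }

  /** Record a crash at `now` (ms) and report whether a revert should trigger. */
  recordCrash(now: number): boolean {
    this.crashTimes.push(now);
    // Keep only the crashes that still fall inside the window.
    this.crashTimes = this.crashTimes.filter((t) => now - t < this.opts.windowMs);
    return this.crashTimes.length > this.opts.maxCrashes;
  }

  /** Clear history, e.g. once the gateway has stayed healthy long enough. */
  reset(): void {
    this.crashTimes = [];
  }
}
```

Because the gateway process itself dies on each crash, the timestamps would presumably need to live in a supervisor process or on disk rather than in gateway memory. When `recordCrash` returns `true`, the supervisor would copy `openclaw.json.last-known-good` over `openclaw.json` and keep the failed file for debugging.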

### Phase 2: Validation + Dry-Run
- `openclaw config diff` — show pending changes
- `openclaw config apply --dry-run` — validate without applying
- Pre-write check that all `${VAR}` references resolve to actual env values
- API key format validation (correct prefixes, not just non-empty)
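
One possible shape for the pre-write `${VAR}` check is a recursive walk over the parsed config that reports every reference with no value in the environment. This is a sketch under assumptions: `findUnresolvedEnvRefs` is a hypothetical helper, the reference pattern is assumed to be `${NAME}` as in the incident above, and the real rules in `env-substitution.ts` may differ (defaults, escaping, and so on).

```typescript
// Hypothetical sketch: collect ${VAR} references that do not resolve.
// The ${NAME} syntax is an assumption based on the examples in this issue.

interface UnresolvedRef {
  path: string; // dotted path to the config value containing the reference
  name: string; // the env var name that did not resolve
}

function findUnresolvedEnvRefs(
  value: unknown,
  env: Record<string, string | undefined>,
  path: string = "",
): UnresolvedRef[] {
  const missing: UnresolvedRef[] = [];
  if (typeof value === "string") {
    // Fresh regex per call so the global flag's lastIndex state is never shared.
    const ref = /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g;
    let match: RegExpExecArray | null;
    while ((match = ref.exec(value)) !== null) {
      const name = match[1];
      // Treat unset and empty variables as unresolved.
      if (env[name] === undefined || env[name] === "") {
        missing.push({ path, name });
      }
    }
  } else if (Array.isArray(value)) {
    value.forEach((item, i) => {
      missing.push(...findUnresolvedEnvRefs(item, env, `${path}[${i}]`));
    });
  } else if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value)) {
      missing.push(...findUnresolvedEnvRefs(child, env, path ? `${path}.${key}` : key));
    }
  }
  return missing;
}
```

A writer would call this with the candidate config and the gateway's environment (not the agent's), and refuse the write if the returned list is non-empty. The config keys used to exercise it, such as `models.openai.apiKey`, are made up for illustration.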

### Phase 3: Atomic Transactions
- `openclaw config apply --rollback-on-failure` — auto-revert if startup fails within N seconds
- Transactional config changes: fully applied or fully rolled back
- Env var resolution checking at write time
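
A rough sketch of what "fully applied or fully rolled back" could look like at the file level, using a temp-file write followed by a rename, and restoring the previous file if startup fails. Everything here is an assumption for illustration: `applyWithRollback` is a hypothetical function, the `.tmp` and `.prev` suffixes are invented, and the `startupSucceeds` callback stands in for "the gateway stayed healthy for N seconds", which would be asynchronous in practice.

```typescript
import * as fs from "node:fs";

// Hypothetical sketch of an apply-with-rollback flow. Not existing OpenClaw code.
function applyWithRollback(
  configPath: string,
  newContents: string,
  startupSucceeds: () => boolean, // placeholder for a real health check
): boolean {
  const tmpPath = `${configPath}.tmp`;
  const prevPath = `${configPath}.prev`;

  // Keep a copy of the current config so it can be restored on failure.
  const hadPrevious = fs.existsSync(configPath);
  if (hadPrevious) fs.copyFileSync(configPath, prevPath);

  // Write to a temp file, then rename over the target. On POSIX systems a
  // rename within one filesystem replaces the file in a single step, so
  // readers never observe a half-written config.
  fs.writeFileSync(tmpPath, newContents);
  fs.renameSync(tmpPath, configPath);

  if (startupSucceeds()) {
    if (hadPrevious) fs.unlinkSync(prevPath);
    return true;
  }

  // Startup failed: put the previous config back, or remove the new one
  // if there was nothing to restore.
  if (hadPrevious) fs.renameSync(prevPath, configPath);
  else fs.unlinkSync(configPath);
  return false;
}
```

The return value tells the caller whether the change stuck. A fuller version would also preserve the rejected contents for debugging, in line with the logging step in Phase 1.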

## Why This Matters

Every OpenClaw user running AI agents (sub-agents, Antfarm workflows, cron-based automation) is one bad config write away from a crash loop that requires SSH access to fix. The gateway should be self-healing, not fragile.

The existing `.bak` rotation helps with accidental overwrites but does NOT help with crash loops — all 5 backups can be bad configs if the agent keeps writing.
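
To make the rotation problem concrete, here is a toy simulation. It assumes each write pushes the outgoing config onto a newest-first list capped at 5 entries; the actual behaviour of `backup-rotation.ts` may differ, and both function names are hypothetical.

```typescript
// Toy model of a 5-slot backup rotation (assumed behaviour, not the real module).
function rotateBackups(backups: string[], outgoing: string, keep: number = 5): string[] {
  // Newest backup first; anything beyond `keep` entries is discarded.
  return [outgoing, ...backups].slice(0, keep);
}

// Start from one good config and apply `badWrites` consecutive bad writes.
// Returns the backup list afterwards.
function simulateBadWrites(badWrites: number): string[] {
  let current = "good";
  let backups: string[] = [];
  for (let i = 1; i <= badWrites; i++) {
    backups = rotateBackups(backups, current); // back up what is being replaced
    current = `bad-${i}`;
  }
  return backups;
}
```

Under this model the good config survives five bad writes as the oldest backup, and the sixth write evicts it. A last-known-good file avoids this because it is only updated after a successful startup, never on write.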
