feat: atomic config management with validation and crash-loop rollback #17700

@aronchick

Problem

When openclaw.json is modified with invalid values, the gateway crashes on reload/restart and enters a crash loop. There is no rollback mechanism, no pre-apply validation that catches unresolvable env vars, and no fallback to a known-good state.

This is the #1 cause of gateway outages for users who have AI agents modifying their config — which is most OpenClaw users.

Real Incident (2026-02-16)

  1. A sub-agent replaced API keys in openclaw.json with ${ENV_VAR} references (e.g. ${OPENAI_API_KEY})
  2. The gateway restarted, but the env vars were not in the process environment
  3. All API calls failed because the unresolved references were used as literal key strings such as "${OPENAI_API_KEY}"
  4. The gateway entered a crash loop requiring manual intervention to restore config

In a separate incident, 20 Antfarm agents were added to agents.list, which hijacked Telegram message routing.

The meta-problem: AI agents modifying config files are the most common cause of crash loops. Agents don't always understand the runtime environment, and a single bad write can take down the entire gateway with no automated recovery.

Current State

OpenClaw already has:

  • Config validation via Zod schema (validation.ts)
  • Backup rotation (backup-rotation.ts — keeps 5 .bak files)
  • Env var substitution (env-substitution.ts)

What's missing:

  • Pre-write validation that checks env var references actually resolve
  • A "last-known-good" config that's only updated after successful gateway startup
  • Crash-loop detection with automatic rollback
  • A dry-run/diff mode for config changes

Proposed Solution

Phase 1: Last-Known-Good + Crash-Loop Revert (highest impact, simplest)

  • On successful gateway startup (healthy for >10s), save config as openclaw.json.last-known-good
  • On crash loop (>3 crashes in 60s), auto-revert to last-known-good
  • Log the failed config for debugging
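
A minimal sketch of the crash-loop check in Phase 1, assuming a sliding time window. `CrashLoopDetector` and `CrashLoopPolicy` are hypothetical names for illustration and do not exist in OpenClaw; the thresholds are the ones proposed above (more than 3 crashes in 60s).

```typescript
// Hypothetical sketch: these names are illustrative, not existing OpenClaw APIs.
interface CrashLoopPolicy {
  maxCrashes: number; // revert once the crash count in the window exceeds this
  windowMs: number;   // length of the sliding window in milliseconds
}

class CrashLoopDetector {
  private crashTimes: number[] = [];
  private readonly policy: CrashLoopPolicy;

  constructor(policy: CrashLoopPolicy) {
    this.policy = policy;
  }

  /**
   * Record a crash at time `now` (ms since any fixed epoch) and report
   * whether the number of crashes inside the window exceeds the threshold.
   */
  recordCrash(now: number): boolean {
    this.crashTimes.push(now);
    const cutoff = now - this.policy.windowMs;
    // Drop crashes that have aged out of the window.
    this.crashTimes = this.crashTimes.filter((t) => t > cutoff);
    return this.crashTimes.length > this.policy.maxCrashes;
  }
}

// Usage: the supervisor would call recordCrash(Date.now()) on every gateway exit
// and restore openclaw.json.last-known-good when it returns true.
const detector = new CrashLoopDetector({ maxCrashes: 3, windowMs: 60_000 });
```

Taking the timestamp as an argument instead of reading the clock inside the class keeps the check deterministic and easy to unit test.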

Phase 2: Validation + Dry-Run

  • openclaw config diff — show pending changes
  • openclaw config apply --dry-run — validate without applying
  • Pre-write check that all ${VAR} references resolve to actual env values
  • API key format validation (correct prefixes, not just non-empty)
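
The pre-write env var check could look roughly like the sketch below. It walks a parsed config and lists every `${VAR}` reference that has no non-empty value in the supplied environment. `findUnresolvedEnvRefs` is a hypothetical helper, and the variable-name pattern is an assumption; the actual syntax accepted by env-substitution.ts may differ.

```typescript
// Hypothetical sketch: not the existing env-substitution.ts logic.
// Assumption: references look like ${NAME} with shell-style identifier names.
const ENV_REF_SOURCE = "\\$\\{([A-Za-z_][A-Za-z0-9_]*)\\}";

type Env = Record<string, string | undefined>;

/** Return "path: NAME" for each ${NAME} in `value` that `env` cannot resolve. */
function findUnresolvedEnvRefs(value: unknown, env: Env, path: string = "$"): string[] {
  const missing: string[] = [];
  if (typeof value === "string") {
    const re = new RegExp(ENV_REF_SOURCE, "g"); // fresh regex, so no shared lastIndex state
    let match: RegExpExecArray | null;
    while ((match = re.exec(value)) !== null) {
      const name = match[1] ?? "";
      const resolved = env[name];
      if (resolved === undefined || resolved === "") {
        missing.push(path + ": " + name);
      }
    }
  } else if (Array.isArray(value)) {
    value.forEach((item, i) => {
      missing.push(...findUnresolvedEnvRefs(item, env, path + "[" + i + "]"));
    });
  } else if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    for (const key of Object.keys(obj)) {
      missing.push(...findUnresolvedEnvRefs(obj[key], env, path + "." + key));
    }
  }
  return missing;
}
```

A writer would call this with `process.env` before persisting the file and refuse the write when the returned list is non-empty. Note that the check is only meaningful if it runs against the environment the gateway process will actually see, which is the gap that caused the incident above.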

Phase 3: Atomic Transactions

  • openclaw config apply --rollback-on-failure — auto-revert if startup fails within N seconds
  • Transactional config changes: fully applied or fully rolled back
  • Env var resolution checking at write time
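
One way to get the "fully applied or fully rolled back" behaviour is to write through a temp file plus rename and to restore the previous contents when startup fails. The sketch below assumes Node.js; `writeFileAtomic`, `applyWithRollback`, the `startedOk` callback, and the `.rejected` suffix are all hypothetical and stand in for whatever health check and naming the gateway would use.

```typescript
import * as fs from "node:fs";

/**
 * Write via a temp file and rename, so a reader sees either the old file or
 * the new one, never a partial write. The rename is atomic on POSIX when both
 * paths are on the same filesystem. This sketch does not fsync, so it does not
 * guarantee durability across power loss.
 */
function writeFileAtomic(targetPath: string, contents: string): void {
  const tmpPath = targetPath + ".tmp-" + process.pid;
  fs.writeFileSync(tmpPath, contents, { encoding: "utf8", mode: 0o600 });
  fs.renameSync(tmpPath, targetPath);
}

/**
 * Apply `candidate`, then ask `startedOk` whether the gateway came up.
 * On failure, keep the candidate next to the config for debugging and
 * restore the previous contents. Returns true if the candidate was kept.
 */
function applyWithRollback(
  configPath: string,
  candidate: string,
  startedOk: () => boolean, // hypothetical health probe supplied by the caller
): boolean {
  const previous = fs.existsSync(configPath)
    ? fs.readFileSync(configPath, "utf8")
    : null;

  writeFileAtomic(configPath, candidate);
  if (startedOk()) {
    return true;
  }

  // Preserve the failed config so the bad write can be inspected later.
  writeFileAtomic(configPath + ".rejected", candidate);
  if (previous !== null) {
    writeFileAtomic(configPath, previous);
  } else {
    fs.rmSync(configPath, { force: true });
  }
  return false;
}
```

A real implementation would make the probe asynchronous with a timeout (the "N seconds" above); it is synchronous here only to keep the control flow easy to follow.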

Why This Matters

Every OpenClaw user running AI agents (sub-agents, Antfarm workflows, cron-based automation) is one bad config write away from a crash loop that requires SSH access to fix. The gateway should be self-healing, not fragile.

The existing .bak rotation helps with accidental overwrites but does NOT help with crash loops: if the agent keeps writing, all 5 backup slots can end up holding bad configs.

Metadata

Labels: stale (marked as stale due to inactivity)
