Config recovery can silently restore stale last-good snapshots during reload/version-skew windows #71289

@100yenadmin

Summary

Config recovery/reload can silently revert valid user config to stale last-good snapshots during mixed-version or plugin-schema validation windows, causing loss of settings that directly affect agent behavior and user trust.

This surfaced in production-like single-tenant deployments after host/package upgrades and config reloads. The failure mode is public-facing because it can revert user-visible behavior settings (for example reasoning/thinking mode, speed/fast mode, plugin config shape, model routing, and other config-backed behavior) without an explicit user action and without a clean explanation in the UI.

I believe this is at least a P1, and arguably P0 for trust/integrity, because the system can appear to accept config changes, then later silently snap back to an older state via internal recovery logic.

Impact

Observed impact includes silent or semi-silent reversion of behavior-affecting settings after reload/restart windows, including examples like:

  • thinking / reasoning mode not sticking
  • fast mode not sticking
  • newer plugin config fields disappearing
  • valid newer config being replaced by an older "last good" snapshot
  • plugin min-host-version checks causing config recovery instead of clean quarantine/isolation

Even when the system logs warnings, the user-visible effect is "settings changed themselves". That is trust-damaging.

Why this is critical

This is not just a validation error. The more serious issue is that internal recovery can restore stale config in a way that makes the system appear nondeterministic.

From the operator/user perspective:

  1. config is written
  2. config may validate successfully in direct CLI validation
  3. gateway reload/restart happens
  4. internal recovery decides config is invalid in a different path/context
  5. openclaw.json is auto-restored from openclaw.json.last-good
  6. behavior changes unexpectedly

That can reset core behavior knobs and create the appearance that the assistant is "drifting" or ignoring instructions, when the underlying cause is config recovery.

Observed failure classes

We observed two distinct invalidation classes that both lead into config recovery:

1) Plugin config schema rejects newer fields

Example logged error:

Invalid config at /root/.openclaw/openclaw.json:
- plugins.entries.lossless-claw.config: invalid config: must NOT have additional properties
- plugins.entries.lossless-claw.config.cacheAwareCompaction: invalid config: must NOT have additional properties

This led to startup recovery:

  • .clobbered.* created
  • openclaw.json.last-good restored
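To make this failure class concrete, here is a minimal sketch of how a strict schema (one that forbids additional properties) rejects a field that a newer schema accepts. This is not OpenClaw's actual validator: `PluginSchema`, `findAdditionalProperties`, the `enabled` key, and the shape of the `cacheAwareCompaction` value are hypothetical and used only for illustration.

```typescript
// Hypothetical stand-in for a plugin config schema that forbids
// additional properties. Not OpenClaw's real validator.
type PluginSchema = { allowedKeys: ReadonlySet<string> };

// Returns the keys the schema does not know about.
function findAdditionalProperties(
  config: Record<string, unknown>,
  schema: PluginSchema,
): string[] {
  return Object.keys(config).filter((key) => !schema.allowedKeys.has(key));
}

// Assumed older schema: does not know about cacheAwareCompaction yet.
const olderSchema: PluginSchema = { allowedKeys: new Set(["enabled"]) };

// Assumed newer schema: adds the field.
const newerSchema: PluginSchema = {
  allowedKeys: new Set(["enabled", "cacheAwareCompaction"]),
};

// The same plugin config block, validated in both contexts.
// The value of cacheAwareCompaction is a placeholder.
const pluginConfig: Record<string, unknown> = {
  enabled: true,
  cacheAwareCompaction: {},
};

const rejectedByOlder = findAdditionalProperties(pluginConfig, olderSchema);
const rejectedByNewer = findAdditionalProperties(pluginConfig, newerSchema);

console.log(rejectedByOlder); // [ 'cacheAwareCompaction' ]
console.log(rejectedByNewer); // []
```

The same bytes on disk are valid or invalid depending only on which schema version the validating process has loaded, which is why treating this as "the config is bad" and restoring an older file is risky.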

2) Mixed-version/plugin-manifest window during reload

Example logged error:

plugins.entries.feishu: plugin feishu: plugin requires OpenClaw >=2026.4.23, but this host is 2026.4.22; skipping load
plugins.entries.whatsapp: plugin whatsapp: plugin requires OpenClaw >=2026.4.23, but this host is 2026.4.22; skipping load

This happened after package files had been upgraded but the running process/reload path still behaved as the older host version.

This led to reload recovery:

Config auto-restored from last-known-good: /root/.openclaw/openclaw.json (reload-invalid-config)
[reload] config reload restored last-known-good config after invalid-config

So a mixed old-process/new-files state appears sufficient to trigger rollback.
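The min-host-version check in the log above can be sketched as a plain dotted-version comparison. This is an illustration only, assuming versions are dot-separated integers as in `2026.4.22`. `parseVersion`, `compareVersions`, and `satisfiesMinHost` are hypothetical names, not functions from the OpenClaw codebase.

```typescript
// Parse a dotted numeric version such as "2026.4.22" into its parts.
function parseVersion(version: string): number[] {
  return version.split(".").map((part) => {
    const n = Number.parseInt(part, 10);
    if (Number.isNaN(n)) throw new Error(`Unparseable version: ${version}`);
    return n;
  });
}

// Returns a negative, zero, or positive number, like a sort comparator.
function compareVersions(a: string, b: string): number {
  const pa = parseVersion(a);
  const pb = parseVersion(b);
  const len = Math.max(pa.length, pb.length);
  for (let i = 0; i < len; i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

function satisfiesMinHost(hostVersion: string, minHostVersion: string): boolean {
  return compareVersions(hostVersion, minHostVersion) >= 0;
}

const minHost = "2026.4.23"; // from the plugin requirement in the log
const staleProcessVersion = "2026.4.22"; // version the running process reports
const upgradedFilesVersion = "2026.4.23"; // assumed version of the files on disk

console.log(satisfiesMinHost(staleProcessVersion, minHost)); // false
console.log(satisfiesMinHost(upgradedFilesVersion, minHost)); // true
```

The check itself is correct in both cases. The problem described in this report is what happens after it fails inside a stale process: the failure is escalated to a whole-config rollback.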

Reproduction shape (sanitized)

A sanitized reproduction outline:

  1. Start from a host with valid config A and openclaw.json.last-good aligned to A.
  2. Introduce valid newer config B that includes newer plugin config fields and/or plugins requiring a newer host version.
  3. Upgrade package files or plugin files.
  4. During a reload/startup window, let validation occur in a runtime context that still sees older host/plugin schema state.
  5. Observe:
    • config.observe with gateway-run-invalid-config or reload-invalid-config
    • .clobbered.* creation
    • automatic restore from openclaw.json.last-good
    • config file content snapping back to older shape A

Key detail: direct CLI validation and runtime reload validation may not agree under this window, which makes the system especially confusing to operators.

Concrete symptoms observed

  • config-health.json continued to point lastKnownGood / lastPromotedGood at an older simpler config snapshot
  • gateway logs showed both startup and reload recovery paths
  • direct validation on a fully-upgraded host could succeed, while earlier reload path had already reverted the file
  • behavior-affecting settings were effectively not durable even after writes appeared to succeed

Relevant code paths

These appear to be the main code families involved:

  • src/gateway/server-startup-config.ts
  • src/gateway/config-reload.ts
  • src/gateway/server-reload-handlers.ts
  • src/config/io.ts
  • src/plugins/manifest-registry.ts

The especially important behavior appears to be:

  • plugin-aware validation during config write / reload
  • recovery via recoverConfigFromLastKnownGood(...)
  • promotion/retention of lastKnownGood / lastPromotedGood
  • reload path treating plugin min-host-version incompatibility as config-invalid and restoring stale config

Suspected root cause

Likely combination of:

  1. plugin-aware config validation bound to the currently running host/plugin manifest set
  2. recovery path restoring stale last-good snapshots on invalidation
  3. mixed-version window where upgraded files and older running process disagree
  4. invalidation semantics that are too broad: plugin compatibility failure causes full config recovery instead of isolating the incompatible plugin entries

What should happen instead

Safer behavior would be something like:

  • Do not roll back the entire config file because a subset of plugin entries is invalid for the currently running host version.
  • Quarantine or skip the incompatible plugin entries, but preserve the rest of the config.
  • Make lastKnownGood promotion rules stricter and more observable.
  • Avoid using stale last-good snapshots as broad rollback targets when invalidation is caused by version skew.
  • Make reload-time validation and direct CLI validation consistent, or at least clearly explain mismatch conditions.
  • Expose a clear operator-visible reason when recovery occurs and what exact keys were reverted.
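A rough sketch of the quarantine idea, under stated assumptions: validation issues carry a dotted path such as `plugins.entries.feishu`, plugin ids contain no dots, and the quarantine is applied to the in-memory effective config only, leaving the file on disk untouched. All type and function names here (`Config`, `Issue`, `quarantineInvalidPlugins`) and the `thinking` key are hypothetical.

```typescript
type PluginEntry = { config?: Record<string, unknown> };
type Config = {
  plugins?: { entries?: Record<string, PluginEntry> };
  [key: string]: unknown;
};
type Issue = { path: string; message: string };

const PLUGIN_ENTRY_PREFIX = "plugins.entries.";

// Extract the plugin id from an issue path, if the issue is plugin-scoped.
function pluginIdFromIssue(issue: Issue): string | undefined {
  if (!issue.path.startsWith(PLUGIN_ENTRY_PREFIX)) return undefined;
  const id = issue.path.slice(PLUGIN_ENTRY_PREFIX.length).split(".")[0];
  return id ? id : undefined;
}

// Drop only the plugin entries that have issues; keep everything else.
function quarantineInvalidPlugins(config: Config, issues: Issue[]) {
  const quarantined = new Set<string>();
  const unrelatedIssues: Issue[] = [];
  for (const issue of issues) {
    const id = pluginIdFromIssue(issue);
    if (id !== undefined) quarantined.add(id);
    else unrelatedIssues.push(issue);
  }

  const keptEntries: Record<string, PluginEntry> = {};
  for (const [id, entry] of Object.entries(config.plugins?.entries ?? {})) {
    if (!quarantined.has(id)) keptEntries[id] = entry;
  }

  const effective: Config = {
    ...config,
    plugins: { ...config.plugins, entries: keptEntries },
  };
  return { effective, quarantined: Array.from(quarantined), unrelatedIssues };
}

// Hypothetical config B: one behavior setting plus two plugin entries.
const configB: Config = {
  thinking: "high",
  plugins: {
    entries: {
      feishu: { config: {} },
      "lossless-claw": { config: {} },
    },
  },
};

const quarantineResult = quarantineInvalidPlugins(configB, [
  {
    path: "plugins.entries.feishu",
    message: "plugin requires OpenClaw >=2026.4.23, but this host is 2026.4.22",
  },
]);

console.log(quarantineResult.quarantined); // [ 'feishu' ]
console.log(quarantineResult.effective["thinking"]); // high
```

Only issues outside `plugins.entries.*` (returned as `unrelatedIssues`) would remain candidates for broader handling. Whether those should ever trigger a full restore is a separate design decision.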

Why upstream should prioritize this

This kind of issue undermines trust in configuration durability. It is worse than a loud failure because it can create the impression that:

  • settings randomly revert
  • the assistant ignores user/operator instructions
  • model/behavior tuning is unreliable
  • upgrades destabilize behavior even when config changes are valid

In agent systems, that kind of silent behavior regression is especially damaging.

Suggested fixes

Potential mitigation directions:

  1. Narrow recovery scope:

    • plugin entry invalid -> disable/quarantine that plugin entry only
    • do not rewrite whole config from last-good
  2. Version-skew guard:

    • if package version on disk and running host version differ, suppress config recovery and require explicit restart before promotion/recovery logic runs
  3. Recovery transparency:

    • emit a machine-readable recovery event with the exact config paths reverted
    • include previous hash, restored hash, and invalidating issues
  4. Last-good hygiene:

    • do not keep promoting old snapshots when newer writes are pending validation across version transitions
    • keep version-scoped recovery baselines
  5. Stronger tests:

    • mixed-version reload tests
    • plugin min-host-version incompatibility during reload
    • schema evolution tests for plugin config blocks
    • ensure behavior-affecting settings are not lost on unrelated plugin validation failures
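Two of the directions above (the version-skew guard and recovery transparency) could look roughly like the following. This is a sketch, not a proposed patch: `decideRecovery`, `changedPaths`, `RecoveryEvent`, and the `thinking` / `fastMode` keys are hypothetical, and the hashes are assumed to be computed elsewhere and passed in.

```typescript
type RecoveryDecision =
  | { action: "recover" }
  | { action: "defer"; reason: string };

// Version-skew guard: refuse to run recovery while the running process
// and the files on disk disagree about the host version.
function decideRecovery(runningVersion: string, onDiskVersion: string): RecoveryDecision {
  if (runningVersion !== onDiskVersion) {
    return {
      action: "defer",
      reason: `version skew: running ${runningVersion}, on disk ${onDiskVersion}; restart required`,
    };
  }
  return { action: "recover" };
}

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isObject(v: Json | undefined): v is { [key: string]: Json } {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Collect dotted paths whose values differ between two configs.
// Arrays and scalars are compared as whole values via JSON.stringify.
function changedPaths(before: Json | undefined, after: Json | undefined, prefix = ""): string[] {
  if (isObject(before) && isObject(after)) {
    const keys = Array.from(
      new Set([...Object.keys(before), ...Object.keys(after)]),
    ).sort();
    const paths: string[] = [];
    for (const key of keys) {
      const childPrefix = prefix ? `${prefix}.${key}` : key;
      paths.push(...changedPaths(before[key], after[key], childPrefix));
    }
    return paths;
  }
  return JSON.stringify(before) === JSON.stringify(after) ? [] : [prefix];
}

// Machine-readable event describing exactly what a recovery changed.
type RecoveryEvent = {
  reason: string;
  previousHash: string;
  restoredHash: string;
  revertedPaths: string[];
  issues: string[];
};

// Hypothetical rejected config B and restored last-good config A.
const rejectedConfig: Json = {
  thinking: "high",
  fastMode: true,
  plugins: { entries: { feishu: {} } },
};
const restoredConfig: Json = {
  thinking: "low",
  fastMode: true,
  plugins: { entries: {} },
};

const revertedPaths = changedPaths(rejectedConfig, restoredConfig);
console.log(revertedPaths); // [ 'plugins.entries.feishu', 'thinking' ]
console.log(decideRecovery("2026.4.22", "2026.4.23").action); // defer
```

A `RecoveryEvent` populated with `revertedPaths` would let an operator see that `thinking` was reverted as collateral damage from a plugin-scoped issue, which is the visibility this report is asking for.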

Evidence snippets (sanitized)

Config auto-restored from last-known-good: /path/to/openclaw.json (reload-invalid-config)
[reload] config reload restored last-known-good config after invalid-config
Invalid config at /path/to/openclaw.json:
- plugins.entries.<plugin>.config: invalid config: must NOT have additional properties
plugin requires OpenClaw >=2026.4.23, but this host is 2026.4.22; skipping load

Ask

Please triage this as a high-severity config integrity/recovery regression. I’m happy to provide more details privately if needed, but I’m keeping this report sanitized because the deployments involved are customer-facing.
