fix(config): serialize async config writes to prevent data loss on startup#205

Open
BingqingLyu wants to merge 14 commits into main from fork-pr-40464-fix-40410-config-wipe-on-restart

BingqingLyu (Owner) commented Apr 27, 2026

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: When OpenClaw Gateway restarts or encounters a connection error, the configuration file (openclaw.json) can be wiped or truncated down to a minimal skeleton (~10 lines), resulting in data loss.
  • Why it matters: This causes the Gateway to fail startup with a "Gateway start blocked" error and forces users to rebuild their configuration manually.
  • What changed:
    • Introduced a sequential configWriteQueue in src/config/io.ts to strictly serialize config disk writes.
    • Refactored the ownerDisplaySecret auto-persist routine to perform an atomic read-modify-write inside the queue lock, rather than computing a merge patch using a stale config snapshot.
    • Replaced loadConfig() with readConfigFileSnapshotForWrite() during secrets persistence, so that ephemeral runtime overrides are not baked permanently into the file on disk.
    • Captured createConfigIO() and runtimeConfigSnapshot state before entering the write queue, so that path or environment changes at execution time cannot affect a queued write.
    • Preserved retry semantics for auto-secrets by rejecting explicitly (throw Error) on invalid snapshots instead of returning silently.
    • Prevented queued writes from reviving cleared runtime snapshots by re-checking the live global state (runtimeConfigSnapshot) when the write runs.
  • What did NOT change (scope boundary): The structure of OpenClawConfig, JSON patching mechanisms (merge patch), and general config parsing behavior remain untouched.
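
The serialization described above can be illustrated with a minimal sketch. This is not the code from src/config/io.ts: only the name configWriteQueue comes from the PR, while enqueueConfigWrite, readFromDisk, writeToDisk and the in-memory disk are hypothetical stand-ins so the example runs without touching the filesystem. It assumes the queue is a promise chain and that each update reads the current file contents inside its queue slot.

```typescript
// Hypothetical model of a serialized config write queue.
// The real implementation writes openclaw.json; here "disk" is an in-memory object.
type Config = Record<string, unknown>;

let disk: Config = { gateway: { port: 8080 } };

// Deep-copy helpers so callers never share references with the "disk".
const clone = (value: Config): Config => JSON.parse(JSON.stringify(value));
const readFromDisk = async (): Promise<Config> => clone(disk);
const writeToDisk = async (next: Config): Promise<void> => {
  disk = clone(next);
};

// Tail of the queue: every write is chained after the previous one settles.
let configWriteQueue: Promise<void> = Promise.resolve();

function enqueueConfigWrite(update: (current: Config) => Config): Promise<void> {
  const run = async (): Promise<void> => {
    // The read-modify-write happens entirely inside the queue slot,
    // so the update never works from a snapshot taken before an earlier write.
    const current = await readFromDisk();
    await writeToDisk(update(current));
  };
  // Run after the previous write, whether it succeeded or failed.
  const result = configWriteQueue.then(run, run);
  // Keep the tail non-rejecting so one failure does not block later writes.
  configWriteQueue = result.catch(() => undefined);
  return result;
}
```

With this shape, two writes issued back to back (for example one adding ownerDisplaySecret and one adding auth settings) are applied one after the other, and the second sees the result of the first rather than an older snapshot.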

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

Restarting the Gateway or the host system no longer wipes the user's ~/.openclaw/openclaw.json configuration file. No defaults or config schemas were changed.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Linux (Ubuntu/Debian) / macOS
  • Runtime/container: Node.js 24.x
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): Any configuration with existing keys (e.g., gateway.port, gateway.auth)

Steps

  1. Start the Gateway with a working, populated OpenClaw configuration file and wait for startup to complete.
  2. Restart the system or restart the OpenClaw Gateway.
  3. Observe the ~/.openclaw/openclaw.json file.

Expected

  • The configuration file properties (such as existing auth configuration and port) remain intact.

Actual

  • Prior to this PR: The configuration file is overwritten with a nearly empty file containing only the ownerDisplaySecret and basic defaults.
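
A simplified model of how an unserialized write based on a stale snapshot can produce this outcome is sketched below. It is an illustration under assumptions, not the actual code path: writeConfigUnserialized, reproduceStaleOverwrite, the delays and the config values are all hypothetical, and the timers only stand in for whatever ordering the real startup produces.

```typescript
// Hypothetical reproduction of a stale-snapshot overwrite with two unserialized writers.
type Config = Record<string, unknown>;

let disk: Config = {};

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Unserialized writer: whatever it is handed replaces the whole file after a delay.
async function writeConfigUnserialized(next: Config, delayMs: number): Promise<void> {
  await sleep(delayMs);
  disk = next;
}

async function reproduceStaleOverwrite(): Promise<Config> {
  // Snapshot captured before the populated config has been written.
  const staleSnapshot: Config = { ...disk };

  // Writer A persists the full configuration and lands first.
  const fullWrite = writeConfigUnserialized({ gateway: { port: 8080 } }, 5);

  // Writer B merges a generated secret into the stale snapshot and lands last,
  // replacing the full configuration with a near-empty one.
  const secretWrite = writeConfigUnserialized(
    { ...staleSnapshot, ownerDisplaySecret: "generated" },
    10,
  );

  await Promise.all([fullWrite, secretWrite]);
  return disk;
}
```

In this model the final file holds only ownerDisplaySecret, because the later write was computed from a snapshot that predates the full configuration.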

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

(Note: Added src/config/io.write-config-queue.test.ts to reproduce the race condition. The tests pass with this fix.)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: Ran targeted tests confirming sequential module-level writes do not corrupt configuration.
  • Edge cases checked: Simulated the exact ownerDisplaySecret pattern with concurrent writeConfigFile calls to verify the stale snapshot issue is fully mitigated by the internal queue lock.
  • What you did not verify: I did not manually verify the Windows scheduling behavior, as the fix resides at the platform-agnostic config file I/O layer.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert this specific PR commit.
  • Files/config to restore: Any previously backed up openclaw.json.
  • Known bad symptoms reviewers should watch for: Deadlocks or hangs during early Gateway startup (if the write queue resolves incorrectly).

Risks and Mitigations

  • Risk: The write queue promise (configWriteQueue) could theoretically deadlock if a config write throws an unhandled error and the chain never resolves.
    • Mitigation: Each queued write is chained with .then(success, failure), which behaves like a .finally(): a rejected write is caught and the queue still advances to the next operation, so a failed write should not stall the writes behind it.
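
The mitigation can be sketched as follows, assuming the queue is a plain promise chain. The names queue, enqueue and demo are hypothetical and the "disk full" error is invented for illustration; the point is only that a rejected task is reported to its own caller while the chain continues.

```typescript
// Hypothetical promise-chain queue that keeps advancing after a failed task.
let queue: Promise<void> = Promise.resolve();

function enqueue(task: () => Promise<void>): Promise<void> {
  // Start the task once the previous one has settled, on success or failure.
  const result = queue.then(task, task);
  // The tail swallows the outcome so a rejection cannot block later tasks;
  // the caller still observes the rejection through the returned promise.
  queue = result.then(() => undefined, () => undefined);
  return result;
}

async function demo(): Promise<string[]> {
  const log: string[] = [];
  const first = enqueue(async () => {
    log.push("first:start");
    throw new Error("disk full");
  });
  const second = enqueue(async () => {
    log.push("second:ran");
  });
  // The first caller sees its own failure and can retry or report it.
  await first.catch((err: Error) => log.push(`first:rejected:${err.message}`));
  // The second task still runs even though the first one rejected.
  await second;
  return log;
}
```

Returning the original result promise to the caller, rather than the swallowed tail, is what keeps explicit rejections visible for retry logic while the queue itself stays unblocked.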

Development

Successfully merging this pull request may close these issues.

[Bug]: Config file wiped on Gateway restart
