Skip to content

fix(config): serialize concurrent writeConfigFile calls with a file lock#612

Open
BingqingLyu wants to merge 4 commits into
mainfrom
fork-pr-44997-fix-config-write-lock
Open

fix(config): serialize concurrent writeConfigFile calls with a file lock#612
BingqingLyu wants to merge 4 commits into
mainfrom
fork-pr-44997-fix-config-write-lock

Conversation

@BingqingLyu

@BingqingLyu BingqingLyu commented Apr 27, 2026

Copy link
Copy Markdown
Owner

Summary

Concurrent openclaw agents add invocations race on the global openclaw.json file. Each process reads the config independently, computes its own updated version, then writes via atomic rename. The rename is safe against torn writes but not against lost-update races: whichever process wins the rename last silently overwrites every change made by the others.

This is the root cause of the config-overwrite symptom reported in openclaw#43367:

Config overwrite: /home/user/.openclaw/openclaw.json (... -> ..., backup=...)

After parallel agents add commands, openclaw agents list shows only a subset of the agents.

Fix

Wrap the entire read-snapshot → compute → write sequence inside withFileLock(configPath, ...) using the existing file-lock primitive already used by the pairing store, auth-profile store, and other subsystems. Concurrent writes now queue on an advisory .lock file and each process re-reads the latest on-disk state before writing, so no agent entry is silently dropped.

Lock options: 15 retries, 1.5× back-off, 50 ms–5 s window, randomised jitter, 30 s stale timeout — tuned to handle typical parallel agent-add bursts without introducing noticeable latency for single-writer cases.

Testing

Run 4+ concurrent openclaw agents add commands targeting the same config file and verify all agents appear in openclaw agents list afterwards. Without this fix the test fails non-deterministically; with it, all agents persist.

Related

Closes openclaw#43367 (partial — this addresses the config-write race; session-lock and OAuth refresh races are separate issues)

Concurrent `openclaw agents add` invocations race on the global
openclaw.json file: each process reads the config independently, computes
its own updated version, then writes it back via an atomic rename.  The
rename is safe against torn writes but not against lost-update races:
whichever process wins the rename last silently overwrites every change
made by the others.  This is the root cause of the config-overwrite
symptom reported in openclaw#43367.

Fix: wrap the entire read-snapshot → compute → write sequence inside
withFileLock(configPath, ...) using the existing file-lock primitive
already used by the pairing store, auth-profile store, and others.
Concurrent writes now queue on an advisory .lock file and each process
re-reads the latest on-disk state before writing, so no agent entry is
silently dropped.

Lock options: 15 retries, 1.5× back-off, 50 ms–5 s window, randomised
jitter, 30 s stale timeout — tuned to handle typical parallel agent-add
bursts without introducing noticeable latency for single-writer cases.

Fixes: openclaw#43367
…it in agents add

Addresses the lost-update race on agents.list identified by @greptile-apps
in the review of this PR.

The previous commit serialised concurrent writes via withFileLock, which
prevents torn-rename races (two processes writing simultaneously).
However, cfg was pre-computed by callers *before* entering the lock.
createMergePatch treats arrays atomically, so when two processes both
derive their configs from the same stale base and add different agents,
the second writer's patch replaces agents.list wholesale, silently
dropping the first writer's entry.

This commit introduces updateConfigFile(transform, options?) — a new
sibling to writeConfigFile that accepts a transformation function instead
of a pre-computed config.  The transform receives the freshest on-disk
config (re-read inside the advisory lock) and returns the desired next
config.  Because the read, transform, and write all happen under the same
lock, every concurrent writer sees the fully-committed result of all
preceding writes before computing its own mutation.

agentsAddCommand (non-interactive path) is updated to use
updateConfigFile.  The agent-creation and binding logic is moved inside
the transform so it always operates on the current agents.list, making
concurrent 'openclaw agents add' commands fully safe with respect to
array-valued config fields.

writeConfigFile is kept as-is for callers that write complete,
self-contained configs (gateway bootstrap, model config, etc.) where
the pre-computed pattern is intentional and correct.
…timePatch, add snapshot invalidation

Two bugs in the module-level updateConfigFile wrapper:

1. The runtimePatch block computed createMergePatch(runtimeConfigSnapshot, transformed)
   where  was derived from the fresh disk state (not from runtimeConfigSnapshot).
   In writeConfigFile this pattern is valid because cfg comes from runtimeConfigSnapshot.
   Here the bases differ so the patch is semantically wrong in gateway mode.

2. After writing, the runtime snapshot was never invalidated, leaving in-memory
   state stale for subsequent loadConfig() callers in gateway mode.

Fix: remove the incorrect runtimePatch overlay (env-var ref restoration is
already handled inside writeConfigFileUnderLock via restoreEnvVarRefs), and
add clearRuntimeConfigSnapshot() when a snapshot was active so subsequent
reads go back to disk.

This only affects gateway mode — the CLI agents add use case runs without a
runtime snapshot and was already correct.
withFileLock calls fs.mkdir without an explicit mode, which creates the
directory with default permissions (0755, world-readable). The subsequent
mkdir in writeConfigFileUnderLock passes mode:0700 but is a no-op on an
already-existing directory, so the first write to a new config location
would leave the config directory accessible to other local users until the
process was restarted — a potential secret-exposure regression.

Fix: add an explicit mkdir(configDir, { recursive:true, mode:0o700 }) in
both writeConfigFile and updateConfigFile before the withFileLock call so
we always win the race to create the directory with hardened permissions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Multi-agent orchestration is unstable: concurrent agents add/config overwrites, session-lock failures, and detached child work

2 participants