fix(config): serialize concurrent writeConfigFile calls with a file lock#612
Open
BingqingLyu wants to merge 4 commits into
Open
fix(config): serialize concurrent writeConfigFile calls with a file lock#612BingqingLyu wants to merge 4 commits into
BingqingLyu wants to merge 4 commits into
Conversation
Concurrent `openclaw agents add` invocations race on the global openclaw.json file: each process reads the config independently, computes its own updated version, then writes it back via an atomic rename. The rename is safe against torn writes but not against lost-update races: whichever process wins the rename last silently overwrites every change made by the others. This is the root cause of the config-overwrite symptom reported in openclaw#43367. Fix: wrap the entire read-snapshot → compute → write sequence inside withFileLock(configPath, ...) using the existing file-lock primitive already used by the pairing store, auth-profile store, and others. Concurrent writes now queue on an advisory .lock file and each process re-reads the latest on-disk state before writing, so no agent entry is silently dropped. Lock options: 15 retries, 1.5× back-off, 50 ms–5 s window, randomised jitter, 30 s stale timeout — tuned to handle typical parallel agent-add bursts without introducing noticeable latency for single-writer cases. Fixes: openclaw#43367
…it in agents add Addresses the lost-update race on agents.list identified by @greptile-apps in the review of this PR. The previous commit serialised concurrent writes via withFileLock, which prevents torn-rename races (two processes writing simultaneously). However, cfg was pre-computed by callers *before* entering the lock. createMergePatch treats arrays atomically, so when two processes both derive their configs from the same stale base and add different agents, the second writer's patch replaces agents.list wholesale, silently dropping the first writer's entry. This commit introduces updateConfigFile(transform, options?) — a new sibling to writeConfigFile that accepts a transformation function instead of a pre-computed config. The transform receives the freshest on-disk config (re-read inside the advisory lock) and returns the desired next config. Because the read, transform, and write all happen under the same lock, every concurrent writer sees the fully-committed result of all preceding writes before computing its own mutation. agentsAddCommand (non-interactive path) is updated to use updateConfigFile. The agent-creation and binding logic is moved inside the transform so it always operates on the current agents.list, making concurrent 'openclaw agents add' commands fully safe with respect to array-valued config fields. writeConfigFile is kept as-is for callers that write complete, self-contained configs (gateway bootstrap, model config, etc.) where the pre-computed pattern is intentional and correct.
…timePatch, add snapshot invalidation Two bugs in the module-level updateConfigFile wrapper: 1. The runtimePatch block computed createMergePatch(runtimeConfigSnapshot, transformed) where was derived from the fresh disk state (not from runtimeConfigSnapshot). In writeConfigFile this pattern is valid because cfg comes from runtimeConfigSnapshot. Here the bases differ so the patch is semantically wrong in gateway mode. 2. After writing, the runtime snapshot was never invalidated, leaving in-memory state stale for subsequent loadConfig() callers in gateway mode. Fix: remove the incorrect runtimePatch overlay (env-var ref restoration is already handled inside writeConfigFileUnderLock via restoreEnvVarRefs), and add clearRuntimeConfigSnapshot() when a snapshot was active so subsequent reads go back to disk. This only affects gateway mode — the CLI agents add use case runs without a runtime snapshot and was already correct.
withFileLock calls fs.mkdir without an explicit mode, which creates the
directory with default permissions (0755, world-readable). The subsequent
mkdir in writeConfigFileUnderLock passes mode:0700 but is a no-op on an
already-existing directory, so the first write to a new config location
would leave the config directory accessible to other local users until the
process was restarted — a potential secret-exposure regression.
Fix: add an explicit mkdir(configDir, { recursive:true, mode:0o700 }) in
both writeConfigFile and updateConfigFile before the withFileLock call so
we always win the race to create the directory with hardened permissions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Concurrent
openclaw agents addinvocations race on the globalopenclaw.jsonfile. Each process reads the config independently, computes its own updated version, then writes via atomic rename. The rename is safe against torn writes but not against lost-update races: whichever process wins the rename last silently overwrites every change made by the others.This is the root cause of the config-overwrite symptom reported in openclaw#43367:
After parallel
agents addcommands,openclaw agents listshows only a subset of the agents.Fix
Wrap the entire read-snapshot → compute → write sequence inside
withFileLock(configPath, ...)using the existing file-lock primitive already used by the pairing store, auth-profile store, and other subsystems. Concurrent writes now queue on an advisory.lockfile and each process re-reads the latest on-disk state before writing, so no agent entry is silently dropped.Lock options: 15 retries, 1.5× back-off, 50 ms–5 s window, randomised jitter, 30 s stale timeout — tuned to handle typical parallel agent-add bursts without introducing noticeable latency for single-writer cases.
Testing
Run 4+ concurrent
openclaw agents addcommands targeting the same config file and verify all agents appear inopenclaw agents listafterwards. Without this fix the test fails non-deterministically; with it, all agents persist.Related
Closes openclaw#43367 (partial — this addresses the config-write race; session-lock and OAuth refresh races are separate issues)