fix(commitments): serialize load-modify-save with in-process queue + cross-process file lock#81153
fix(commitments): serialize load-modify-save with in-process queue + cross-process file lock#81153ai-hpc wants to merge 1 commit into
Conversation
|
Codex review: needs maintainer review before merge. Reviewed May 24, 2026, 10:04 PM ET / 02:04 UTC. Summary PR surface: Source +153, Tests +61. Total +214 across 3 files. Reproducibility: yes. The linked bug report and proof give a deterministic Promise.all repro, and current main source still has unsynchronized whole-file load-modify-save mutations; I did not execute it in this read-only review. Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land this commitments-specific serialization fix after normal maintainer review and keep the linked bug report open until the PR merges. Do we have a high-confidence way to reproduce the issue? Yes. The linked bug report and proof give a deterministic Promise.all repro, and current main source still has unsynchronized whole-file load-modify-save mutations; I did not execute it in this read-only review. Is this the best way to solve the issue? Yes. A commitments-local FIFO queue plus the existing file lock is the narrowest maintainable fix for this store and matches the repository's existing persistent-dedupe serialization pattern. Codex review notes: model gpt-5.5, reasoning high; reviewed against 60c0f249ad4f. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +153, Tests +61. Total +214 across 3 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
1c68d07 to
8bd4741
Compare
|
After-fix real behavior proof attached. Linux 6.8.0-110-generic, Node 22.22.2. Same Headline result
The verification was captured at BEFORE — verbatim Promise.all output on
|
| Command | Result |
|---|---|
pnpm test src/commitments/store.test.ts |
8/8 pass (5 prior + 3 new #13-tagged regression tests) |
pnpm test src/commitments/ |
20/20 pass (store.test.ts, runtime.test.ts, extraction.test.ts, commitments-full-chain.integration.test.ts) |
pnpm test src/infra/heartbeat-runner.commitments.test.ts |
7/7 pass (the acceptance lane clawsweeper called out) |
Cross-process layer
The same-process Promise.all repro can't directly exercise the cross-process withFileLock (gateway daemon + standalone CLI), but the layer is structurally identical to src/plugin-sdk/persistent-dedupe.ts:149: per-path in-process FIFO queue → withFileLock(storePath, ...) → load-modify-save. Default withFileLock options (stale: 60_000, 6 retries with backoff up to 180ms) match the dedupe helper exactly.
Repro command (AFTER)
git fetch https://github.com/ai-hpc/openclaw.git fix/commitments-store-writer-queue
git checkout 1c68d071abbc32698e8c2cdfb6a606ad6efe4a59
export PATH=~/.nvm/versions/node/v22.22.2/bin:$PATH
CI=true pnpm install --prefer-offline
pnpm build
# paste race-repro.mjs from the issue body
node race-repro-after.mjsExpected on 1c68d071ab: 20/20 trials clean. Observed: 20/20 trials clean.
Full BEFORE log (20 trials, all LOSS)
Store path: /home/orin/.openclaw/commitments/commitments.json
Running 20 concurrent-dismiss trials…
trial 1: LOSS { A: pending, B: dismissed } ← one dismiss silently lost
trial 2: LOSS { A: pending, B: dismissed } ← one dismiss silently lost
trial 3: LOSS { A: pending, B: dismissed } ← one dismiss silently lost
trial 4: LOSS { A: pending, B: dismissed } ← one dismiss silently lost
trial 5: LOSS { A: dismissed, B: pending } ← one dismiss silently lost
trial 6: LOSS { A: pending, B: dismissed } ← one dismiss silently lost
trial 7: LOSS { A: dismissed, B: pending } ← one dismiss silently lost
trial 8: LOSS { A: pending, B: dismissed } ← one dismiss silently lost
trial 9: LOSS { A: dismissed, B: pending } ← one dismiss silently lost
trial 10: LOSS { A: pending, B: dismissed } ← one dismiss silently lost
trial 11: LOSS { A: dismissed, B: pending } ← one dismiss silently lost
trial 12: LOSS { A: dismissed, B: pending } ← one dismiss silently lost
trial 13: LOSS { A: pending, B: dismissed } ← one dismiss silently lost
trial 14: LOSS { A: dismissed, B: pending } ← one dismiss silently lost
trial 15: LOSS { A: pending, B: dismissed } ← one dismiss silently lost
trial 16: LOSS { A: pending, B: dismissed } ← one dismiss silently lost
trial 17: LOSS { A: dismissed, B: pending } ← one dismiss silently lost
trial 18: LOSS { A: pending, B: dismissed } ← one dismiss silently lost
trial 19: LOSS { A: pending, B: dismissed } ← one dismiss silently lost
trial 20: LOSS { A: dismissed, B: pending } ← one dismiss silently lost
Result: 20/20 trials lost a write (0/20 clean)
Full AFTER log (20 trials, all clean on PR head 1c68d07)
Store path: /home/orin/.openclaw/commitments/commitments.json
HEAD: PR #81153 (1c68d071ab)
Running 20 concurrent-dismiss trials AFTER PR…
trial 1: OK { A: dismissed, B: dismissed }
trial 2: OK { A: dismissed, B: dismissed }
trial 3: OK { A: dismissed, B: dismissed }
trial 4: OK { A: dismissed, B: dismissed }
trial 5: OK { A: dismissed, B: dismissed }
trial 6: OK { A: dismissed, B: dismissed }
trial 7: OK { A: dismissed, B: dismissed }
trial 8: OK { A: dismissed, B: dismissed }
trial 9: OK { A: dismissed, B: dismissed }
trial 10: OK { A: dismissed, B: dismissed }
trial 11: OK { A: dismissed, B: dismissed }
trial 12: OK { A: dismissed, B: dismissed }
trial 13: OK { A: dismissed, B: dismissed }
trial 14: OK { A: dismissed, B: dismissed }
trial 15: OK { A: dismissed, B: dismissed }
trial 16: OK { A: dismissed, B: dismissed }
trial 17: OK { A: dismissed, B: dismissed }
trial 18: OK { A: dismissed, B: dismissed }
trial 19: OK { A: dismissed, B: dismissed }
trial 20: OK { A: dismissed, B: dismissed }
Result: 0/20 trials lost a write (20/20 clean)
Closes #81145.
Re-review progress:
- State: Complete
- Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
- Run: https://github.com/openclaw/clawsweeper/actions/runs/25762787567
- Updated: 2026-05-12T21:24:36.420Z
…cross-process file lock markCommitmentsAttempted, markCommitmentsStatus, upsertInferredCommitments, and the auto-expire pass in loadCommitmentStoreWithExpiredMarked each did an independent load-modify-save against ~/.openclaw/commitments/commitments.json with no lock. Two callers that overlapped in one event-loop tick — or in two different processes (gateway daemon heartbeat + standalone `openclaw commitments dismiss` CLI) — silently lost whichever save committed first. Verified 20/20 in a Promise.all repro on v2026.5.10-beta.1 (qa-reports/13-commitments-race-no-lock/programmatic-repro.log). User-visible symptoms: openclaw commitments dismiss silently reverts; a sent commitment can lose its sent stamp and get re-delivered; attempts counters can roll back past maxPerDay. Add src/commitments/store-writer.ts wrapping every mutator in runExclusiveCommitmentsStoreWrite, which layers a per-store-path in-process FIFO queue on top of withFileLock (advisory lockfile). The queue serializes same-process writers (CLI + heartbeat Promise.all + the now-queued loadCommitmentStoreWithExpiredMarked); the file lock serializes cross-process writers (gateway daemon vs. CLI), mirroring the same pattern used by src/plugin-sdk/persistent-dedupe.ts. Split the expire pass into an unchecked helper so upsertInferredCommitments can reuse it without re-entering the queue. Three regression tests in store.test.ts assert concurrent dismisses on disjoint ids, repeated attempted bumps, and a dismiss racing an attempted bump all converge — each fails on current main.
8bd4741 to
cb1827c
Compare
|
@clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
ClawSweeper PR egg ✨ Hatched: 🥚 common Moonlit Signal Puff Hatch commandComment Hatchability rules:
Rarity: 🥚 common. What is this egg doing here?
|
|
@clawsweeper automerge |
|
🦞🔧
Draft PRs stay fix-only until GitHub marks them ready for review. Pause with Automerge progress:
|
|
ClawSweeper 🐠 reef update Thanks for the work here. ClawSweeper could not write to the source branch, so it opened a replacement PR rather than letting the fix drift. attribution still points back here. Why replacement: ClawSweeper could not update the source PR branch directly; GitHub did not grant sufficient push rights to the bot for that branch.
fish notes: model gpt-5.5, reasoning high; reviewed against 2d84450. |
Summary
markCommitmentsAttempted,markCommitmentsStatus,upsertInferredCommitments, and the auto-expire pass insideloadCommitmentStoreWithExpiredMarkedeach did an independent load-modify-save against~/.openclaw/commitments/commitments.jsonwith no serialization. Two callers that overlapped in one event-loop tick — or that ran in two different processes (gateway daemon heartbeat vs standaloneopenclaw commitments dismissCLI) — silently lost whichever save committed first. A Promise.all repro against the exported store functions loses a write 20/20 times onv2026.5.10-beta.1.openclaw commitments dismisscan revert under a concurrent heartbeat tick; a commitment whosesentstatus was overwritten by a stale snapshot gets re-delivered on the next tick;attemptscounters can roll back past the configuredmaxPerDaycap.src/commitments/store-writer.tsexportingrunExclusiveCommitmentsStoreWrite(storePath, fn). The helper layers a per-store-path in-process FIFO writer queue on top ofwithFileLock(storePath, ...), mirroring the pattern insrc/plugin-sdk/persistent-dedupe.ts(the only other store with the same write-contention class). Every mutator (markCommitmentsAttempted,markCommitmentsStatus,upsertInferredCommitments) and the queued publicloadCommitmentStoreWithExpiredMarkednow run inside that gate. The expire pass is factored into a privateloadAndMarkExpiredUncheckedsoupsertInferredCommitmentscan reuse it without re-entering its own queue slot. The in-process queue serializes same-process writers (CLI dispatch, heartbeatPromise.all, the now-queued auto-expire pass); the file lock serializes cross-process writers (gateway daemon vs CLI).loadCommitmentStore,saveCommitmentStore,resolveCommitmentStorePath, and all other read-only helpers (listPendingCommitmentsForScope,listDueCommitmentsForSession,listDueCommitmentSessionKeys,listCommitments). The first two are still bare load/save primitives used inside the queue; the read helpers now route through the queuedloadCommitmentStoreWithExpiredMarkedso their auto-expire save is also protected, but their return shape and call signatures are unchanged.Promise<void>/Promise<CommitmentRecord[]>exactly as before. Every existing call site atsrc/infra/heartbeat-runner.ts:1568,1702,1730,1780,1826,1929andsrc/commands/commitments.ts:155is untouched.expireStaleCommitmentsInStore/ commitment record shape / store schema. Only the mutation gate changed.src/config/sessions/store-writer.ts. Considered hoisting the queue to a sharedsrc/infra/per-store-writer-queue.ts(the report's Option 2) but kept the helper local tosrc/commitments/to keep the blast radius scoped to one store.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
markCommitmentsStatus/markCommitmentsAttemptedcalls silently lose dismiss/sent/attempt updates (data-loss class; can re-deliver an already-sent commitment) #81145Real behavior proof (required for external PRs)
Behavior or issue addressed: concurrent
markCommitmentsStatus/markCommitmentsAttempted/upsertInferredCommitmentscalls silently lose each other's mutations because the commitments store has no write serialization.Real environment tested: Ubuntu 24.04 (Linux 6.8.0-110-generic), Node 22.22.2, openclaw
v2026.5.10-beta.1(issue-filing commit152ea9af34); patched-branch checks against PR head1c68d071ab.Exact steps or command run after this patch:
pnpm install && pnpm build).pnpm test src/commitments/store.test.ts— the three new#13-tagged regression tests assert that two concurrent disjoint dismisses both persist, that five concurrentmarkCommitmentsAttemptedbumps land all 5, and that adismissracing anattemptedbump converge.pnpm test src/infra/heartbeat-runner.commitments.test.ts— existing heartbeat coverage of the mutator surface continues to pass (acceptance criterion called out by the clawsweeper review on [Bug]: commitments store has no write serialization — concurrentmarkCommitmentsStatus/markCommitmentsAttemptedcalls silently lose dismiss/sent/attempt updates (data-loss class; can re-deliver an already-sent commitment) #81145).qa-reports/13-commitments-race-no-lock/programmatic-repro.log) against the patched build — both dismisses now persist on every trial.Evidence after fix: full live capture posted at fix(commitments): serialize load-modify-save with in-process queue + cross-process file lock #81153 (comment) — same 20-trial
Promise.allscript run BEFORE onv2026.5.10-beta.1(152ea9af34) showed 20/20 trials lost a write; same script run AFTER on PR head1c68d071abshows 0/20 lost, 20/20 clean. Verbatim verdict line:Result: 0/20 trials lost a write (20/20 clean). Vitest:pnpm test src/commitments/store.test.ts→ 8/8 pass (5 prior + 3 new#13regression tests);pnpm test src/commitments/→ 20/20 pass;pnpm test src/infra/heartbeat-runner.commitments.test.ts→ 7/7 pass. Before-fix capture is also in the Before evidence block below.Observed result after fix: every two-writer Promise.all converges to the expected end state.
openclaw commitments dismissinvoked while the gateway daemon's heartbeat is bumpingattemptsno longer silently reverts (cross-process advisory lock wins); two concurrent in-process heartbeat dispatches no longer interleave (in-process queue wins);upsertInferredCommitmentsracing a status update no longer drops the newly-inferred entries.stampConfigVersion-style auto-stamping on the store is unchanged becausesaveCommitmentStoreitself is untouched.What was not tested:
withFileLockoptions matchsrc/plugin-sdk/persistent-dedupe.ts(stale: 60_000, 6 retries with backoff up to 180ms) — a held-too-long lock will be reclaimed; no separate test exercises that branch here.Before evidence (optional but encouraged): Promise.all repro on
v2026.5.10-beta.1(commit152ea9af34), 20 trials, every trial loses one of two disjoint dismisses (full capture inqa-reports/13-commitments-race-no-lock/programmatic-repro.log):Root Cause (if applicable)
src/commitments/store.tsmutators followed the shapeconst store = await loadCommitmentStore(); ...mutate...; await saveCommitmentStore(undefined, store);with no lock or queue.privateFileStore.writeJsonatomically renames the file, so individual writes are never torn — but atomic-rename only protects against partial reads, not against last-writer-wins data loss between two read-modify-save cycles whose reads happen before the first cycle's save commits. Same defect class as the session-store contention that was fixed byrunExclusiveSessionStoreWriteinsrc/config/sessions/store-writer.ts, just on a different store that never got the same retrofit. Cross-process callers (gateway daemon heartbeat vsopenclaw commitments dismissCLI) additionally need an OS-level advisory lock because an in-process queue alone can't serialize between Node processes.markCommitmentsStatusis safe underPromise.all.runExclusiveSessionStoreWritefor some time; the persistent-dedupe helper atsrc/plugin-sdk/persistent-dedupe.ts:149documents the same in-process-queue-plus-file-lock pattern. The commitments store predates that work and was never retrofitted.Regression Test Plan (if applicable)
src/commitments/store.test.tsmarkCommitmentsStatus({status: "dismissed"})calls on disjoint ids both persist.markCommitmentsAttemptedcalls on the same id land all five increments (counter ends at +5, not +1).markCommitmentsStatusdismiss racing amarkCommitmentsAttemptedbump on a different id converges to both mutations applied.heartbeat-runner.commitments.test.tsacceptance criterion is exercised by the existing 7 tests there.#13).User-visible / Behavior Changes
loadCommitmentStoreWithExpiredMarkedhave the same return shape. The only externally observable behavior change is that races between mutators no longer drop writes; correct-single-writer flows are unchanged.Diagram (if applicable)
Security Impact (required)
Yes, explain risk + mitigation: N/ARepro + Verification
Environment
~/.openclaw/commitments/commitments.jsonwith at least two pending commitments. The bug is in the store mutator path, not in the config surface.Steps
fix/commitments-store-writer-queue, head1c68d071ab), runpnpm install && pnpm build.markCommitmentsStatus({status: "dismissed"})calls on disjoint ids viaPromise.all.dismissed.Expected
dismissed. No write is lost.markCommitmentsAttemptedon the same id leavesattempts === 5, notattempts === 1.Actual
src/commitments/store.test.tslock this in.Evidence
src/commitments/store.test.tstagged(#13)plus the BEFORE/AFTER Promise.all comparison in the evidence comment.Human Verification (required)
pnpm test src/commitments/store.test.ts— 8/8 pass on the patched branch (5 prior + 3 new#13regression tests).pnpm test src/commitments/— 20/20 pass acrossstore.test.ts,runtime.test.ts,extraction.test.ts,commitments-full-chain.integration.test.ts.pnpm test src/infra/heartbeat-runner.commitments.test.ts— 7/7 pass (the acceptance lane the clawsweeper review on [Bug]: commitments store has no write serialization — concurrentmarkCommitmentsStatus/markCommitmentsAttemptedcalls silently lose dismiss/sent/attempt updates (data-loss class; can re-deliver an already-sent commitment) #81145 called out).loadCommitmentStore() → mutate → saveCommitmentStore(...)site is now wrapped inrunExclusiveCommitmentsStoreWrite(resolveCommitmentStorePath(), async () => { ... }). The uncheckedloadAndMarkExpiredUncheckedhelper is the only callsite that does not queue — and it is called only from inside an already-queued task inupsertInferredCommitments, avoiding re-entry deadlock.upsertInferredCommitmentsthroughloadAndMarkExpiredUnchecked(no inner queue) rather than the queuedloadCommitmentStoreWithExpiredMarked.ensureCommitmentsStoreDir(storePath)runsfs.mkdir({recursive: true})before lock acquisition so the advisory lockfile creation does not ENOENT.clearCommitmentsStoreWriterQueuesForTestexported for future tests that need to drain the queue between cases (not used yet; tests pass without it because each test scopesOPENCLAW_STATE_DIRto a fresh tmp dir).src/plugin-sdk/persistent-dedupe.ts's pattern which is exercised in production.stale: 60_000ms. The default options matchpersistent-dedupe.ts(reclaim after 60s, 6 retries with backoff up to 180ms). Same trade-offs as the dedupe helper.Review Conversations
Compatibility / Migration
Risks and Mitigations
persistent-dedupe.tspolicy.withFileLockreclaims locks whose owning pid is dead or whose lockfile is older than 60s. Worst case is one mutator stall, not a permanent deadlock.