Skip to content

fix(commitments): serialize load-modify-save with in-process queue + cross-process file lock#81153

Closed
ai-hpc wants to merge 1 commit into
openclaw:mainfrom
ai-hpc:fix/commitments-store-writer-queue
Closed

fix(commitments): serialize load-modify-save with in-process queue + cross-process file lock#81153
ai-hpc wants to merge 1 commit into
openclaw:mainfrom
ai-hpc:fix/commitments-store-writer-queue

Conversation

@ai-hpc

@ai-hpc ai-hpc commented May 12, 2026

Copy link
Copy Markdown
Member

Summary

  • Problem: markCommitmentsAttempted, markCommitmentsStatus, upsertInferredCommitments, and the auto-expire pass inside loadCommitmentStoreWithExpiredMarked each did an independent load-modify-save against ~/.openclaw/commitments/commitments.json with no serialization. Two callers that overlapped in one event-loop tick — or that ran in two different processes (gateway daemon heartbeat vs standalone openclaw commitments dismiss CLI) — silently lost whichever save committed first. A Promise.all repro against the exported store functions loses a write 20/20 times on v2026.5.10-beta.1.
  • Why it matters: silent data-loss class. openclaw commitments dismiss can revert under a concurrent heartbeat tick; a commitment whose sent status was overwritten by a stale snapshot gets re-delivered on the next tick; attempts counters can roll back past the configured maxPerDay cap.
  • What changed: new src/commitments/store-writer.ts exporting runExclusiveCommitmentsStoreWrite(storePath, fn). The helper layers a per-store-path in-process FIFO writer queue on top of withFileLock(storePath, ...), mirroring the pattern in src/plugin-sdk/persistent-dedupe.ts (the only other store with the same write-contention class). Every mutator (markCommitmentsAttempted, markCommitmentsStatus, upsertInferredCommitments) and the queued public loadCommitmentStoreWithExpiredMarked now run inside that gate. The expire pass is factored into a private loadAndMarkExpiredUnchecked so upsertInferredCommitments can reuse it without re-entering its own queue slot. The in-process queue serializes same-process writers (CLI dispatch, heartbeat Promise.all, the now-queued auto-expire pass); the file lock serializes cross-process writers (gateway daemon vs CLI).
  • What did NOT change (scope boundary):
    • loadCommitmentStore, saveCommitmentStore, resolveCommitmentStorePath, and all other read-only helpers (listPendingCommitmentsForScope, listDueCommitmentsForSession, listDueCommitmentSessionKeys, listCommitments). The first two are still bare load/save primitives used inside the queue; the read helpers now route through the queued loadCommitmentStoreWithExpiredMarked so their auto-expire save is also protected, but their return shape and call signatures are unchanged.
    • Mutator signatures. All three mutators still return Promise<void> / Promise<CommitmentRecord[]> exactly as before. Every existing call site at src/infra/heartbeat-runner.ts:1568,1702,1730,1780,1826,1929 and src/commands/commitments.ts:155 is untouched.
    • expireStaleCommitmentsInStore / commitment record shape / store schema. Only the mutation gate changed.
    • The session-store writer queue at src/config/sessions/store-writer.ts. Considered hoisting the queue to a shared src/infra/per-store-writer-queue.ts (the report's Option 2) but kept the helper local to src/commitments/ to keep the blast radius scoped to one store.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Real behavior proof (required for external PRs)

  • Behavior or issue addressed: concurrent markCommitmentsStatus / markCommitmentsAttempted / upsertInferredCommitments calls silently lose each other's mutations because the commitments store has no write serialization.

  • Real environment tested: Ubuntu 24.04 (Linux 6.8.0-110-generic), Node 22.22.2, openclaw v2026.5.10-beta.1 (issue-filing commit 152ea9af34); patched-branch checks against PR head 1c68d071ab.

  • Exact steps or command run after this patch:

    1. Apply this PR locally and build (pnpm install && pnpm build).
    2. Run pnpm test src/commitments/store.test.ts — the three new #13-tagged regression tests assert that two concurrent disjoint dismisses both persist, that five concurrent markCommitmentsAttempted bumps land all 5, and that a dismiss racing an attempted bump converge.
    3. Run pnpm test src/infra/heartbeat-runner.commitments.test.ts — existing heartbeat coverage of the mutator surface continues to pass (acceptance criterion called out by the clawsweeper review on [Bug]: commitments store has no write serialization — concurrent markCommitmentsStatus / markCommitmentsAttempted calls silently lose dismiss/sent/attempt updates (data-loss class; can re-deliver an already-sent commitment) #81145).
    4. Run the Promise.all repro from the issue body (saved at qa-reports/13-commitments-race-no-lock/programmatic-repro.log) against the patched build — both dismisses now persist on every trial.
  • Evidence after fix: full live capture posted at fix(commitments): serialize load-modify-save with in-process queue + cross-process file lock #81153 (comment) — same 20-trial Promise.all script run BEFORE on v2026.5.10-beta.1 (152ea9af34) showed 20/20 trials lost a write; same script run AFTER on PR head 1c68d071ab shows 0/20 lost, 20/20 clean. Verbatim verdict line: Result: 0/20 trials lost a write (20/20 clean). Vitest: pnpm test src/commitments/store.test.ts → 8/8 pass (5 prior + 3 new #13 regression tests); pnpm test src/commitments/ → 20/20 pass; pnpm test src/infra/heartbeat-runner.commitments.test.ts → 7/7 pass. Before-fix capture is also in the Before evidence block below.

  • Observed result after fix: every two-writer Promise.all converges to the expected end state. openclaw commitments dismiss invoked while the gateway daemon's heartbeat is bumping attempts no longer silently reverts (cross-process advisory lock wins); two concurrent in-process heartbeat dispatches no longer interleave (in-process queue wins); upsertInferredCommitments racing a status update no longer drops the newly-inferred entries. stampConfigVersion-style auto-stamping on the store is unchanged because saveCommitmentStore itself is untouched.

  • What was not tested:

    • The session-store and cron-store mutation paths. They are out of scope; this PR only touches commitments.
    • Cross-process behavior under a stuck advisory lock older than the 60s stale window. The default withFileLock options match src/plugin-sdk/persistent-dedupe.ts (stale: 60_000, 6 retries with backoff up to 180ms) — a held-too-long lock will be reclaimed; no separate test exercises that branch here.
  • Before evidence (optional but encouraged): Promise.all repro on v2026.5.10-beta.1 (commit 152ea9af34), 20 trials, every trial loses one of two disjoint dismisses (full capture in qa-reports/13-commitments-race-no-lock/programmatic-repro.log):

    trial 1: A=pending B=dismissed
    trial 2: A=pending B=dismissed
    trial 3: A=pending B=dismissed
    …
    trial 20: A=dismissed B=pending
    
    20/20 trials lost a write
    

Root Cause (if applicable)

  • Root cause: src/commitments/store.ts mutators followed the shape const store = await loadCommitmentStore(); ...mutate...; await saveCommitmentStore(undefined, store); with no lock or queue. privateFileStore.writeJson atomically renames the file, so individual writes are never torn — but atomic-rename only protects against partial reads, not against last-writer-wins data loss between two read-modify-save cycles whose reads happen before the first cycle's save commits. Same defect class as the session-store contention that was fixed by runExclusiveSessionStoreWrite in src/config/sessions/store-writer.ts, just on a different store that never got the same retrofit. Cross-process callers (gateway daemon heartbeat vs openclaw commitments dismiss CLI) additionally need an OS-level advisory lock because an in-process queue alone can't serialize between Node processes.
  • Missing detection / guardrail: existing commitments tests cover enablement, delivery limits, expiry, legacy-field cleanup, and listing — none exercise concurrent writers. No contract test in the repo asserts that markCommitmentsStatus is safe under Promise.all.
  • Contributing context: the session-store has had runExclusiveSessionStoreWrite for some time; the persistent-dedupe helper at src/plugin-sdk/persistent-dedupe.ts:149 documents the same in-process-queue-plus-file-lock pattern. The commitments store predates that work and was never retrofitted.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
  • Target test or file: src/commitments/store.test.ts
  • Scenario the test should lock in:
    • Two concurrent markCommitmentsStatus({status: "dismissed"}) calls on disjoint ids both persist.
    • Five concurrent markCommitmentsAttempted calls on the same id land all five increments (counter ends at +5, not +1).
    • A markCommitmentsStatus dismiss racing a markCommitmentsAttempted bump on a different id converges to both mutations applied.
  • Why this is the smallest reliable guardrail: it pins the only contract the new gate is supposed to enforce (concurrent same-process writers serialize correctly) against the exact code path the bug-report repro hit. The bot's heartbeat-runner.commitments.test.ts acceptance criterion is exercised by the existing 7 tests there.
  • Existing test that already covers this: none — the existing commitments tests cover correctness of single-writer behavior, not concurrent writers.
  • If no new test is added, why not: N/A; three regression tests are added in this PR (all tagged #13).

User-visible / Behavior Changes

  • No external API change. The three mutators and the queued loadCommitmentStoreWithExpiredMarked have the same return shape. The only externally observable behavior change is that races between mutators no longer drop writes; correct-single-writer flows are unchanged.

Diagram (if applicable)

Before:
caller A ─→ load(S0) ─→ mutate(A) ─→ save(S1_A) ─────→ disk: {A:dismissed, B:pending}
caller B ─────→ load(S0) ─→ mutate(B) ─→ save(S1_B) ─→ disk: {A:pending,   B:dismissed}   ← A's dismiss is gone

After (single process):
caller A → [queue slot 1] → withFileLock → load → mutate(A) → save → release ─→ disk: {A:dismissed, B:pending}
caller B → [queue slot 2] ───────────────────── waits ──────────────────────── → withFileLock → load → mutate(B) → save → disk: {A:dismissed, B:dismissed}

After (two processes, e.g. gateway daemon + CLI):
proc A → [queue A] → withFileLock(commitments.json) → load → mutate → save → release lock ─→ disk: {A:dismissed, B:pending}
proc B → [queue B] ──────────── advisory lock retries (8ms..180ms, ×6) ──────── → load → mutate → save → disk: {A:dismissed, B:dismissed}

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Ubuntu 24.04 (Linux 6.8.0-110-generic)
  • Runtime/container: Node 22.22.2
  • Model/provider: Not applicable
  • Integration/channel (if any): Not applicable
  • Relevant config (redacted): any ~/.openclaw/commitments/commitments.json with at least two pending commitments. The bug is in the store mutator path, not in the config surface.

Steps

  1. On a fresh checkout of this branch (fix/commitments-store-writer-queue, head 1c68d071ab), run pnpm install && pnpm build.
  2. Seed a temp commitments store with two pending commitments.
  3. Issue two concurrent markCommitmentsStatus({status: "dismissed"}) calls on disjoint ids via Promise.all.
  4. Read the store back; both should be dismissed.

Expected

  • Both commitments end as dismissed. No write is lost.
  • Repeating five concurrent markCommitmentsAttempted on the same id leaves attempts === 5, not attempts === 1.

Actual

  • Before this PR: 20/20 trials lose one of the two dismisses (see Before evidence above).
  • After this PR: each trial converges to the expected state; the three new regression tests in src/commitments/store.test.ts lock this in.

Evidence

  • Failing test/log before + passing after — three new vitest assertions in src/commitments/store.test.ts tagged (#13) plus the BEFORE/AFTER Promise.all comparison in the evidence comment.
  • Trace/log snippets — full verbatim BEFORE (20/20 LOSS) and AFTER (0/20 LOSS, 20/20 OK) captures attached at fix(commitments): serialize load-modify-save with in-process queue + cross-process file lock #81153 (comment).
  • Screenshot/recording — N/A; live terminal capture is the appropriate medium for this defect and is attached above.
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios:
  • Edge cases checked:
    • Re-entry into the queue is avoided by routing upsertInferredCommitments through loadAndMarkExpiredUnchecked (no inner queue) rather than the queued loadCommitmentStoreWithExpiredMarked.
    • First-ever mutator on a fresh state dir: ensureCommitmentsStoreDir(storePath) runs fs.mkdir({recursive: true}) before lock acquisition so the advisory lockfile creation does not ENOENT.
    • Test cleanup: clearCommitmentsStoreWriterQueuesForTest exported for future tests that need to drain the queue between cases (not used yet; tests pass without it because each test scopes OPENCLAW_STATE_DIR to a fresh tmp dir).
  • What I did not verify:
    • Cross-process race behavior under a real gateway daemon + CLI two-process repro on this host; the in-test repro covers same-process, and the cross-process layer is structurally identical to src/plugin-sdk/persistent-dedupe.ts's pattern which is exercised in production.
    • Behavior when the advisory lockfile is held longer than stale: 60_000 ms. The default options match persistent-dedupe.ts (reclaim after 60s, 6 retries with backoff up to 180ms). Same trade-offs as the dedupe helper.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes (mutator signatures unchanged; only the internal serialization gate changes).
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: a held advisory lock from a crashed prior process could delay a subsequent mutator by up to the stale window (60s).
    • Mitigation: matches the existing persistent-dedupe.ts policy. withFileLock reclaims locks whose owning pid is dead or whose lockfile is older than 60s. Worst case is one mutator stall, not a permanent deadlock.
  • Risk: layered queue-plus-lock latency adds overhead vs the prior bare load-save.
    • Mitigation: in the contention-free path the queue is a single resolved promise and the file lock acquires immediately. Steady-state cost is one extra fs operation per mutation. Heartbeat tick budget is well above this.

@openclaw-barnacle openclaw-barnacle Bot added size: M proof: supplied External PR includes structured after-fix real behavior proof. labels May 12, 2026
@clawsweeper

clawsweeper Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed May 24, 2026, 10:04 PM ET / 02:04 UTC.

Summary
The branch adds a commitments-store writer queue/file-lock helper, wraps commitments mutators and auto-expiry saves with it, and adds concurrency regression tests.

PR surface: Source +153, Tests +61. Total +214 across 3 files.

Reproducibility: yes. The linked bug report and proof give a deterministic Promise.all repro, and current main source still has unsynchronized whole-file load-modify-save mutations; I did not execute it in this read-only review.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • none

Risk before merge

  • The PR does not include a direct two-process gateway-plus-CLI run; the same-process repro, existing lock implementation, and matching persistent-dedupe pattern cover the core lost-update class.

Maintainer options:

  1. Decide the mitigation before merge
    Land this commitments-specific serialization fix after normal maintainer review and keep the linked bug report open until the PR merges.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge
No automated repair is needed; the remaining action is normal maintainer review, optional cross-process proof judgment, and landing if maintainers accept the patch.

Security
Cleared: The diff touches local commitments storage and tests only, with no dependency, workflow, network, secret, permission, or command-execution surface change.

Review details

Best possible solution:

Land this commitments-specific serialization fix after normal maintainer review and keep the linked bug report open until the PR merges.

Do we have a high-confidence way to reproduce the issue?

Yes. The linked bug report and proof give a deterministic Promise.all repro, and current main source still has unsynchronized whole-file load-modify-save mutations; I did not execute it in this read-only review.

Is this the best way to solve the issue?

Yes. A commitments-local FIFO queue plus the existing file lock is the narrowest maintainable fix for this store and matches the repository's existing persistent-dedupe serialization pattern.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 60c0f249ad4f.

Label changes

Label changes:

  • add P1: The PR fixes a real silent persistence-loss path that can revert dismiss/sent/attempt updates and re-deliver commitments.
  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The contributor supplied before/after live terminal output showing 20/20 lost writes before and 0/20 after, plus targeted test results on the patched branch.

Label justifications:

  • P1: The PR fixes a real silent persistence-loss path that can revert dismiss/sent/attempt updates and re-deliver commitments.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The contributor supplied before/after live terminal output showing 20/20 lost writes before and 0/20 after, plus targeted test results on the patched branch.
  • proof: sufficient: Contributor real behavior proof is sufficient. The contributor supplied before/after live terminal output showing 20/20 lost writes before and 0/20 after, plus targeted test results on the patched branch.
Evidence reviewed

PR surface:

Source +153, Tests +61. Total +214 across 3 files.

View PR surface stats
Area Files Added Removed Net
Source 2 229 76 +153
Tests 1 61 0 +61
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 3 290 76 +214

What I checked:

  • Current main still has unsynchronized commitments writes: Current main reads, mutates, and saves the whole commitments store in upsertInferredCommitments, markCommitmentsAttempted, and markCommitmentsStatus without a queue or file lock, matching the linked lost-update report. (src/commitments/store.ts:365, 60c0f249ad4f)
  • PR adds a per-store-path writer gate: The proposed helper serializes queued tasks per store path, ensures the commitments directory exists, and runs each task under withFileLock with the same retry/stale tuning as the existing dedupe pattern. (src/commitments/store-writer.ts:102, b240beeebe77)
  • PR wraps the mutating commitments paths: The virtual merge routes auto-expiry, inferred upserts, attempts bumps, and status changes through runExclusiveCommitmentsStoreWrite(resolveCommitmentStorePath(), ...) while keeping the public function signatures unchanged. (src/commitments/store.ts:394, b240beeebe77)
  • Regression tests cover the same-process race class: The PR adds tests for concurrent disjoint dismisses, concurrent attempt increments, and a dismiss racing an attempt bump, which directly exercise the stale-snapshot overwrite class from the report. (src/commitments/store.test.ts:238, b240beeebe77)
  • Dependency contract supports queue plus lock: withFileLock is process-local re-entrant, and persistent-dedupe documents the same reason a per-file in-process queue is still needed around whole-file read-modify-write cycles. (src/plugin-sdk/persistent-dedupe.ts:157, 60c0f249ad4f)
  • File-lock implementation inspected: The lock implementation acquires a .lock sidecar with stale recovery, retries, and allowReentrant: true, so the PR's same-process FIFO queue is the part that prevents re-entrant same-process stale reads. (src/plugin-sdk/file-lock.ts:77, 60c0f249ad4f)

Likely related people:

  • vignesh07: GitHub path history shows vignesh07 authored the inferred commitments store and follow-up safety/coverage commits that introduced and shaped the current commitments surface. (role: feature-history owner; confidence: high; commits: 8e4035d09a88, 95bf450dc973, aecde2b3ac8b; files: src/commitments/store.ts, src/commitments/store.test.ts)
  • vincentkoc: Recent GitHub path history and local shallow blame show vincentkoc touched current commitments-store validation and current checkout lines in the affected file. (role: recent area contributor; confidence: medium; commits: ba48d162af5a, e4332f7cff2b; files: src/commitments/store.ts, src/commitments/store.test.ts)
  • steipete: GitHub path history shows steipete recently changed adjacent state-store serialization patterns, including the session-store writer queue and plugin dedupe/file-lock surfaces that this PR mirrors. (role: adjacent state-store contributor; confidence: medium; commits: b4437047f477, 59807efa3126, 694ca50e9775; files: src/config/sessions/store-writer.ts, src/plugin-sdk/persistent-dedupe.ts, src/plugin-sdk/file-lock.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@ai-hpc ai-hpc force-pushed the fix/commitments-store-writer-queue branch from 1c68d07 to 8bd4741 Compare May 12, 2026 20:45
@ai-hpc

ai-hpc commented May 12, 2026

Copy link
Copy Markdown
Member Author

After-fix real behavior proof attached. Linux 6.8.0-110-generic, Node 22.22.2. Same race-repro.mjs Promise.all script from the issue body, against the same local store path, run immediately before and after the patch.

Headline result

Trials Lost a write Clean
BEFORE (v2026.5.10-beta.1, HEAD 152ea9af34) 20 20/20 0/20
AFTER (PR head 1c68d071ab) 20 0/20 20/20 ✓

The verification was captured at 1c68d071ab; the current PR head 8bd474165f is a pure comment-trim amend of the same commit (10-line file-header reduction in store-writer.ts, no functional change), so the AFTER capture still applies.

BEFORE — verbatim Promise.all output on v2026.5.10-beta.1

Store path: /home/orin/.openclaw/commitments/commitments.json
Running 20 concurrent-dismiss trials…

trial 1: LOSS { A: pending, B: dismissed }  ← one dismiss silently lost
trial 2: LOSS { A: pending, B: dismissed }  ← one dismiss silently lost
…
trial 20: LOSS { A: dismissed, B: pending }  ← one dismiss silently lost

Result: 20/20 trials lost a write (0/20 clean)

AFTER — verbatim Promise.all output on PR head 1c68d071ab

Store path: /home/orin/.openclaw/commitments/commitments.json
HEAD: PR #81153 (1c68d071ab)
Running 20 concurrent-dismiss trials AFTER PR…

trial 1: OK   { A: dismissed, B: dismissed }
trial 2: OK   { A: dismissed, B: dismissed }
…
trial 20: OK   { A: dismissed, B: dismissed }

Result: 0/20 trials lost a write (20/20 clean)

Vitest run on PR head

Command Result
pnpm test src/commitments/store.test.ts 8/8 pass (5 prior + 3 new #13-tagged regression tests)
pnpm test src/commitments/ 20/20 pass (store.test.ts, runtime.test.ts, extraction.test.ts, commitments-full-chain.integration.test.ts)
pnpm test src/infra/heartbeat-runner.commitments.test.ts 7/7 pass (the acceptance lane clawsweeper called out)

Cross-process layer

The same-process Promise.all repro can't directly exercise the cross-process withFileLock (gateway daemon + standalone CLI), but the layer is structurally identical to src/plugin-sdk/persistent-dedupe.ts:149: per-path in-process FIFO queue → withFileLock(storePath, ...) → load-modify-save. Default withFileLock options (stale: 60_000, 6 retries with backoff up to 180ms) match the dedupe helper exactly.

Repro command (AFTER)

git fetch https://github.com/ai-hpc/openclaw.git fix/commitments-store-writer-queue
git checkout 1c68d071abbc32698e8c2cdfb6a606ad6efe4a59
export PATH=~/.nvm/versions/node/v22.22.2/bin:$PATH
CI=true pnpm install --prefer-offline
pnpm build
# paste race-repro.mjs from the issue body
node race-repro-after.mjs

Expected on 1c68d071ab: 20/20 trials clean. Observed: 20/20 trials clean.

Full BEFORE log (20 trials, all LOSS)
Store path: /home/orin/.openclaw/commitments/commitments.json
Running 20 concurrent-dismiss trials…

trial 1: LOSS { A: pending, B: dismissed }  ← one dismiss silently lost
trial 2: LOSS { A: pending, B: dismissed }  ← one dismiss silently lost
trial 3: LOSS { A: pending, B: dismissed }  ← one dismiss silently lost
trial 4: LOSS { A: pending, B: dismissed }  ← one dismiss silently lost
trial 5: LOSS { A: dismissed, B: pending }  ← one dismiss silently lost
trial 6: LOSS { A: pending, B: dismissed }  ← one dismiss silently lost
trial 7: LOSS { A: dismissed, B: pending }  ← one dismiss silently lost
trial 8: LOSS { A: pending, B: dismissed }  ← one dismiss silently lost
trial 9: LOSS { A: dismissed, B: pending }  ← one dismiss silently lost
trial 10: LOSS { A: pending, B: dismissed }  ← one dismiss silently lost
trial 11: LOSS { A: dismissed, B: pending }  ← one dismiss silently lost
trial 12: LOSS { A: dismissed, B: pending }  ← one dismiss silently lost
trial 13: LOSS { A: pending, B: dismissed }  ← one dismiss silently lost
trial 14: LOSS { A: dismissed, B: pending }  ← one dismiss silently lost
trial 15: LOSS { A: pending, B: dismissed }  ← one dismiss silently lost
trial 16: LOSS { A: pending, B: dismissed }  ← one dismiss silently lost
trial 17: LOSS { A: dismissed, B: pending }  ← one dismiss silently lost
trial 18: LOSS { A: pending, B: dismissed }  ← one dismiss silently lost
trial 19: LOSS { A: pending, B: dismissed }  ← one dismiss silently lost
trial 20: LOSS { A: dismissed, B: pending }  ← one dismiss silently lost

Result: 20/20 trials lost a write (0/20 clean)
Full AFTER log (20 trials, all clean on PR head 1c68d07)
Store path: /home/orin/.openclaw/commitments/commitments.json
HEAD: PR #81153 (1c68d071ab)
Running 20 concurrent-dismiss trials AFTER PR…

trial 1: OK   { A: dismissed, B: dismissed }
trial 2: OK   { A: dismissed, B: dismissed }
trial 3: OK   { A: dismissed, B: dismissed }
trial 4: OK   { A: dismissed, B: dismissed }
trial 5: OK   { A: dismissed, B: dismissed }
trial 6: OK   { A: dismissed, B: dismissed }
trial 7: OK   { A: dismissed, B: dismissed }
trial 8: OK   { A: dismissed, B: dismissed }
trial 9: OK   { A: dismissed, B: dismissed }
trial 10: OK   { A: dismissed, B: dismissed }
trial 11: OK   { A: dismissed, B: dismissed }
trial 12: OK   { A: dismissed, B: dismissed }
trial 13: OK   { A: dismissed, B: dismissed }
trial 14: OK   { A: dismissed, B: dismissed }
trial 15: OK   { A: dismissed, B: dismissed }
trial 16: OK   { A: dismissed, B: dismissed }
trial 17: OK   { A: dismissed, B: dismissed }
trial 18: OK   { A: dismissed, B: dismissed }
trial 19: OK   { A: dismissed, B: dismissed }
trial 20: OK   { A: dismissed, B: dismissed }

Result: 0/20 trials lost a write (20/20 clean)

Closes #81145.

Re-review progress:

…cross-process file lock

markCommitmentsAttempted, markCommitmentsStatus, upsertInferredCommitments, and the auto-expire pass in loadCommitmentStoreWithExpiredMarked each did an independent load-modify-save against ~/.openclaw/commitments/commitments.json with no lock. Two callers that overlapped in one event-loop tick — or in two different processes (gateway daemon heartbeat + standalone `openclaw commitments dismiss` CLI) — silently lost whichever save committed first. Verified 20/20 in a Promise.all repro on v2026.5.10-beta.1 (qa-reports/13-commitments-race-no-lock/programmatic-repro.log). User-visible symptoms: openclaw commitments dismiss silently reverts; a sent commitment can lose its sent stamp and get re-delivered; attempts counters can roll back past maxPerDay. Add src/commitments/store-writer.ts wrapping every mutator in runExclusiveCommitmentsStoreWrite, which layers a per-store-path in-process FIFO queue on top of withFileLock (advisory lockfile). The queue serializes same-process writers (CLI + heartbeat Promise.all + the now-queued loadCommitmentStoreWithExpiredMarked); the file lock serializes cross-process writers (gateway daemon vs. CLI), mirroring the same pattern used by src/plugin-sdk/persistent-dedupe.ts. Split the expire pass into an unchecked helper so upsertInferredCommitments can reuse it without re-entering the queue. Three regression tests in store.test.ts assert concurrent dismisses on disjoint ids, repeated attempted bumps, and a dismiss racing an attempted bump all converge — each fails on current main.
@ai-hpc ai-hpc force-pushed the fix/commitments-store-writer-queue branch from 8bd4741 to cb1827c Compare May 12, 2026 21:17
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 12, 2026
@ai-hpc

ai-hpc commented May 25, 2026

Copy link
Copy Markdown
Member Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. P1 High-priority user-facing bug, regression, or broken workflow. labels May 25, 2026
@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

✨ Hatched: 🥚 common Moonlit Signal Puff

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🥚 common.
Trait: finds missing screenshots.
Image traits: location branch lighthouse; accessory tiny test log scroll; palette moonlit blue and soft silver; mood focused; pose nestled inside a glowing shell; shell woven fiber shell; lighting subtle sparkle highlights; background tiny artifact crates.
Share on X: post this hatch
Copy: My PR egg hatched a 🥚 common Moonlit Signal Puff in ClawSweeper.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@Takhoffman

Copy link
Copy Markdown
Contributor

@clawsweeper automerge

@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

🦞🔧
ClawSweeper automerge is enabled.

Draft PRs stay fix-only until GitHub marks them ready for review. Pause with /clawsweeper stop.

Automerge progress:

  • 2026-05-25 04:16:20 UTC review queued cb1827c9ee71 (queued)

@clawsweeper clawsweeper Bot added the clawsweeper:automerge Maintainer opted this PR into bounded ClawSweeper-reviewed automerge label May 25, 2026
@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper 🐠 reef update

Thanks for the work here. ClawSweeper could not write to the source branch, so it opened a replacement PR rather than letting the fix drift. attribution still points back here.

Why replacement: ClawSweeper could not update the source PR branch directly; GitHub did not grant sufficient push rights to the bot for that branch.
Replacement PR: #86326
Why close: this run explicitly closes the superseded source PR after the credited replacement PR is open, so review continues in one place.
This closeout is intentional for this run: the replacement PR is now the active review lane.
Contributor credit is copied into the replacement PR notes and changelog path.
Co-author credit kept:

fish notes: model gpt-5.5, reasoning high; reviewed against 2d84450.

@clawsweeper clawsweeper Bot closed this May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

clawsweeper:automerge Maintainer opted this PR into bounded ClawSweeper-reviewed automerge P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: M status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

2 participants