fix(agents): file-scoped prompt-window guard for same-session embedded races#86067
fix(agents): file-scoped prompt-window guard for same-session embedded races#86067ubehera wants to merge 2 commits into
Conversation
|
Codex review: needs maintainer review before merge. Reviewed May 25, 2026, 8:17 PM ET / 00:17 UTC. Summary PR surface: Source +128, Tests +267. Total +395 across 3 files. Reproducibility: yes. The PR body provides a concrete before/after harness for two real controllers on one session JSONL, and the current source path still matches the unguarded release/reacquire race. Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land the focused guard after maintainer acceptance of the availability tradeoff, current-main merge proof, and coordination with the separate same-lane session-race fixes. Do we have a high-confidence way to reproduce the issue? Yes. The PR body provides a concrete before/after harness for two real controllers on one session JSONL, and the current source path still matches the unguarded release/reacquire race. Is this the best way to solve the issue? Yes for the cross-lane same-file race. The patch serializes only real prompt windows by resolved session file and keeps compaction idle waits out of that queue; the separate same-lane race remains tracked by #86584. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against ea2496b00c5c. Label changesLabel justifications:
Evidence reviewedPR surface: Source +128, Tests +267. Total +395 across 3 files. View PR surface stats
Acceptance criteria:
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
ClawSweeper PR egg ✨ Hatched: 💎 rare Tiny Crabkin Hatch commandComment Hatchability rules:
Rarity: 💎 rare. What is this egg doing here?
|
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
@clawsweeper re-review PR body updated with verbatim runtime output from the patched controller — Scenario A (same-file serialisation) and Scenario B (cleanup-on-failure when reacquireAfterPrompt throws |
|
🦞👀 Command router queued. I will update this comment with the next step. Re-review progress:
|
|
Added a before/after reproduction to the Evidence section. Same
The patch saves lane B by serialising its prompt window behind lane A's turn — visible in the timestamps: lane B's |
|
@clawsweeper re-review The PR body's Evidence section now leads with a pre-patch vs post-patch reproduction run — same script ( Lane A's takeover is the unavoidable case (external mutation during its prompt window — the fence correctly fires as the safety net). Lane B is the cascade casualty the patch saves by serialising B's release behind A's turn. |
|
🦞👀 Command router queued. I will update this comment with the next step. Re-review progress:
|
|
The 1. Normal success path —
2.
3.
4.
5. Provider stream throws — the wrapped 6. Abort signal fires mid-prompt — same path as case 5; the abort propagates as a thrown error, the wrapping 7. Process exits mid-prompt — the entire The one residual scenario the patch cannot defend against is a provider stream that hangs indefinitely without the abort signal ever firing —
This isn't a new exposure: pre-patch, an indefinitely-hung The cleanup paths are exhaustively tested:
And the same scenarios run end-to-end against real |
|
@clawsweeper re-review Added commit Existing tests updated for the new event-order marker; one new test ( |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
Force-pushed What the bot caught: The rebase that removed the kesslerio drain commit from this branch also stripped unrelated infrastructure — Restored:
New regression (the bot-required it("dispose-after-releaseForPrompt vacates the prompt turn so the next same-file controller does not hang", async () => {
await controllerA.releaseForPrompt();
await controllerA.dispose(); // teardown without reacquireAfterPrompt
await controllerA.dispose(); // idempotent
// Controller B on same file must NOT hang on A's vacated turn.
await expect(Promise.race([
controllerB.releaseForPrompt().then(() => "released"),
new Promise<string>(r => setTimeout(() => r("timeout"), 200)),
])).resolves.toBe("released");
});80 tests green (was 74 before; +2 restored + 1 new). @clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
…d races Two embedded-runner lanes on the same OpenClaw process (typically a heartbeat and a channel reply, or a fresh stuck-session recovery run whose UUID aliases to an active session file) can both release their session write lock for provider streaming, both stride into the model call, and have the takeover fence fire mid-stream on whichever reacquires second. The losing lane's reply is dropped; the user sees 'Embedded agent failed before reply: All models failed'. Add a file-scoped prompt-window queue (module-level Map keyed by the resolved session JSONL path, alongside the existing ownedSessionFileWrites / trustedSessionFileStates singletons) that serialises prompt windows on the same file. The second entrant yields its OS lock, waits for the first lane's turn to vacate, reacquires, then proceeds with its own release/prompt/reacquire cycle. Critically, every exit path vacates the turn: the catch in performHeldLockReleaseForWindow handles errors before the controller takes ownership of the turn, and the finally in reacquireAfterPrompt handles errors after. Lock-timeout, takeover-fence trip, mid-flight release failure, or no-op early returns all leave the queue clean for later same-file waiters. Without the cleanup path, one wedged prompt would block every later prompt on that session file for the rest of the process. The post-prompt compaction-wait release now uses the new releaseForSessionIdleWait(), which goes through the same lock-release machinery but skips queue registration — compaction-wait isn't a provider-prompt window and must not block other lanes' real prompts. Tests: 5 new cases covering same-file serialisation, both cleanup-on-failure paths (reacquire throws + release throws), compaction-wait isolation, and cross-file isolation. Existing multi-controller tests get a reset hook between the first and second controllers' releaseForPrompt calls so their assertions hold against the new queue without forcing them to invent reacquire calls they don't need. afterEach now clears the queue for test isolation. Acceptance criteria (74 + 232 = 306 tests across 7 files, the issue's bot review named four of them): - attempt.session-lock.test.ts (74 tests, includes the new cases) - diagnostic-stuck-session-recovery.runtime.test.ts - heartbeat-runner.skips-busy-session-lane.test.ts - model-fallback.test.ts - failover-error.test.ts Closes openclaw#85913.
…ardown ClawSweeper review on the previous force-push found that the rebase that dropped the kesslerio drain commit also removed unrelated infrastructure: the EmbeddedAttemptSessionLockController.dispose() method, its outer- finally call site in runEmbeddedAttempt, and two existing test cases on main that exercise it. Without dispose() the eager session lock leaks on post-prompt error paths (the original openclaw#86014 fix). The new activePromptSessionTurn introduced by this PR makes the gap worse — teardown after releaseForPrompt without going through reacquireAfterPrompt leaves the file-scoped queue tail unresolved and the next same-file controller waits forever (bot finding, confidence 0.9). Restored: - EmbeddedAttemptSessionLockController.dispose(): Promise<void> back on the type and implementation. Behavior: vacate activePromptSessionTurn first (new), then release heldLock if held (the original main behavior). Idempotent — both heldLock and activePromptSessionTurn are cleared after first call. - releaseRetainedSessionLock infrastructure in runEmbeddedAttempt: declaration before the try block, assignment after the controller constructor, and the outer-finally invocation with the log.error fallback. Restored from main as it was before this branch's rebase. - The two existing dispose tests from main (openclaw#86014): "releases the eagerly-held attempt lock on dispose when cleanup is skipped" and "dispose does not double-release a lock already handed to cleanup". New regression for the prompt-turn cleanup invariant: - "dispose-after-releaseForPrompt vacates the prompt turn so the next same-file controller does not hang". Verifies the case the bot flagged: controller A calls releaseForPrompt, then dispose without going through reacquireAfterPrompt. Controller B on the same session file must be able to releaseForPrompt without hanging. The test races releaseForPrompt against a 200 ms timeout and asserts the release resolves first. 80 cases across 2 test files green. Closes openclaw#85913.
9d03149 to
2606e52
Compare
|
Force-pushed Two issues from the prior ClawSweeper review resolved:
The two pre-existing Diff stats post-rebase: Source +140 / Tests +267 across 3 files (was +345 / +306 pre-rebase, much of the apparent "delta" was upstream code that landed via #86427). @clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
Thanks @ubehera for putting together this focused file-scoped prompt-window guard PR and for iterating through the cleanup/availability edge cases. This PR helped validate the right fix shape: the bug was not the takeover fence itself, but allowing multiple embedded prompt owners to enter the same physical transcript file window. We ended up landing the fix through #87159 in commit 3349fe2. That merged implementation keeps the same core invariant from this PR, but broadens it to the rest of the observed failure surface: canonical session-file ownership, active-run indexing by session file, reply/recovery sibling lookup, and production diagnostic heartbeat recovery carrying The merged PR also includes focused regressions and AWS Crabbox proof for the same-file sibling recovery and heartbeat path. Since #87159 is now merged, I’m closing this PR as superseded by the landed fix, with thanks for the original implementation and review work here. |
Summary
releaseForSessionIdleWait()for the post-prompt compaction window so it does not block other lanes' real prompt windows.try/catchandtry/finallyso the file-scoped queue is vacated on success, takeover, lock-timeout, fence-mismatch, and mid-flight release errors. A wedge in one lane never leaves later same-file waiters blocked.Fixes #85913.
Background
createEmbeddedAttemptSessionLockControllerreleases the OS-level session write lock during provider streaming so long model calls don't hold the lock for minutes at a time. After streaming,reacquireAfterPrompt()reacquires the lock andassertSessionFileFence()checks an inode/size/mtime fingerprint to detect external mutation. If the file changed in an unowned way, the controller throwsEmbeddedAttemptSessionTakeoverError.The fence is a symptom-catcher, not a race preventer. The live-repro evidence on the issue shows the underlying race: two lanes on the same OpenClaw process can both resolve to the same JSONL (stuck-session recovery firing a fresh embedded run on a session UUID whose file resolution aliases to an already-active file). Both lanes release for prompt simultaneously, both stride into provider streams, the slower one's reacquire trips the fence after the provider call has already produced output, and the chain falls all the way to
Embedded agent failed before reply: All models failed. The user sees a dropped reply.RuntimeEnvis module-scoped atownedSessionFileWrites/trustedSessionFileStates— both keyed bypath.resolve(sessionFile). The fix extends that pattern with a third process-global map: a per-file queue of prompt turns.What changed
src/agents/pi-embedded-runner/run/attempt.session-lock.tspromptSessionFileTurnTails: Map<string, Promise<void>>keyed by the sameresolveSessionFileFenceKeythe fence already uses.acquirePromptSessionFileTurn(sessionFileKey)returns{ awaitPrior, release }. The first entrant on a file getsawaitPrior: undefinedand falls through with no extra acquire/release. The second entrant gets the first's tail and waits on it before its own prompt window opens.performHeldLockReleaseForWindow(registerPromptHolder)consolidates the held-lock release path. WhenregisterPromptHolderis true and a prior turn exists, the entrant yields its OS lock, awaits the prior turn, reacquires fresh, and then continues with the existing fingerprint/snapshot/release sequence. Wrapped intry/catchso any exception between acquiring the turn and the controller taking ownership releases the turn slot before rethrowing.releaseForPrompt()becomes a one-liner overperformHeldLockReleaseForWindow(true).releaseForSessionIdleWait()(new public method) becomes a one-liner overperformHeldLockReleaseForWindow(false). Used for the post-prompt compaction window, which is not a provider-prompt window and must not serialise other lanes' prompts.reacquireAfterPrompt()releases the turn in afinallyblock — the success path and every error branch (lock-timeout, takeover fence, any other exception) all vacate the slot. The no-op-early-return branch (takeoverDetected || heldLock) also releases the turn so a no-op entrant cannot strand the queue.resetEmbeddedAttemptSessionFilePromptGuardsForTest()(new test helper) clears the module-level map so tests start from a clean slate.releaseForSessionIdleWaiton theEmbeddedAttemptSessionLockControllertype.src/agents/pi-embedded-runner/run/attempt.tsreleaseForSessionIdleWait()instead ofreleaseForPrompt(). This path was already a non-prompt release — the rename makes it not contend with real prompt windows on the same file. The matching post-compactionwithSessionWriteLockreacquires the OS lock internally; no other behavior changes.src/agents/pi-embedded-runner/run/attempt.session-lock.test.tsafterEachclears the new queue alongside the existingresetSessionWriteLockStateForTest()cleanup.resetEmbeddedAttemptSessionFilePromptGuardsForTest()between the first controller'sreleaseForPrompt()and the second controller'sreleaseForPrompt(). These tests were written before the file-scoped guard existed and usereleaseForPrompt()as a "drop the OS lock" call without a pairedreacquireAfterPrompt(). The reset preserves their existing assertions without forcing them to invent reacquire calls they don't need.describeblock:releaseForPrompt()blocks until the first callsreacquireAfterPrompt().reacquireAfterPrompt()rejects with aSessionWriteLockTimeoutError; the second controller'sreleaseForPrompt()still completes (the turn was released in thefinally).releaseForPrompt(); the second controller'sreleaseForPrompt()still completes (the turn was released in thecatchbefore re-throwing).releaseForSessionIdleWait()does not block another lane'sreleaseForPrompt()on the same file.sessionFiles do not serialise.Verification
pnpm check:changedexit 0 (typecheck, lint, runtime import cycles, repo guards).Real behavior proof
Behavior or issue addressed: Two concurrent embedded attempts on the same resolved session JSONL no longer race during the provider-streaming window. The losing lane's reply was being dropped because the takeover fence fired mid-stream after the provider had already started producing output. With the file-scoped turn queue, the second entrant waits for the first to reacquire (or fail cleanly) before its own prompt window opens. Failure-path cleanup ensures one wedged prompt cannot block later prompts on the same file.
Real environment tested: Local OpenClaw source checkout (macOS 15.5, Node 22.19+) at the patched branch HEAD
65705d8c39. The patchedcreateEmbeddedAttemptSessionLockControllerruns in two real concurrent controllers against a shared session JSONL with the productionacquireSessionWriteLockcontract; per-microsecond timestamps and the lock-acquire/release event ledger are captured below. The acceptance-criteria test sweep from the issue's bot review runs against the same patched source.Exact steps or command run after this patch:
Evidence after fix:
Before/after reproduction — same scenario run against pre-patch (
5be62e779b, the commit one before this fix) and post-patch (65705d8c39) source via the samerun-same-file-race-repro.mjsscript. Two realEmbeddedAttemptSessionLockControllerinstances on a shared temp session JSONL; productionacquireSessionWriteLock(no test mocks); external mutation injected mid-prompt to simulate the aliased-UUID concurrent lane from the live-repro evidence in EmbeddedAttemptSessionTakeoverError races between heartbeat lane and channel/direct lane on same session file (internal ref #83510) #85913:Read the timestamps: on pre-patch, both controllers enter their mid-prompt sleep at ms 2 and 9 simultaneously (both have active fences capturing the same stale fingerprint), so when the external write hits at ms 111, both lanes observe the mutation when they reacquire at ms ~210. On post-patch, lane B's
prompt in flightlog line does not appear until ms 210 — which is the same moment lane A's reacquire throws. The file-scoped queue held lane B'sreleaseForPromptuntil lane A's turn vacated; by then the file was stable, so lane B's release captured a fresh fingerprint and its reacquire succeeded at ms 415.Lane A's takeover is the unavoidable case (a true external mutation during its prompt window — no coordination primitive can save it; the fence correctly fires as the safety net). Lane B is the cascade casualty the patch saves.
Internal controller proof — earlier in the proof pipeline, two real controllers driven through same-file contention with timestamped event ledger. Verbatim stdout from the run on this branch (
65705d8c39):Scenario A's event ledger is the key observable: lock A-1 is the first controller's initial acquire, A-2 the second's initial acquire, A-3 is the first controller's reacquire-after-prompt (which fires the turn release on the queue), and A-4 is the second controller's reacquire-fresh after its prior-wait dance. Without the file-scoped guard, both A-1 and A-2 would release simultaneously and both controllers would stride into provider streams.
Scenario B's timestamps prove the cleanup-on-failure path is wired correctly:
secondCompletedflips fromfalse→truewithin 80 µs of the first controller's reacquire throwing. IfreacquireAfterPrompt's finally block weren't releasing the turn on the error path, this assertion would have timed out instead.Targeted session-lock suite (74 tests including the 5 new file-scoped guard regressions):
Full acceptance-criteria sweep (the five files the issue's bot review named):
The two regression tests that distinguish a correct fix from a half-fix are:
vacates the prompt turn so later waiters proceed when reacquireAfterPrompt throws— drives the first controller's reacquire mock to reject withSessionWriteLockTimeoutError. Asserts that a second waiter blocked on the first's turn completes itsreleaseForPrompt()after the throw. If the cleanup is wired only on the success path, this test times out (the second waiter blocks forever on a turn promise that never resolves).vacates the prompt turn when releaseForPrompt itself fails mid-flight— drives the first controller's held-lock release to reject. The firstreleaseForPrompt()throws partway, but thecatchinperformHeldLockReleaseForWindowreleases the turn before rethrowing. A secondreleaseForPrompt()then completes cleanly. If the cleanup is only inreacquireAfterPrompt, this test also times out.Both of these are deterministic — they trigger the failure path via
vi.fn().mockRejectedValueOnce(...), not via timing.Observed result after fix: Concurrent same-file prompt windows serialise instead of racing. Failed prompt windows (lock-timeout, takeover, release error) vacate the file-scoped queue so subsequent prompts on the same file proceed. The takeover fence remains in place as a defense-in-depth check against external mutation.
What was not tested: A live multi-agent gateway run that fires stuck-session recovery on an aliased session UUID. The unit tests cover the locking primitive directly through its public contract; the existing
heartbeat-runner.skips-busy-session-lane.test.tsanddiagnostic-stuck-session-recovery.runtime.test.tscontinue to pass and exercise the surrounding flows. Verifying the end-to-end fix against the real 122-event/3-day takeover pattern from the issue would require running a patched gateway under that workload for the same window.Before evidence: Without the file-scoped guard, two same-file controllers both call
releaseForPrompt(), both stride into provider streams, and whichever reacquires second observes the other's writes and throwsEmbeddedAttemptSessionTakeoverError. The chain falls throughmodel-fallback→failover-error→Embedded agent failed before reply: All models failed. The live log slices in EmbeddedAttemptSessionTakeoverError races between heartbeat lane and channel/direct lane on same session file (internal ref #83510) #85913 show 122 such events across 3 days on one local gateway, including the user-visible drop trace at9dada894-1cb4-4610-a135-68d5ece67e51on 2026-05-19 02:42:34.Refreshed proof for current HEAD
2606e52d74(post-rebase + dispose cleanup)The PR was rebased onto current
upstream/main(which includes #86427's retained-lockdispose()from32ddfc22f5). The merge conflict inattempt.session-lock.tswas resolved by keeping the newactivePromptSessionTurn.release()block at the top ofdispose()and preserving the existingheldLockrelease that landed in main. Run against the rebased HEAD2606e52d74:Test suite — 80 cases across 2 files green:
The two existing dispose tests from #86014 ("releases the eagerly-held attempt lock on dispose when cleanup is skipped" and "dispose does not double-release a lock already handed to cleanup") still pass after the rebase — the dispose extension is backward-compatible.
Dispose-after-releaseForPrompt proof harness — three scenarios driven against real
EmbeddedAttemptSessionLockControllerinstances on a shared temp session JSONL with productionacquireSessionWriteLock. Verbatim stdout:Scenario B is the bug-shape proof:
ctrlA.releaseForPrompt()setsactivePromptSessionTurnwithout subsequently releasing it via eitherreacquireAfterPromptordispose.ctrlB.releaseForPrompt()then blocks on the prior turn's tail, the 200ms test timeout fires, and the harness recordstimeout. Scenario C confirms thatdispose()'s newactivePromptSessionTurn.release()call at the top of the body fixes this —ctrlB's release completes in 3.52ms.Harness:
/Users/umank/code/openclaw-tickets/proof/run-dispose-after-release-proof.mjs.Risk / compatibility
EmbeddedAttemptSessionLockControlleris internal tosrc/agents/pi-embedded-runner/run/and not exposed throughopenclaw/plugin-sdk/*.releaseForPrompt()keeps its prior signature;releaseForSessionIdleWait()is additive. The only production caller ofreleaseForPromptisinstallPromptSubmissionLockRelease(attempt.session-lock.ts:1014), which already pairs it withreacquireAfterPromptin atry/finally— that wrapper is what makes the file-scoped guard correct in production.promptSessionFileTurnTails) sits alongside the existingownedSessionFileWritesandtrustedSessionFileStatessingletons; same lifetime, same key scheme, same reset hook. No persistent state, no disk artefacts, no cross-process coordination.acquireLock + releasepair on the second entrant to yield the OS lock during the wait — this is necessary because the prior turn-holder must be able to reacquire after their provider call.Security
This PR is AI-assisted. Code authored with Claude.