fix(agents): tolerate in-process session writes during prompt release by tianxiaochannel-oss88 · Pull Request #84250 · openclaw/openclaw

tianxiaochannel-oss88 · 2026-05-19T17:56:36Z

Summary

Track exact same-process, lock-owned session-file fingerprints across embedded attempt controllers.
Allow a released prompt attempt to refresh its fence only when the current file fingerprint exactly matches a later explicit OpenClaw-owned transcript write.
Preserve fail-closed takeover detection for unowned external session-file changes.
Forward the owned-write publication flag through the production promptActiveSession transcript context.
Split transcript locking from global owned-fingerprint publication, so broad same-process callbacks, cleanup release, and mixed external edits do not publish globally trusted fingerprints.

Root Cause

The embedded attempt session fence only compared the session file's filesystem fingerprint saved at releaseForPrompt() against the fingerprint observed after the prompt lock was reacquired. That correctly rejects unowned external edits, but it also rejects legitimate OpenClaw-managed transcript appends from another same-process controller/attempt that acquired the session write lock while the first attempt was released for model I/O.

The fix keeps the fingerprint fence, but makes it ownership-aware and narrow: OpenClaw records a globally owned post-write fingerprint only for the exact transcript append publication section, routed through runWithOwnedSessionTranscriptWritePublication(), and only when the pre-write fingerprint is already trusted by the process. The broader runWithOwnedSessionTranscriptWriteLock() now only provides lock serialization. A released attempt accepts the change only when the current file fingerprint exactly equals a newer owned fingerprint. Direct external edits, external edits followed by a broad locked append, external edits interleaved during cleanup, and external edits inside the broader owned transcript lock still fail closed.
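
The ownership-aware fence described above can be sketched roughly as follows. This is a minimal in-memory model, not the code in attempt.session-lock.ts: `OwnedFingerprintRegistry`, `publishOwnedAppend`, `refreshFenceOnReacquire`, and the string fingerprints are all hypothetical names chosen for illustration, and the ordering check for "newer" owned fingerprints is left out for brevity.

```typescript
// Illustrative model only: every name below is hypothetical, not OpenClaw's API.
type Fingerprint = string; // stands in for whatever the real fence derives from the file

class TakeoverError extends Error {}

class OwnedFingerprintRegistry {
  private trusted = new Map<string, Fingerprint>();
  private owned = new Map<string, Set<Fingerprint>>();

  // Record the first fingerprint the process observes for a session file.
  trustInitial(file: string, fingerprint: Fingerprint): void {
    if (!this.trusted.has(file)) this.trusted.set(file, fingerprint);
  }

  // Narrow publication: accepted only when the pre-write state was already trusted.
  publishOwnedAppend(file: string, before: Fingerprint, after: Fingerprint): boolean {
    if (this.trusted.get(file) !== before) return false; // fail closed
    this.trusted.set(file, after);
    const set = this.owned.get(file) ?? new Set<Fingerprint>();
    set.add(after);
    this.owned.set(file, set);
    return true;
  }

  isOwned(file: string, fingerprint: Fingerprint): boolean {
    return this.owned.get(file)?.has(fingerprint) ?? false;
  }
}

// Called when a released attempt reacquires the prompt lock.
function refreshFenceOnReacquire(
  registry: OwnedFingerprintRegistry,
  file: string,
  savedAtRelease: Fingerprint,
  current: Fingerprint,
): Fingerprint {
  if (current === savedAtRelease) return current; // nothing changed while released
  if (registry.isOwned(file, current)) return current; // exact owned match: refresh the fence
  throw new TakeoverError(`session file changed while prompt lock was released: ${file}`);
}
```

In this model, an external edit followed by a locked append is still rejected, because the append's pre-write fingerprint was never trusted and so its post-write fingerprint is never published.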

Fixes #84059.

Real behavior proof

Behavior or issue addressed: A prompt-released embedded attempt no longer throws EmbeddedAttemptSessionTakeoverError when another controller publishes an explicit owned transcript append for the same session file. External unowned file changes still throw, including an external edit interleaved during a broad same-process locked callback, an external edit interleaved while another controller holds the cleanup lock, and an external edit inside the broader owned transcript lock before the narrow append publication section.
Real environment tested: Local patched OpenClaw checkout at original proof head 7008438d978c75a21e811df4e2dc041d9f9dfe71, then rebased/verified again through current PR head da570f3c09b47fce4ee8bd0c92fbb91d1d3d8ea4, macOS arm64, Node.js v24.15.0, real temp session files on the local filesystem, real OpenClaw acquireSessionWriteLock.
Failing proof before fix: The new regression test was run before the implementation and failed with the reported error:

AssertionError: promise rejected "EmbeddedAttemptSessionTakeoverError: sess..." instead of resolving

Caused by: EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: .../session.jsonl

Exact steps or command run after this patch:

node --import tsx ../proof-embedded-session-lock-in-process.ts

Evidence after fix:

OpenClaw proof checkout: /Users/tianxiao/Documents/Codex/2026-05-19/openclaw-message-tool-bug/openclaw-community
In-process locked write result: post-write-committed
In-process takeover detected: false
In-process session line count: 3
External direct write error: EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: /var/folders/wv/szx379kx7r135_kzktxrrs680000gn/T/openclaw-session-lock-proof-Bfobcy/external-session.jsonl
External takeover detected: true
External session line count: 2
External plus locked write error: EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: /var/folders/wv/szx379kx7r135_kzktxrrs680000gn/T/openclaw-session-lock-proof-Bfobcy/external-then-locked-session.jsonl
External plus locked takeover detected: true
External plus locked session line count: 3

Observed result after fix: The in-process two-controller proof commits the post-prompt write without setting takeover. Separate external direct and external-plus-locked cases still reject with EmbeddedAttemptSessionTakeoverError, proving a locked append does not launder an earlier external edit. The latest unit regressions also prove cleanup-lock release and broad owned transcript lock scopes do not launder external edits into globally owned fingerprints.
What was not tested: I did not run a live Feishu, Slack, WebChat, or Discord channel smoke. The proof uses the embedded session-lock controller directly with real filesystem writes to reproduce the source-level race without sending live channel messages.

Tests

After rebasing onto current origin/main (5d775122c1):

node scripts/run-vitest.mjs run --config test/vitest/vitest.agents-pi-embedded.config.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts -> 20 passed
node scripts/run-vitest.mjs src/config/sessions/transcript.test.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts -> 3 files, 62 passed
node scripts/run-tsgo.mjs -p test/tsconfig/tsconfig.core.test.json --incremental --tsBuildInfoFile .artifacts/tsgo-cache/core-test.tsbuildinfo -> passed
node scripts/run-tsgo.mjs -p tsconfig.core.json --incremental --tsBuildInfoFile .artifacts/tsgo-cache/core.tsbuildinfo -> passed
git diff --check -> passed

Security-boundary update

Added a production-shaped prompt context regression so the publishOwnedWrite option is forwarded to sessionLockController.withSessionWriteLock().
Removed cleanup-lock release fingerprint publication; cleanup no longer marks changed final session-file state as globally owned.
Split broad owned transcript locking from narrow owned transcript write publication.
Added regressions for external appends interleaved while another controller holds cleanup lock and inside a broad owned transcript lock.
Ordinary broad withSessionWriteLock() callbacks still refresh their own controller fence, but no longer publish a globally owned fingerprint for other released controllers to accept.
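
The option-forwarding regression mentioned in the first item can be pictured with a small sketch. The names `publishOwnedWrite` and `withSessionWriteLock` come from this PR, but the synchronous signatures, the `WriteLockOptions` and `SessionLockController` interfaces, `createTranscriptContext`, and the recording fake are assumptions made for illustration; the production code is likely asynchronous and shaped differently.

```typescript
// Assumed, simplified shapes; not the real OpenClaw types.
interface WriteLockOptions {
  publishOwnedWrite?: boolean;
}

interface SessionLockController {
  withSessionWriteLock<T>(fn: () => T, options?: WriteLockOptions): T;
}

// Hypothetical transcript context that wraps the controller for prompt-time writes.
function createTranscriptContext(controller: SessionLockController) {
  return {
    withSessionWriteLock<T>(fn: () => T, options?: WriteLockOptions): T {
      // Forward options unchanged. Dropping them here would silently turn an
      // explicit owned-write publication into a plain locked write.
      return controller.withSessionWriteLock(fn, options);
    },
  };
}

// Recording fake used to check what actually reaches the controller.
const seenOptions: Array<WriteLockOptions | undefined> = [];
const fakeController: SessionLockController = {
  withSessionWriteLock<T>(fn: () => T, options?: WriteLockOptions): T {
    seenOptions.push(options);
    return fn();
  },
};
```

A regression in this style calls the context with `{ publishOwnedWrite: true }` and asserts that the fake controller saw the same flag.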

clawsweeper · 2026-05-19T17:57:59Z

Codex review: needs changes before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR adds owned session-file fingerprint publication and prompt-lock reacquisition so embedded attempts can tolerate same-process transcript writes while still rejecting external session-file edits.

Reproducibility: yes. Current main's source path rejects any prompt-release fingerprint drift, and the PR body/comments include real filesystem proof for the false-takeover path while the remaining stale-baseline hole is source-reproducible from the new trust map logic.

PR rating
Overall: 🧂 unranked krab
Proof: 🦞 diamond lobster
Patch quality: 🧂 unranked krab
Summary: Proof is strong, but the patch is not quality-ready because a stale trusted-baseline hole can leave the target session race unfixed.

Rank-up moves:

Add the stale-baseline regression described in the finding.
Repair the trust bookkeeping while preserving the external-edit fail-closed tests.

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Sufficient (terminal): The PR body and follow-up comments provide terminal proof with real filesystem session files plus focused validation on the current changed paths.

Risk before merge

Merging as-is may leave the false session-takeover race unfixed after the first legitimate non-published local owned write advances the session file.
The PR overlaps the still-open append-validation approach in #84046 ("fix(agents): stop false-positive session-takeover on runner's own transcript appends"), so maintainers should keep one canonical session-fence model.
The repair must avoid broad trust of lock-held current state, because that would risk laundering external session-file edits into accepted owned fingerprints.

Maintainer options:

Fix stale trusted baselines before merge (recommended): Add a regression where a legitimate non-published owned write advances the file before a later explicit publication, then update the trust bookkeeping so that publication is accepted without accepting external edits.
Pause for the append-validation model: If maintainers prefer the append-shape validation approach in #84046 ("fix(agents): stop false-positive session-takeover on runner's own transcript appends"), pause this PR rather than landing two competing ownership models.

Copy recommended automerge instruction

@clawsweeper automerge

Special instructions:
Add stale trusted-baseline regression coverage in `src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts`, then update `src/agents/pi-embedded-runner/run/attempt.session-lock.ts` so proven same-process owned local writes can advance the trusted baseline while external drift before or inside broad locks still fails closed.

Next step before merge
There is one narrow, source-backed blocker in the session-lock trust bookkeeping with a clear regression-test shape.

Security
Cleared: No supply-chain, credential, or permission issue was found; the session takeover trust-boundary concern is captured as a blocking functional finding.

Review findings

[P1] Advance trusted baselines after local owned writes — src/agents/pi-embedded-runner/run/attempt.session-lock.ts:191-195

Review details

Best possible solution:

Keep explicit publication narrow, but advance the trusted baseline for proven same-process owned states and add stale-baseline regressions alongside the existing external-edit fail-closed cases.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main's source path rejects any prompt-release fingerprint drift, and the PR body/comments include real filesystem proof for the false-takeover path while the remaining stale-baseline hole is source-reproducible from the new trust map logic.

Is this the best way to solve the issue?

No. The explicit-publication direction is plausible, but this implementation needs the trusted baseline repair before it is the narrow maintainable fix.

Label changes:

add proof: sufficient: Contributor real behavior proof is sufficient. The PR body and follow-up comments provide terminal proof with real filesystem session files plus focused validation on the current changed paths.

Label justifications:

P1: The PR targets an urgent embedded-agent session takeover regression that can abort replies and lose visible response state.
merge-risk: 🚨 session-state: The diff changes how released prompt attempts trust session transcript fingerprints across controllers.
merge-risk: 🚨 security-boundary: The diff adjusts the fail-closed boundary between OpenClaw-owned and unowned session-file writes.
rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🦞 diamond lobster and patch quality is 🧂 unranked krab. Proof is strong, but the patch is not quality-ready because a stale trusted-baseline hole can leave the target session race unfixed.
status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action.
proof: sufficient: Contributor real behavior proof is sufficient. The PR body and follow-up comments provide terminal proof with real filesystem session files plus focused validation on the current changed paths.

Full review comments:

[P1] Advance trusted baselines after local owned writes — src/agents/pi-embedded-runner/run/attempt.session-lock.ts:191-195
After the first trusted fingerprint is recorded, this branch returns undefined for any later current fingerprint, even when the file was advanced by a legitimate local owned write such as refreshAfterOwnedSessionWrite() or a broad non-published withSessionWriteLock(). That leaves trustedSessionFileStates stuck on the old fingerprint; a later runWithOwnedSessionTranscriptWritePublication() sees its beforeWrite as untrusted, never records the owned append, and released peer controllers still throw EmbeddedAttemptSessionTakeoverError. Please add that stale-baseline regression and advance only proven same-process owned baselines without accepting external edits.
Confidence: 0.88
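
The stale-baseline finding can be reproduced in a small in-memory model. This is a sketch under stated assumptions: `TrustMap`, `noteLocalOwnedWrite`, and the `advanceOnLocalOwnedWrite` switch are hypothetical and only stand in for the trust bookkeeping the review describes. They are not the functions in attempt.session-lock.ts.

```typescript
// Illustrative model of the trust bookkeeping; all names are hypothetical.
type Fingerprint = string;

class TrustMap {
  private trusted = new Map<string, Fingerprint>();
  private owned = new Set<string>();

  constructor(private readonly advanceOnLocalOwnedWrite: boolean) {}

  trustInitial(file: string, fingerprint: Fingerprint): void {
    if (!this.trusted.has(file)) this.trusted.set(file, fingerprint);
  }

  // A proven same-process owned write that is not globally published,
  // for example a controller-local refresh after its own write.
  noteLocalOwnedWrite(file: string, before: Fingerprint, after: Fingerprint): void {
    if (this.advanceOnLocalOwnedWrite && this.trusted.get(file) === before) {
      this.trusted.set(file, after); // advance the baseline, publish nothing
    }
  }

  // Explicit publication: requires a trusted pre-write fingerprint.
  publish(file: string, before: Fingerprint, after: Fingerprint): boolean {
    if (this.trusted.get(file) !== before) return false;
    this.trusted.set(file, after);
    this.owned.add(`${file}\u0000${after}`);
    return true;
  }
}

// Local owned write v1 -> v2, then an explicit publication v2 -> v3.
function staleBaselineScenario(advance: boolean): boolean {
  const map = new TrustMap(advance);
  map.trustInitial("session.jsonl", "v1");
  map.noteLocalOwnedWrite("session.jsonl", "v1", "v2");
  return map.publish("session.jsonl", "v2", "v3");
}

// External edit v1 -> x1 (never noted), then a locked append x1 -> x2.
function externalEditScenario(): boolean {
  const map = new TrustMap(true);
  map.trustInitial("session.jsonl", "v1");
  return map.publish("session.jsonl", "x1", "x2");
}
```

With `advance` set to false the later publication is refused, which matches the reported hole. With it set to true the publication is accepted, while the external-edit scenario is still refused because nothing proven advanced the baseline to the edited state.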

Overall correctness: patch is incorrect
Overall confidence: 0.84

Acceptance criteria:

node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts src/config/sessions/transcript.test.ts -- --reporter=verbose
node scripts/run-tsgo.mjs -p test/tsconfig/tsconfig.core.test.json --incremental --tsBuildInfoFile .artifacts/tsgo-cache/core-test.tsbuildinfo
node scripts/run-tsgo.mjs -p tsconfig.core.json --incremental --tsBuildInfoFile .artifacts/tsgo-cache/core.tsbuildinfo
git diff --check

What I checked:

PR trust baseline logic: The PR records one trusted fingerprint per session file, but trustSessionFileState() returns undefined when the current fingerprint differs from the existing trusted state instead of advancing a proven local-owned baseline. (src/agents/pi-embedded-runner/run/attempt.session-lock.ts:187, 28f9b9055122)
Owned publication gate: publishOwnedSessionFileWriteIfChanged() only records a global owned write when the pre-write fingerprint is already trusted, so a stale trusted baseline prevents later explicit publications from being visible to released peer controllers. (src/agents/pi-embedded-runner/run/attempt.session-lock.ts:452, 28f9b9055122)
Local owned writes are not globally published: refreshAfterOwnedSessionWrite() still refreshes only the controller-local fence, so normal SessionManager writes can advance the file without advancing trustedSessionFileStates. (src/agents/pi-embedded-runner/run/attempt.session-lock.ts:510, 28f9b9055122)
Current-main strict fence behavior: Current main still throws EmbeddedAttemptSessionTakeoverError whenever the saved prompt-release fingerprint differs from the reacquired fingerprint. (src/agents/pi-embedded-runner/run/attempt.session-lock.ts:354, 23c58081d062)
History provenance: Current main's embedded session fence and local owned-write refresh behavior were introduced in 2bb00f6726d4245c673c40bcbfe4416879e3512f. (src/agents/pi-embedded-runner/run/attempt.session-lock.ts:313, 2bb00f6726d4)
Related session-fence context: The discussion links the same false-takeover cluster to #84059 ("EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released") and to the competing open approach in #84046 ("fix(agents): stop false-positive session-takeover on runner's own transcript appends"). #84437 ("fix(pi): keep message-tool delivery in session lock") has already merged the delivery-mirror session-lock path this PR builds on.

Likely related people:

vincentkoc: Introduced the current-main embedded session fence, local owned-write refresh, and transcript write context changes in the central files. (role: recent area contributor; confidence: high; commits: 2bb00f6726d4; files: src/agents/pi-embedded-runner/run/attempt.session-lock.ts, src/agents/pi-embedded-runner/run/attempt.ts, src/config/sessions/transcript-write-context.ts)
dr00-eth: Authored the related message-tool delivery session-lock fix that was merged through the ClawSweeper replacement path and is part of the same transcript ownership cluster. (role: adjacent feature contributor; confidence: medium; commits: 7ff20f6dac40, 65030f31649b; files: src/agents/pi-embedded-runner/run/attempt.session-lock.ts, src/config/sessions/transcript-write-context.ts)
jalehman: Added the PR follow-up for prompt-stream reacquisition after live cron/fallback logs and is assigned on this session-lock review path. (role: assigned reviewer and PR follow-up contributor; confidence: medium; commits: c6f8a9e1458e; files: src/agents/pi-embedded-runner/run/attempt.session-lock.ts, src/agents/pi-embedded-runner/run/attempt.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 23c58081d062.

dr00-eth · 2026-05-19T19:51:49Z

Cross-linking the related fixes I opened after tracing the live Discord duplicate-reply repro:

#84289 ("fix(pi): keep message-tool delivery in session lock") fixes the false session-takeover root cause by routing owned delivery-mirror transcript appends through the active Pi attempt's session write lock, then treating a successful in-channel message(action=send) as terminal for message_tool_only source replies.
#84288 ("fix(discord): avoid duplicate typing keepalive for tool replies") handles the adjacent Discord typing lifecycle issue by avoiding duplicate typing keepalive ownership for message-tool-only replies.

This PR appears to address a similar symptom from the locked-delivery-flush angle. In the live repro, the visible reply had already been delivered, but the tool result was later repaired as aborted after the delivery mirror advanced the same transcript outside the active prompt fence; fallback then replayed the user input and produced the duplicate visible reply. #84289 intentionally stays separate from this PR's approach so maintainers can compare which framing best matches expected product behavior.

tianxiaochannel-oss88 · 2026-05-19T20:13:19Z

@clawsweeper re-review

clawsweeper · 2026-05-19T20:15:07Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26122638265
Updated: 2026-05-19T20:22:16.257Z

clawsweeper · 2026-05-20T01:29:34Z

ClawSweeper PR egg

🔥 Warming up: real-behavior proof passed; findings, security review, or rank-up moves are still in progress.

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

Merged PRs are hatchable.
Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

What is this egg doing here?

Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

jalehman · 2026-05-21T22:06:17Z

Merged via squash.

Prepared head SHA: 33f88febc3225ebecd47d29e4eeefac0e43566ef
Merge commit: 1b77145687ca6c5234d8f853e155199340e637f4

Thanks @tianxiaochannel-oss88!

openclaw-barnacle Bot added the labels "agents" (Agent runtime and tooling), "size: S", and "triage: needs-real-behavior-proof" (Candidate: external PR needs after-fix proof from a real setup.) on May 19, 2026

openclaw-barnacle Bot added the label "proof: supplied" (External PR includes structured after-fix real behavior proof.) and removed the label "triage: needs-real-behavior-proof" (Candidate: external PR needs after-fix proof from a real setup.) on May 19, 2026

openclaw-barnacle Bot removed the label "proof: sufficient" (ClawSweeper judged the real behavior proof convincing.) on May 19, 2026

clawsweeper Bot added the label "status: 📣 needs proof" (The PR needs real behavior proof before ClawSweeper can clear the contributor ask.) and removed the label "status: ⏳ waiting on author" (ClawSweeper has contributor-facing work open and is waiting for author action.) on May 19, 2026

This was referenced May 19, 2026

fix(discord): avoid duplicate typing keepalive for tool replies #84288

Open

fix(pi): keep message-tool delivery in session lock #84289

Closed

dr00-eth mentioned this pull request May 19, 2026

EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released #84059

Closed

openclaw-barnacle Bot added size: M and removed size: S labels May 19, 2026

tianxiaochannel-oss88 marked this pull request as ready for review May 20, 2026 01:22

clawsweeper Bot added the label "proof: sufficient" (ClawSweeper judged the real behavior proof convincing.) on May 21, 2026

jalehman requested a review from a team as a code owner May 21, 2026 22:01

github-actions Bot added the label "dependencies-changed" (PR changes dependency-related files) on May 21, 2026

tianxiaochannel-oss88 and others added 8 commits May 21, 2026 18:04

fix(agents): tolerate in-process session writes during prompt release

3b630cd

fix(agents): avoid publishing unowned session fingerprints

ff3066e

fix(agents): trust session fingerprints before publishing

25b9559

fix(agents): narrow session lock owned write publishing

02cad4a

fix(agents): close session lock cleanup trust gaps

f59991c

fix(agents): reacquire embedded session lock after prompt

0e709a3

test(sessions): account for owned publication lock wrapper

5e1b021

docs: add session takeover changelog entry

33f88fe

github-actions Bot mentioned this pull request May 21, 2026

📡 Upstream Digest — 2026-05-21 22:58 UTC curtismercier/openclaw-mods#912

Open

clawsweeper Bot mentioned this pull request May 22, 2026

Multi-agent orchestration is unstable: concurrent agents add/config overwrites, session-lock failures, and detached child work #43367

Open

scotthuang mentioned this pull request May 22, 2026

Fix/defer assistant media transcript during active run #84165

Closed

23 tasks

clawsweeper Bot mentioned this pull request May 25, 2026

[Bug]: EmbeddedAttemptSessionTakeoverError during Discord runs: session file changed while embedded prompt lock was released #86508

Open

ubehera mentioned this pull request May 25, 2026

Hoist withOwnedSessionTranscriptWrites ALS scope to span agent.prompt() to fix vanilla-openclaw same-lane fence trip #86572

Open

rafaelreis-r mentioned this pull request May 25, 2026

Regression on v2026.5.22: event-loop starvation returns (87s session-lock phase, 31s loop delay) — ref #80695 #86509

Closed

This was referenced May 25, 2026

fix(agents): file-scoped prompt-window guard for same-session embedded races #86067

Closed

EmbeddedAttemptSessionTakeoverError cluster on 2026.5.22 — 13 events / 7 jobs / 42h #86845

Open

augusteo mentioned this pull request May 29, 2026

2026.5.18-boon.1 — session-takeover hardening getboon/openclaw#2

Merged

jalehman merged 8 commits into openclaw:main from tianxiaochannel-oss88:fix/embedded-session-lock-internal-writes
