Skip to content

fix(agents): tolerate in-process session writes during prompt release#84250

Merged
jalehman merged 8 commits into
openclaw:mainfrom
tianxiaochannel-oss88:fix/embedded-session-lock-internal-writes
May 21, 2026
Merged

fix(agents): tolerate in-process session writes during prompt release#84250
jalehman merged 8 commits into
openclaw:mainfrom
tianxiaochannel-oss88:fix/embedded-session-lock-internal-writes

Conversation

@tianxiaochannel-oss88

@tianxiaochannel-oss88 tianxiaochannel-oss88 commented May 19, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Track exact same-process, lock-owned session-file fingerprints across embedded attempt controllers.
  • Allow a released prompt attempt to refresh its fence only when the current file fingerprint exactly matches a later explicit OpenClaw-owned transcript write.
  • Preserve fail-closed takeover detection for unowned external session-file changes.
  • Forward the owned-write publication flag through the production promptActiveSession transcript context.
  • Split transcript locking from global owned-fingerprint publication, so broad same-process callbacks, cleanup release, and mixed external edits do not publish globally trusted fingerprints.

Root Cause

The embedded attempt session fence only compared the session file's filesystem fingerprint saved at releaseForPrompt() against the fingerprint observed after the prompt lock was reacquired. That correctly rejects unowned external edits, but it also rejects legitimate OpenClaw-managed transcript appends from another same-process controller/attempt that acquired the session write lock while the first attempt was released for model I/O.

The fix keeps the fingerprint fence, but makes it ownership-aware and narrow: OpenClaw records a globally owned post-write fingerprint only for the exact transcript append publication section, routed through runWithOwnedSessionTranscriptWritePublication(), and only when the pre-write fingerprint is already trusted by the process. The broader runWithOwnedSessionTranscriptWriteLock() now only provides lock serialization. A released attempt accepts the change only when the current file fingerprint exactly equals a newer owned fingerprint. Direct external edits, external edits followed by a broad locked append, external edits interleaved during cleanup, and external edits inside the broader owned transcript lock still fail closed.

Fixes #84059.

Real behavior proof

  • Behavior or issue addressed: A prompt-released embedded attempt no longer throws EmbeddedAttemptSessionTakeoverError when another controller publishes an explicit owned transcript append for the same session file. External unowned file changes still throw, including an external edit interleaved during a broad same-process locked callback, an external edit interleaved while another controller holds the cleanup lock, and an external edit inside the broader owned transcript lock before the narrow append publication section.
  • Real environment tested: Local patched OpenClaw checkout at original proof head 7008438d978c75a21e811df4e2dc041d9f9dfe71, then rebased/verified again through current PR head da570f3c09b47fce4ee8bd0c92fbb91d1d3d8ea4, macOS arm64, Node.js v24.15.0, real temp session files on the local filesystem, real OpenClaw acquireSessionWriteLock.
  • Failing proof before fix: The new regression test was run before the implementation and failed with the reported error:
AssertionError: promise rejected "EmbeddedAttemptSessionTakeoverError: sess..." instead of resolving

Caused by: EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: .../session.jsonl
  • Exact steps or command run after this patch:
node --import tsx ../proof-embedded-session-lock-in-process.ts
  • Evidence after fix:
OpenClaw proof checkout: /Users/tianxiao/Documents/Codex/2026-05-19/openclaw-message-tool-bug/openclaw-community
In-process locked write result: post-write-committed
In-process takeover detected: false
In-process session line count: 3
External direct write error: EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: /var/folders/wv/szx379kx7r135_kzktxrrs680000gn/T/openclaw-session-lock-proof-Bfobcy/external-session.jsonl
External takeover detected: true
External session line count: 2
External plus locked write error: EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: /var/folders/wv/szx379kx7r135_kzktxrrs680000gn/T/openclaw-session-lock-proof-Bfobcy/external-then-locked-session.jsonl
External plus locked takeover detected: true
External plus locked session line count: 3
  • Observed result after fix: The in-process two-controller proof commits the post-prompt write without setting takeover. Separate external direct and external-plus-locked cases still reject with EmbeddedAttemptSessionTakeoverError, proving a locked append does not launder an earlier external edit. The latest unit regressions also prove cleanup-lock release and broad owned transcript lock scopes do not launder external edits into globally owned fingerprints.
  • What was not tested: I did not run a live Feishu, Slack, WebChat, or Discord channel smoke. The proof uses the embedded session-lock controller directly with real filesystem writes to reproduce the source-level race without sending live channel messages.

Tests

After rebasing onto current origin/main (5d775122c1):

  • node scripts/run-vitest.mjs run --config test/vitest/vitest.agents-pi-embedded.config.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts -> 20 passed
  • node scripts/run-vitest.mjs src/config/sessions/transcript.test.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts -> 3 files, 62 passed
  • node scripts/run-tsgo.mjs -p test/tsconfig/tsconfig.core.test.json --incremental --tsBuildInfoFile .artifacts/tsgo-cache/core-test.tsbuildinfo -> passed
  • node scripts/run-tsgo.mjs -p tsconfig.core.json --incremental --tsBuildInfoFile .artifacts/tsgo-cache/core.tsbuildinfo -> passed
  • git diff --check -> passed

Security-boundary update

  • Added a production-shaped prompt context regression so the publishOwnedWrite option is forwarded to sessionLockController.withSessionWriteLock().
  • Removed cleanup-lock release fingerprint publication; cleanup no longer marks changed final session-file state as globally owned.
  • Split broad owned transcript locking from narrow owned transcript write publication.
  • Added regressions for external appends interleaved while another controller holds cleanup lock and inside a broad owned transcript lock.
  • Ordinary broad withSessionWriteLock() callbacks still refresh their own controller fence, but no longer publish a globally owned fingerprint for other released controllers to accept.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 19, 2026
@clawsweeper

clawsweeper Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs changes before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR adds owned session-file fingerprint publication and prompt-lock reacquisition so embedded attempts can tolerate same-process transcript writes while still rejecting external session-file edits.

Reproducibility: yes. Current main's source path rejects any prompt-release fingerprint drift, and the PR body/comments include real filesystem proof for the false-takeover path while the remaining stale-baseline hole is source-reproducible from the new trust map logic.

PR rating
Overall: 🧂 unranked krab
Proof: 🦞 diamond lobster
Patch quality: 🧂 unranked krab
Summary: Proof is strong, but the patch is not quality-ready because a stale trusted-baseline hole can leave the target session race unfixed.

Rank-up moves:

  • Add the stale-baseline regression described in the finding.
  • Repair the trust bookkeeping while preserving the external-edit fail-closed tests.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Sufficient (terminal): The PR body and follow-up comments provide terminal proof with real filesystem session files plus focused validation on the current changed paths.

Risk before merge

  • Merging as-is may leave the false session-takeover race unfixed after the first legitimate non-published local owned write advances the session file.
  • The PR overlaps the still-open append-validation approach at fix(agents): stop false-positive session-takeover on runner's own transcript appends #84046, so maintainers should keep one canonical session-fence model.
  • The repair must avoid broad trust of lock-held current state, because that would risk laundering external session-file edits into accepted owned fingerprints.

Maintainer options:

  1. Fix stale trusted baselines before merge (recommended)
    Add a regression where a legitimate non-published owned write advances the file before a later explicit publication, then update the trust bookkeeping so that publication is accepted without accepting external edits.
  2. Pause for the append-validation model
    If maintainers prefer the append-shape validation approach in fix(agents): stop false-positive session-takeover on runner's own transcript appends #84046, pause this PR rather than landing two competing ownership models.
Copy recommended automerge instruction
@clawsweeper automerge

Special instructions:
Add stale trusted-baseline regression coverage in `src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts`, then update `src/agents/pi-embedded-runner/run/attempt.session-lock.ts` so proven same-process owned local writes can advance the trusted baseline while external drift before or inside broad locks still fails closed.

Next step before merge
There is one narrow, source-backed blocker in the session-lock trust bookkeeping with a clear regression-test shape.

Security
Cleared: No supply-chain, credential, or permission issue was found; the session takeover trust-boundary concern is captured as a blocking functional finding.

Review findings

  • [P1] Advance trusted baselines after local owned writes — src/agents/pi-embedded-runner/run/attempt.session-lock.ts:191-195
Review details

Best possible solution:

Keep explicit publication narrow, but advance the trusted baseline for proven same-process owned states and add stale-baseline regressions alongside the existing external-edit fail-closed cases.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main's source path rejects any prompt-release fingerprint drift, and the PR body/comments include real filesystem proof for the false-takeover path while the remaining stale-baseline hole is source-reproducible from the new trust map logic.

Is this the best way to solve the issue?

No. The explicit-publication direction is plausible, but this implementation needs the trusted baseline repair before it is the narrow maintainable fix.

Label changes:

  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body and follow-up comments provide terminal proof with real filesystem session files plus focused validation on the current changed paths.

Label justifications:

  • P1: The PR targets an urgent embedded-agent session takeover regression that can abort replies and lose visible response state.
  • merge-risk: 🚨 session-state: The diff changes how released prompt attempts trust session transcript fingerprints across controllers.
  • merge-risk: 🚨 security-boundary: The diff adjusts the fail-closed boundary between OpenClaw-owned and unowned session-file writes.
  • rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🦞 diamond lobster, patch quality is 🧂 unranked krab, and Proof is strong, but the patch is not quality-ready because a stale trusted-baseline hole can leave the target session race unfixed.
  • status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action. Sufficient (terminal): The PR body and follow-up comments provide terminal proof with real filesystem session files plus focused validation on the current changed paths.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body and follow-up comments provide terminal proof with real filesystem session files plus focused validation on the current changed paths.

Full review comments:

  • [P1] Advance trusted baselines after local owned writes — src/agents/pi-embedded-runner/run/attempt.session-lock.ts:191-195
    After the first trusted fingerprint is recorded, this branch returns undefined for any later current fingerprint, even when the file was advanced by a legitimate local owned write such as refreshAfterOwnedSessionWrite() or a broad non-published withSessionWriteLock(). That leaves trustedSessionFileStates stuck on the old fingerprint; a later runWithOwnedSessionTranscriptWritePublication() sees its beforeWrite as untrusted, never records the owned append, and released peer controllers still throw EmbeddedAttemptSessionTakeoverError. Please add that stale-baseline regression and advance only proven same-process owned baselines without accepting external edits.
    Confidence: 0.88

Overall correctness: patch is incorrect
Overall confidence: 0.84

Acceptance criteria:

  • node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts src/config/sessions/transcript.test.ts -- --reporter=verbose
  • node scripts/run-tsgo.mjs -p test/tsconfig/tsconfig.core.test.json --incremental --tsBuildInfoFile .artifacts/tsgo-cache/core-test.tsbuildinfo
  • node scripts/run-tsgo.mjs -p tsconfig.core.json --incremental --tsBuildInfoFile .artifacts/tsgo-cache/core.tsbuildinfo
  • git diff --check

What I checked:

Likely related people:

  • vincentkoc: Introduced the current-main embedded session fence, local owned-write refresh, and transcript write context changes in the central files. (role: recent area contributor; confidence: high; commits: 2bb00f6726d4; files: src/agents/pi-embedded-runner/run/attempt.session-lock.ts, src/agents/pi-embedded-runner/run/attempt.ts, src/config/sessions/transcript-write-context.ts)
  • dr00-eth: Authored the related message-tool delivery session-lock fix that was merged through the ClawSweeper replacement path and is part of the same transcript ownership cluster. (role: adjacent feature contributor; confidence: medium; commits: 7ff20f6dac40, 65030f31649b; files: src/agents/pi-embedded-runner/run/attempt.session-lock.ts, src/config/sessions/transcript-write-context.ts)
  • jalehman: Added the PR follow-up for prompt-stream reacquisition after live cron/fallback logs and is assigned on this session-lock review path. (role: assigned reviewer and PR follow-up contributor; confidence: medium; commits: c6f8a9e1458e; files: src/agents/pi-embedded-runner/run/attempt.session-lock.ts, src/agents/pi-embedded-runner/run/attempt.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 23c58081d062.

@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 19, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. merge-risk: 🚨 security-boundary 🚨 May affect sandboxing, authorization, credentials, or sensitive data. labels May 19, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 19, 2026
@clawsweeper clawsweeper Bot added status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels May 19, 2026
@dr00-eth

Copy link
Copy Markdown

Cross-linking the related fixes I opened after tracing the live Discord duplicate-reply repro:

This PR appears to address a similar symptom from the locked-delivery-flush angle. In the live repro, the visible reply had already been delivered, but the tool result was later repaired as aborted after the delivery mirror advanced the same transcript outside the active prompt fence; fallback then replayed the user input and produced the duplicate visible reply. #84289 intentionally stays separate from this PR's approach so maintainers can compare which framing best matches expected product behavior.

@tianxiaochannel-oss88

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 19, 2026
@tianxiaochannel-oss88 tianxiaochannel-oss88 marked this pull request as ready for review May 20, 2026 01:22
@clawsweeper

clawsweeper Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🔥 Warming up: real-behavior proof passed; findings, security review, or rank-up moves are still in progress.

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.
What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 21, 2026
@jalehman jalehman requested a review from a team as a code owner May 21, 2026 22:01
@github-actions github-actions Bot added the dependencies-changed PR changes dependency-related files label May 21, 2026
@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation channel: discord Channel integration: discord channel: voice-call Channel integration: voice-call channel: whatsapp-web Channel integration: whatsapp-web gateway Gateway runtime extensions: diagnostics-otel Extension: diagnostics-otel cli CLI command changes labels May 21, 2026
@jalehman

Copy link
Copy Markdown
Contributor

Merged via squash.

Thanks @tianxiaochannel-oss88!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling merge-risk: 🚨 security-boundary 🚨 May affect sandboxing, authorization, credentials, or sensitive data. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: L status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released

3 participants