fix(agents): gate owned-write publish on pre-append fingerprint (#86572) by ubehera · Pull Request #86584 · openclaw/openclaw

ubehera · 2026-05-25T17:20:06Z

Summary

Converts the embedded session-takeover fence's post-write handling from a fail-open unconditional fence refresh into a fail-closed, trust-gated owned-write publish.

On current main, after a lane's own message persists, onMessagePersisted → refreshAfterOwnedSessionWrite() unconditionally re-reads the session file and trusts whatever is on disk as the new fence baseline. If an external mutation interleaves between releaseForPrompt (fence = F0) and the lane's own append (F1 → F2), that external mutation is silently laundered into the fence and the takeover detector never trips.

This PR captures the file fingerprint immediately before the lane's append (beforeMessagePersist) and records the post-write state as owned only when that pre-append fingerprint was trusted (matches the active fence, is a benign transcript-only advance, or is a previously-trusted state). Otherwise it refuses to record — so an interleaved external mutation can no longer be laundered, and the fence still trips.

Closes #86572. Complementary to #87159: that PR closes the cross-lane file-ownership race; this PR closes the same-lane publish boundary. The two enforce orthogonal invariants.

Fix shape (vs. #86572's proposal)

#86572 proposed hoisting the withOwnedSessionTranscriptWrites ALS scope to span agent.prompt(). This PR instead gates at the publish boundary (onMessagePersisted): pi's session writes already funnel through the guarded sessionManager.appendMessage seam, so the trust decision belongs in the controller that owns the fence, not in the ALS scope. This keeps the trust decision at the ownership boundary and avoids classifying unrelated listener writes as owned.

Rebase note (force-pushed, head `43baac80c5`)

Rebased onto current upstream/main, picking up:

#87159 — canonical session-file ownership indexing
#87409 — removed installSessionExternalHookWriteLock / installSessionEventWriteLock, moved hook locking into the controller

Production context — corrected

Correction (see @htuyer121's comment): an earlier revision of this PR cited @htuyer121's EmbeddedAttemptSessionTakeoverError corpus as evidence the class "continues firing after #87159." That framing was inaccurate and the reporter has corrected it.

All 30 of their events were on the pre-Fix embedded session file ownership race #87159 build (2026.5.26). Fix embedded session file ownership race #87159 reached their install via v2026.5.27 on 2026-05-29.
Since deploying Fix embedded session file ownership race #87159: zero events over ~6 days; former hot-spot jobs now run clean through to delivery.
Once Fix embedded session file ownership race #87159 closes the cross-lane boundary, their production window cannot isolate a same-lane-only race.

So there is no current production signal for the specific same-lane path this PR hardens. Per the reporter's own recommendation, this PR should be weighed on its source-level proof, not on that corpus. This is defense-in-depth hardening of the publish boundary — not a fix for an actively-firing production class.

Real behavior proof

Behavior addressed: Converts the post-owned-write fence update from fail-open (unconditional refresh that can launder an interleaved external mutation into the fence) to fail-closed (record-as-owned only when the pre-append fingerprint was trusted). The behavioral delta versus current main is the mixed-interleave case (Scenario D): main launders the external mutation into the fence; this PR refuses to, so the takeover still trips. Clean same-lane writes (Scenario B) behave identically to main.
Real environment tested: Source-level runtime harness on rebased branch 43baac80c5: macOS 26.5, Node 26.0.0, branched off current upstream/main (post-Fix embedded session file ownership race #87159 + fix(agents): move session write lock into owned session runtime #87409); loads the patched EmbeddedAttemptSessionLockController from source via dynamic import and drives four runtime scenarios against real temp session JSONL files via fs.appendFile and controller.withSessionWriteLock.
Exact steps or command run after this patch: git fetch origin && git checkout fix/session-takeover-refresh-before-hook-lock (confirm at 43baac80c5), then node --import tsx /Users/umank/code/openclaw-tickets/proof/run-refresh-before-hook-lock-proof.mjs /Users/umank/code/openclaw. Checked-in regression: node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts.
Evidence after fix: terminal capture from the rebased branch follows. Note: in the harness, Scenario A's "pre-fix" label means the controller is simply not informed of the write (neither main's refreshAfterOwnedSessionWrite nor this PR's publishOwnedPostMessageWrite is called) — it demonstrates the controller mechanics, not a literal main reproduction. The behavioral delta over main is Scenario D.

============================================================
PR #86584 — publish-on-onMessagePersisted proof (Issue #86572)
============================================================
source under test: src/agents/embedded-agent-runner/run/attempt.session-lock.ts
                   publishOwnedPostMessageWrite(beforeWrite)
                   gated on isTrustedSessionFileState(key, beforeWrite)

-------- Scenario A — pre-fix: pi-style write WITHOUT publish (trips takeover) --------
  [   0.13ms] setup      mode=pre-fix
  [   0.15ms] lifecycle  releaseForPrompt() — capture fence F0 + mark trusted
  [   0.47ms] pi-write   pi appendFileSync (F0 → F1) — no publish call
  [   0.60ms] hook       controller.withSessionWriteLock(beforeToolCallBody)
  [   1.62ms] hook       RESULT: TAKEOVER (EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: /var/folders/hw/b1qn975s06qg...)
  [   1.64ms] state      hasSessionTakeover() = true

-------- Scenario B — post-fix: pi-style write WITH publish (hook fires cleanly) --------
  [   0.05ms] lifecycle  releaseForPrompt() — capture fence F0 + mark trusted
  [   0.18ms] pi-write   pi appendFileSync (F0 → F1) with pre-write captured
  [   0.27ms] publish    publishOwnedPostMessageWrite(F0) — F0 is trusted, record F1 as owned
  [   0.34ms] hook       controller.withSessionWriteLock(beforeToolCallBody)
  [   0.46ms] hook       RESULT: hook fired successfully (hookOutcome.fired=true)
  [   0.47ms] state      hasSessionTakeover() = false

-------- Scenario C — negative: external write bypasses publish (still trips) --------
  [   0.12ms] pi-write   EXTERNAL appendFileSync (F0 → F1) — no publish
  [   0.34ms] hook       RESULT: TAKEOVER (EmbeddedAttemptSessionTakeoverError)
  [   0.35ms] state      hasSessionTakeover() = true

-------- Scenario D — mixed interleave: external THEN pi write + publish (still trips) --------
  [   0.12ms] external   EXTERNAL appendFileSync first (F0 → F1)
  [   0.20ms] pi-write   pi appendFileSync second (F1 → F2) — pre-write captured = F1
  [   0.28ms] publish    publishOwnedPostMessageWrite(F1) — F1 NOT trusted, skip (fail closed)
  [   0.50ms] hook       RESULT: TAKEOVER (EmbeddedAttemptSessionTakeoverError)
  [   0.51ms] state      hasSessionTakeover() = true

============================================================
Summary
============================================================
  Scenario A (no fix):       hook fired=false, threw=TAKEOVER
  Scenario B (with fix):     hook fired=true, threw=no
  Scenario C (negative):     hook fired=false, threw=TAKEOVER
  Scenario D (mixed):        hook fired=false, threw=TAKEOVER

RESULT: PASS — all four scenarios behaved as expected.

Observed result after fix: Scenarios A/B/C demonstrate the gate mechanics; Scenario D is the behavioral improvement over main — main's unconditional refresh would record the combined F2 (external + pi) state as the trusted fence and silently absorb the external mutation, whereas the trust gate sees the untrusted pre-append baseline (F1) and refuses to record, so the takeover fence still fires. The checked-in regression covers the publish / skip / mixed-interleave cases directly.
What was not tested:
1. A live runtime reproduction against @htuyer121's corpus on the rebased branch — that corpus is pre-Fix embedded session file ownership race #87159 (see corrected timeline above), so it cannot corroborate the post-Fix embedded session file ownership race #87159 same-lane path.
2. Known limitation: the trust gate captures the pre-append fingerprint immediately before appendFileSync. A concurrent external writer appending benign-looking transcript JSONL inside that narrow window could still be absorbed. Preventing concurrent writers to the same session file is the responsibility of Fix embedded session file ownership race #87159's file-ownership boundary, not this publish-boundary gate.

Compatibility / Risk

publishOwnedPostMessageWrite(beforeWrite) is an additive method on EmbeddedAttemptSessionLockController.
beforeMessagePersist is an additive optional opt on the session guard. Backward compatible.
The trust gate is fail-closed: any baseline mismatch refuses to record. Cannot launder external mutations.

Issue Hoist withOwnedSessionTranscriptWrites ALS scope to span agent.prompt() to fix vanilla-openclaw same-lane fence trip #86572 — root cause and acceptance criteria (this PR takes the publish-boundary fix shape, not the proposed ALS-scope hoist)
Fix embedded session file ownership race #87159 — joshavant's session-file ownership (complementary, landed)
fix(agents): move session write lock into owned session runtime #87409 — moved hook locking into the controller (this PR rebased on top)
EmbeddedAttemptSessionTakeoverError cluster on 2026.5.22 — 13 events / 7 jobs / 42h #86845 — @htuyer121's corpus (pre-Fix embedded session file ownership race #87159; see corrected timeline above)

This PR is AI-assisted. Code authored with Claude.

clawsweeper · 2026-05-25T17:22:39Z

Codex review: needs maintainer review before merge. Reviewed June 2, 2026, 11:01 AM ET / 15:01 UTC.

Summary
The branch replaces unconditional embedded session-fence refresh after message persistence with a pre-append fingerprint-gated owned-write publish and adds controller regression coverage.

PR surface: Source +126, Tests +60. Total +186 across 5 files.

Reproducibility: yes. source-level. Current main unconditionally refreshes the fence after persistence, and the PR's terminal harness plus checked-in tests model the F0 to external F1 to pi F2 mixed interleave.

Review metrics: 1 noteworthy metric.

Session trust boundary: 1 fail-open refresh replaced; 3 trust-gate regression cases added. The measured change is small but directly affects which session fingerprints are accepted as owned before merge.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

[P1] Have an agents/session owner explicitly accept the fail-closed trust-boundary change.
Refresh exact-head checks if main moves again before landing.

Risk before merge

[P1] Merging changes which session-file fingerprints are trusted as owned during an embedded attempt; a wrong trust decision could hide an external transcript mutation or leave same-lane false takeover behavior unfixed.
[P1] The corrected production corpus is pre-Fix embedded session file ownership race #87159, so the post-Fix embedded session file ownership race #87159 same-lane value rests on source-level proof and checked-in regression coverage rather than live recurrence.

Maintainer options:

Accept Boundary With Owner Signoff (recommended)
Have an agents/session owner explicitly accept the fail-closed trust decision and land with the current head checks.
Ask For Corpus Replay
If maintainers want stronger proof, replay the offered anonymized JSONL corpus while noting it cannot isolate a post-Fix embedded session file ownership race #87159 same-lane-only failure.
Pause If Ownership Fix Is Enough
If maintainers decide Fix embedded session file ownership race #87159 fully owns the remaining failure class, pause or close this hardening PR and keep newer live reports canonical.

Next step before merge

No automated repair is indicated; the remaining action is maintainer acceptance of the session-state trust-boundary change.

Security
Cleared: No concrete security or supply-chain concern was found; the diff does not add dependencies, CI execution, secret handling, or broader permissions.

Review details

Best possible solution:

Land the fail-closed publish boundary only after agents/session owner acceptance, keeping the regression coverage and the linked canonical issue open until the PR merges.

Do we have a high-confidence way to reproduce the issue?

Yes, source-level. Current main unconditionally refreshes the fence after persistence, and the PR's terminal harness plus checked-in tests model the F0 to external F1 to pi F2 mixed interleave.

Is this the best way to solve the issue?

Yes, likely. Gating at the publish boundary is narrower than hoisting the ALS scope because it uses the session-manager append seam and avoids broadly classifying unrelated listener writes as owned.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 6c7644268f59.

Label changes

Label justifications:

P1: The PR targets an embedded-agent session-state failure mode that can interrupt scheduled or channel-driven agent workflows.
merge-risk: 🚨 session-state: The diff changes the trusted fingerprint boundary for session JSONL writes during embedded attempts.
rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body supplies after-fix terminal output from a source-level runtime harness with real temp session JSONL files, plus checked-in regression and changed-gate evidence in comments.
proof: sufficient: Contributor real behavior proof is sufficient. The PR body supplies after-fix terminal output from a source-level runtime harness with real temp session JSONL files, plus checked-in regression and changed-gate evidence in comments.

Evidence reviewed

PR surface:

Source +126, Tests +60. Total +186 across 5 files.

View PR surface stats

Area	Files	Added	Removed	Net
Source	4	139	13	+126
Tests	1	78	18	+60
Docs	0	0	0	0
Config	0	0	0	0
Generated	0	0	0	0
Other	0	0	0	0
Total	5	217	31	+186

What I checked:

Root policy read: Root AGENTS.md and scoped agent runner AGENTS.md were read; they require whole-path review for agent/session changes and favor focused embedded-runner helper tests for this surface. (AGENTS.md:1, 6c7644268f59)
Current main behavior: Current main still wires every persisted message to refreshAfterOwnedSessionWrite, which refreshes the active fence from the current session file fingerprint after the append. (src/agents/embedded-agent-runner/run/attempt.ts:1988, 6c7644268f59)
PR publish boundary: The PR captures beforeMessagePersist immediately before originalAppend and passes that snapshot to publishOwnedPostMessageWrite after persistence. (src/agents/session-tool-result-guard.ts:614, b06e2b72d0d8)
PR trust gate: publishOwnedPostMessageWrite records the post-write fingerprint only when the pre-write state matches the active fence, a benign advance, or a previously trusted fingerprint; otherwise it skips the publish. (src/agents/embedded-agent-runner/run/attempt.session-lock.ts:954, b06e2b72d0d8)
Regression coverage: The PR adds three trust-gate tests for skipped publish, trusted publish, and mixed external-then-pi interleave, matching the source-level failure mode. (src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts:1469, b06e2b72d0d8)
Latest checks: GitHub check runs for head b06e2b7 show 138 success, 24 skipped, and 1 neutral CodeQL configuration warning; no failed check runs were found in the paginated API result. (b06e2b72d0d8)

Likely related people:

vincentkoc: Authored the embedded session write-fence commit that introduced the current fingerprint/takeover behavior this PR changes. (role: feature owner; confidence: high; commits: 2bb00f6726d4; files: src/agents/embedded-agent-runner/run/attempt.session-lock.ts)
joshavant: Authored the merged session-file ownership race fix that this PR explicitly complements. (role: adjacent feature owner; confidence: high; commits: 3349fe21bbde; files: src/agents/embedded-agent-runner/run/attempt.session-lock.ts, src/agents/embedded-agent-runner/run/attempt.ts)
steipete: Authored recent merged work moving the session write-lock seam into the owned session runtime and the broad agent-runtime internalization that renamed this path. (role: recent refactor author; confidence: high; commits: 5f68291f4f54, bb46b79d3c14; files: src/agents/embedded-agent-runner/run/attempt.session-lock.ts, src/agents/embedded-agent-runner/run/attempt.ts, src/agents/sessions/agent-session.ts)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

ubehera · 2026-05-25T17:27:11Z

@clawsweeper re-review

PR body updated with harness output (run-refresh-before-hook-lock-proof.mjs) showing deterministic before/after on a real temp session JSONL: pre-fix throws EmbeddedAttemptSessionTakeoverError, post-fix hook fires cleanly with hasSessionTakeover()=false.

clawsweeper · 2026-05-25T17:27:14Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26412335751
Updated: 2026-05-25T17:44:42.441Z

clawsweeper · 2026-05-25T17:34:58Z

ClawSweeper PR egg

✨ Hatched: 🌱 uncommon Gilded Proofling

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

Merged PRs are hatchable.
Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🌱 uncommon.
Trait: polishes edge cases.
Image traits: location diff observatory; accessory release bell; palette pearl, teal, and neon green; mood curious; pose peeking out from the egg shell; shell frosted glass shell; lighting moonlit rim light; background soft code-shaped tiles.
Share on X: post this hatch
Copy: My PR egg hatched a 🌱 uncommon Gilded Proofling in ClawSweeper.

What is this egg doing here?

Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

ubehera · 2026-05-25T17:35:17Z

Heads-up: the checks-node-agentic-agents failure on this PR is a pre-existing flake on main, not caused by this PR.

The failing test is src/agents/model-catalog-visibility.test.ts:29 > resolveVisibleModelCatalog > limits visible catalog to provider wildcard entries after default discovery — Error: Test timed out in 120000ms.

The same test, same line, same timeout fails on main CI run 26412030433 (push fix(security): audit Claude permission overrides under YOLO (#86557)) at 17:18Z — 13 minutes before this PR's run. The most recent successful main run (#86571 at 17:24Z) happens to pass it; the flake is timing-dependent.

This PR touches src/agents/pi-embedded-runner/run/attempt.session-lock.ts (+1 line) and the matching test file — completely separate code path from the model-catalog-visibility test. Local node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts exits 0.

Per AGENTS.md "If [a failing check is] unrelated on latest origin/main, say so with scoped proof" — flagging this so it doesn't become a maintainer-side speed bump.

…gated) Replaces the prior refreshBeforeLock approach (91ba92c) per ClawSweeper review on PR openclaw#86584. The earlier design used a blanket refreshAfterOwnedSessionWrite() right before each external-hook lock acquisition; that approach could silently accept genuine external mutations because it advanced the fence without proving the new state was the lane's own write. New approach: route pi's _persist writes through the existing ownedSessionFileWrites machinery instead of blind fence refreshes. - Add EmbeddedAttemptSessionLockController.publishOwnedPostMessageWrite() (sync). Called from the onMessagePersisted callback right after pi's sessionManager.appendMessage -> _persist -> appendFileSync completes. Records the post-write fingerprint as an OWNED write in the process-shared ownedSessionFileWrites map so subsequent assertSessionFileFence calls accept the lane's own writes via the owned-write match path. - Trust-gate: publishOwnedPostMessageWrite only records the new state when the prior fenceFingerprint is in trustedSessionFileStates (set at releaseForPrompt time). External mutations that bypass the callback are never recorded as owned and continue to trip EmbeddedAttemptSessionTakeoverError correctly. - Revert installSessionExternalHookWriteLock's refreshBeforeLock param and the shared waitBeforeLock closure. Each installLockableFunction call goes back to waitBeforeLock: () => waitForSessionEventQueue(session) directly. - runEmbeddedAttempt's onMessagePersisted callback now calls sessionLockController.publishOwnedPostMessageWrite() instead of refreshAfterOwnedSessionWrite(). Tests (5 cases under "embedded attempt session lock lifecycle"): - "accepts pi-style writes published via publishOwnedPostMessageWrite before beforeToolCall fires" — positive case. - "trips takeover on a same-file external write that bypasses publishOwnedPostMessageWrite" — NEGATIVE case the bot explicitly required ("add a negative test for an external same-file append before a hook lock"). Verifies fail-closed invariant preserved. - "allows multiple prompt turns with pi-style writes published per turn" — multi-turn case, three iterations, no false takeover. - "does not classify a different session file's writes as owned by this controller" — cross-file isolation. - "does not hang when reacquireAfterPrompt rejects after a pi-style direct write" — abort/error path. 78 cases across 2 test files green; pnpm check:changed clean. Closes openclaw#86572.

ubehera · 2026-05-25T18:12:22Z

Redesigned per ClawSweeper review on 91ba92c9ab (force-pushed to d824be0298).

Change in approach:

Old (91ba92c9ab): installSessionExternalHookWriteLock got a refreshBeforeLock param that called refreshAfterOwnedSessionWrite() before each hook acquired the lock. That blanket-refreshed the fence even for genuine external mutations, weakening takeover detection.
New (d824be0298): A new publishOwnedPostMessageWrite() method is wired into onMessagePersisted (the existing pi _persist callback point). It records the post-write fingerprint as owned in ownedSessionFileWrites, gated on isTrustedSessionFileState(sessionFileFenceKey, fenceFingerprint). External mutations that bypass the callback never get published; the fence still trips on them via assertSessionFileFence's existing owned-write match path.

Addresses bot guidance verbatim:

"Please only advance the fence for writes proven to be this controller's own pi/session-manager persistence, and add a negative test for an external same-file append before a hook lock."

Trust gate: The new method only records when the prior fenceFingerprint is in trustedSessionFileStates (set at releaseForPrompt time). Same-file external mutations cannot bypass this — they don't go through onMessagePersisted.
Negative test added: trips takeover on a same-file external write that bypasses publishOwnedPostMessageWrite — verifies the fail-closed invariant.

Proof: Three-scenario harness covers pre-fix bug, post-fix success, and the negative case. Verbatim output in the PR body.

@clawsweeper re-review

clawsweeper · 2026-05-25T18:12:26Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26413924744
Updated: 2026-05-25T18:22:25.140Z

ubehera · 2026-05-25T18:25:46Z

CI status: checks-node-core-runtime-cron-service fails on the unrelated test src/cron/service/timer.regression.test.ts:989 — records per-job start time and duration for batched due jobs. Assertion: expected 70 to be 50 // Object.is equality.

This is a timing-sensitive flake in the cron-service batched-due-jobs path (uses a mutable let now shared between two synchronous mock job runs). It has no dependency on the files this PR touches (src/agents/pi-embedded-runner/run/attempt.session-lock.ts, attempt.session-lock.test.ts, attempt.ts).

All other checks pass on d824be0298, including checks-node-agentic-agents (which covers the embedded-runner test surface this PR modifies).

ubehera · 2026-05-25T19:59:12Z

Force-pushed 584ba0e65b addressing ClawSweeper's correctness finding on d824be0298.

The bot's exact concern, addressed:

"This records the current fingerprint as owned whenever the previous fence fingerprint was trusted. If another lane/process appends to the same session file after releaseForPrompt() and then this lane appends its pi message before onMessagePersisted() runs, current contains both writes; recording that combined fingerprint makes the later hook-lock fence accept the external mutation as owned."

Fix shape:

publishOwnedPostMessageWrite() now takes a beforeWrite: SessionFileFingerprint | undefined argument and gates trust on the pre-append fingerprint, not on the current fenceFingerprint.
The pre-append fingerprint is captured by a new beforeMessagePersist hook in session-tool-result-guard, invoked before originalAppend(message). The snapshot flows back to the caller through onMessagePersisted's new context arg (typed unknown to keep the guard module-agnostic).
runEmbeddedAttempt wires beforeMessagePersist: () => readSessionFileFingerprintSync(params.sessionFile) and forwards the snapshot to publishOwnedPostMessageWrite(beforeWriteSnapshot).

Mixed-interleaving negative test added:

it("trips takeover on a mixed external-then-pi append (publish refuses to launder external mutation)", async () => {
  await controller.releaseForPrompt();                            // F0 trusted
  await fs.appendFile(sessionFile, '...external...', "utf8");     // F0 -> F1
  const beforeWrite = readSessionFileFingerprintSync(sessionFile);// F1
  await fs.appendFile(sessionFile, '...pi-after-external...');    // F1 -> F2
  controller.publishOwnedPostMessageWrite(beforeWrite);           // F1 NOT trusted: skip
  await expect(session.agent.beforeToolCall()).rejects.toBeInstanceOf(
    EmbeddedAttemptSessionTakeoverError,                          // fail closed
  );
});

Proof harness updated to a 4-scenario harness; Scenario D is the mixed-interleave fail-closed verification. All 4 PASS. Verbatim stdout in the PR body.

80 tests across the session-lock surface green (+1 from the new mixed-interleave test). 46 cases across the session-tool-result-guard surfaces also green — no regressions from the new beforeMessagePersist opt.

@clawsweeper re-review

clawsweeper · 2026-05-25T19:59:14Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26417580857
Updated: 2026-05-25T20:09:16.503Z

ubehera · 2026-05-29T12:57:54Z

@clawsweeper re-review

PR body has been refreshed since the last review pass:

Rebased onto current upstream/main (head d780c41bf1, post-Fix embedded session file ownership race #87159 + fix(agents): move session write lock into owned session runtime #87409)
Real behavior proof CI check is now passing (it was previously flagging the body's code-block formatting; reformatted to column 0)
Centered @htuyer121's #86845 production corroboration as the live-install evidence layer
Reframed architecturally as defense-in-depth complementary to Fix embedded session file ownership race #87159 (orthogonal invariants: cross-lane vs same-lane)

Asking for fresh consideration of the rating now that the Real behavior proof gate is green.

clawsweeper · 2026-05-29T12:57:58Z

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26638611330
Updated: 2026-05-29T13:05:16.808Z

@jalehman

* fix(agents): skip fallback for session coordination errors Preserve provider fallback metadata when session coordination errors are nested under provider failures. Co-authored-by: luyao618 <364939526@qq.com> (cherry picked from commit 6a5a135) * fix(agents): tolerate in-process session writes during prompt release (openclaw#84250) Merged via squash. Prepared head SHA: 33f88fe Co-authored-by: tianxiaochannel-oss88 <272340815+tianxiaochannel-oss88@users.noreply.github.com> Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com> Reviewed-by: @jalehman (cherry picked from commit 1b77145) * fix(agents): bound embedded compaction write locks Fixes the embedded attempt session write-lock watchdog so the fallback max hold time follows the resolved compaction timeout plus the existing lock grace window, instead of inheriting the full run timeout. Adds regression coverage for the helper and settled-compaction lock lifecycle, plus a changelog entry thanking @luoyanglang. Verification: - `pnpm test src/agents/session-write-lock.test.ts src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts` - `pnpm check:changed` via Blacksmith Testbox `tbx_01ks8b6vn8se5cg1dfn3te3g47` / https://github.com/openclaw/openclaw/actions/runs/26301988670 - Autoreview clean: `/Users/steipete/Projects/agent-scripts/skills/autoreview/scripts/autoreview --mode branch --base origin/main` - PR CI green on `79e8c5f1a637981d263c0268bf5666967ff4e778`: https://github.com/openclaw/openclaw/actions/runs/26302152844 and https://github.com/openclaw/openclaw/actions/runs/26302152798 Co-authored-by: luoyanglang <hanwanlonga@gmail.com> (cherry picked from commit 46de078) * fix(session-lock): enforce maxHoldMs in shouldReclaim during lock acquisition (openclaw#85764) * fix(session-lock): enforce maxHoldMs in shouldReclaim during lock acquisition - Adds optional maxHoldMs parameter to inspectLockPayload - Inspect now marks locks as stale when held longer than maxHoldMs - Passes maxHoldMs through inspectLockPayloadForSession - acquireSessionWriteLock's shouldReclaim callback now passes maxHoldMs This ensures that when a live process holds a lock for longer than maxHoldMs (default 5min), other processes can reclaim it during acquisition — matching the watchdog's existing enforcement. Previously shouldReclaim only used staleMs (30min default), meaning a lock held for 10+ minutes by a live PID would never be reclaimable, causing 60s timeout failures and gateway freezes. Closes openclaw#85762 * fix(session-lock): add dead-PID fast-path before retry loop Adds a fast-path check at the top of acquireSessionWriteLock: if the lock file's owner PID is dead, remove it immediately before entering the retry loop. This saves up to timeoutMs (60s) of futile waiting when the previous lock holder has died. The shouldReclaim callback already handles this case, but only iteratively through the retry loop. The fast-path eliminates that unnecessary delay. * fix(session-lock): enforce max hold during acquisition * fix(session-lock): revalidate max hold safely * fix(session-lock): honor holder max-hold policy * fix(session-lock): keep cleanup from reclaiming live holders * fix(session-lock): remove stale locks only when unchanged * fix(session-lock): skip self-held max-hold reclaim * fix(ci): refresh gateway protocol checks --------- Co-authored-by: njuboy11 <njuboy11@users.noreply.github.com> Co-authored-by: Peter Steinberger <steipete@gmail.com> (cherry picked from commit a1eb765) * fix(embedded-runner): preserve provider errors on cleanup takeover (openclaw#84321) Summary: - The PR preserves provider-facing embedded-runner prompt errors when cleanup detects session takeover, keeps the takeover signal fatal for fallback, and adds focused regressions. - PR surface: Source +52, Tests +92. Total +144 across 5 files. - Reproducibility: yes. Source inspection shows current main can let cleanup takeover replace a prior prompt/p ... rror and can normalize a provider-looking takeover wrapper before fallback sees it as coordination failure. Automerge notes: - PR branch already contained follow-up commit before automerge: fix(embedded-runner): preserve takeover during fallback - PR branch already contained follow-up commit before automerge: fix(clawsweeper): address review for automerge-openclaw-openclaw-8405… Validation: - ClawSweeper review passed for head 050c779. - Required merge gates passed before the squash merge. Prepared head SHA: 050c779 Review: openclaw#84321 (comment) Co-authored-by: abnershang <abner.shang@gmail.com> Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com> Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com> Approved-by: takhoffman Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com> (cherry picked from commit 7fbca96) * fix(agents): release embedded-attempt session lock on every exit path (openclaw#86427) * fix(agents): release embedded-attempt session lock on every exit path The embedded run controller acquires its session write lock eagerly at creation and released it only inside the post-run cleanup block. An exception thrown in post-prompt processing skipped that block, so the lock leaked to the live gateway process until the watchdog reclaimed it and later requests to the session failed with SessionWriteLockTimeoutError. Add an idempotent dispose() to the lock controller and call it from the run's outer finally so the eagerly-held lock is released on every exit path. Normal/aborted/timed-out runs still hand the lock to acquireForCleanup first, so dispose() is a no-op then (no double release). Fixes openclaw#86014 * fix: keep session lock teardown comment lean * docs(changelog): note embedded session lock fix --------- Co-authored-by: Peter Steinberger <steipete@gmail.com> (cherry picked from commit 32ddfc2) * fix(agents): fence yield abort lock release (cherry picked from commit 0fe7479) * fix(agents): memoize session lock owner args Memoize owner process argv lookups per PID during `cleanStaleLockFiles`, and yield between lock entries so startup cleanup does not monopolize the event loop while inspecting many session locks. This keeps lock classification semantics unchanged while avoiding repeated synchronous process-args reads for lock clusters owned by the same PID, especially the Windows PowerShell path. Fixes openclaw#86509. Verification: - `git diff --check origin/main...HEAD` - focused TSX harness against the current-main merge result: `session-lock memo regression harness passed` Thanks @openperf. Co-authored-by: openperf <16864032@qq.com> (cherry picked from commit c430fcd) * fix(diagnostics): recover orphaned session activity Recover idle queued sessions whose diagnostic activity retained stale ownerless model or tool calls by classifying them as recoverable session.stuck after the usual recovery gates. Yield the event loop before stale session-lock process inspection so sync process lookup cannot monopolize lock contention paths. Docs now describe the widened session.stuck telemetry contract for recoverable stale bookkeeping, including ownerless activity. Thanks @samuelsoaress. Refs openclaw#84903. Co-authored-by: samuelsoaress <samuelsoares177778@gmail.com> (cherry picked from commit 286964c) * [FORK][openclaw#86584] gate owned-write publish on pre-append fingerprint (fixes openclaw#86572) Carries unmerged upstream PR openclaw#86584 (HEAD d79a3b4) onto the boon 5.18 base as the same-lane EmbeddedAttemptSessionTakeoverError fence fix for long cron turns. Fails closed: an external mutation before pi's append fails the trust gate and still trips the fence (verified by the PR's 303-line test suite incl. the mixed-interleave negative test). Backfills base symbols openclaw#86584 assumes (introduced upstream between 5.18 and the PR base, not carried by the 9 merged race-fix picks): - session-lock.ts: MAX_BENIGN_SESSION_FENCE_{ADVANCE,REWRITE,REWRITE_RESULT}_BYTES, MAX_SAFE_FILE_OFFSET, TRANSCRIPT_ONLY_OPENCLAW_ASSISTANT_MODELS, SessionFileFenceSnapshot type, fenceSnapshot state var, ActiveWriteLockState type + activeWriteLock store fix (reuse nested writes via {active:true}), node:util + string-normalization imports. - transcript-append.ts: wrap appendSessionTranscriptMessage in runWithOwnedSessionTranscriptWriteLock so low-level appends acquire the owned-context lock. - test import fixes (appendSessionTranscriptMessage, withOwned/bindOwned, __testing). Drop when upstream merges openclaw#86584. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * [FORK][openclaw#86584] wire owned-transcript-write context + typecheck cleanup CRITICAL: wrap promptActiveSession in withOwnedSessionTranscriptWrites and bind onBlockReply/onBlockReplyFlush to the owned context in attempt.ts. Without this, pi's own transcript appends during a prompt are NOT recorded as owned, so the fence trips on them (the exact takeover the backport is meant to prevent). This wiring is an intermediate-base feature (between 5.18 and openclaw#84250's base) the merged picks didn't carry. Tests passed before only because they set the context manually. Also: add releaseHeldLockForAbort to the controller type; drop incidental non-fence suppressAssistantErrorPersistence passes; remove dead async benign-rewrite cluster (sessionFence{Advance,Rewrite}IsBenign + readAppendedSessionFileText + lineMatchesLinearTranscriptMigration + helpers) — our openclaw#84250-based assertSessionFileFence uses the sync owned-write path, so the async benign-detection variants are unreachable. tsgo core: 0 errors. 384 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * [FORK][openclaw#86584] address codex review: prefix-validate benign advance + preserve provider error Finding 2 (masking gap, P2): sessionFenceAdvanceIsBenignSync only validated the APPENDED bytes, so a writer that rewrote the existing prefix AND appended a benign delivery-mirror/gateway-injected line could be laundered as an owned advance — masking a genuine external takeover (silent message loss). Now fail closed unless the current prefix is byte-identical to the trusted readSessionFileFenceSnapshot text (readSessionFilePrefixSync); absent snapshot text => not benign. Finding 1 (provider-error masking, P2): wrappedStreamFn's finally let a reacquireAfterPrompt() takeover error mask the original provider error when the stream itself threw. Now only surface the reacquire error when the stream succeeded; otherwise preserve the original failure. tsgo core: 0 errors. 384 tests pass (benign-advance acceptance + external-mutation rejection both green). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(release): 2026.5.18-boon.1 — session-takeover hardening (boon fleet build) Version bump + CHANGELOG for the fork build. Also fixes a backport test-import gap: attempt.test.ts referenced `attemptTesting` (the __testing export) without importing it. Full project typecheck (tsgo -b tsconfig.projects.json): 0 errors. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(ci): no-unsafe-finally in wrappedStreamFn + drop collateral protocol/test churn - wrappedStreamFn: restructure provider-error-preservation without a throw inside finally (oxlint no-unsafe-finally). Same semantics: always reacquire; prefer the original stream error over a reacquire takeover error; surface reacquire error only when the stream succeeded. - Revert src/gateway/server-methods/agent.test.ts + GatewayModels.swift to the 5.18 baseline: the openclaw#85764 cherry-pick conflict-resolution had pulled in openclaw#85256-era internal-session-effect tests + protocol fields whose implementation isn't in this backport, breaking checks-node-agentic-gateway-methods + checks-fast-bundled-protocol. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix: remove vestigial onAssistantErrorMessagePersisted option decls Address cubic P2 review (PR #2): the option was declared on the guard and guard-wrapper option types but never forwarded or invoked, so any provided callback was silently ignored. The companion error-suppression feature (suppressAssistantErrorPersistence + the agent-runner/followup caller chain) is deliberately scoped OUT of this 5.18 backport, so the decls were dead plumbing left over from a cherry-pick. Remove them to keep the option surface honest; the load-bearing beforeMessagePersist fence checkpoint (openclaw#86572) is retained. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Yao <364939526@qq.com> Co-authored-by: xiaotian <tianxiaochannel@gmail.com> Co-authored-by: 狼哥 <hanwanlonga@gmail.com> Co-authored-by: njuboy <njuboy11@gmail.com> Co-authored-by: njuboy11 <njuboy11@users.noreply.github.com> Co-authored-by: Peter Steinberger <steipete@gmail.com> Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com> Co-authored-by: abnershang <abner.shang@gmail.com> Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com> Co-authored-by: Chunyue Wang <80630709+openperf@users.noreply.github.com> Co-authored-by: openperf <16864032@qq.com> Co-authored-by: Samuel Soares da Silva <samuelsoares177778@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…-pick

…squashed)

ubehera · 2026-06-02T08:30:18Z

@clawsweeper re-review

Force-pushed 43baac80c5 — rebased onto current upstream/main and added the checked-in regression coverage flagged in the last pass (the "0 new test cases added in the latest diff" metric / maintainer option "Ask For Checked-In Regression Coverage").

Added (src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts): a new describe("publishOwnedPostMessageWrite trust gate (#86572)") block with 3 cases that drive the publish path directly:

skips publish → trips — a same-lane write that never calls publishOwnedPostMessageWrite still trips the fence (fail-closed default; publish is the required opt-in).
publish + trusted baseline → accepts — pre-append fingerprint F0 is trusted, so the lane's own write is recorded as owned and the hook fires cleanly (hasSessionTakeover() === false).
mixed interleave → still trips — an external write advances F0→F1 first, so the captured baseline (F1) was never trusted; publishOwnedPostMessageWrite refuses to record the combined F2 state and the takeover fence still fires. This is the same-lane/mixed-interleave scenario from the proof harness, now checked in.

Harness Scenario C (pure external advance) is intentionally not duplicated — it is already covered by the existing rejects post-prompt writes when another owner advances the session file test.

Verification on the rebased branch (43baac80c5):

node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts → 45/45 pass.
Full touched surface (attempt.session-lock + session-tool-result-guard{,.transcript-events,.tool-result-persist-hook}) → 113/113 pass.
oxlint exit 0; oxfmt --check clean.
Mutation-checked for teeth: test 2 fails if the owned-write recording is removed; test 3 fails if the fail-closed trust gate is removed — confirming each test guards the fix rather than restating current behavior.

No source behavior changed in this push — the diff vs the prior head is the rebase plus the three test cases.

clawsweeper · 2026-06-02T08:30:21Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26808434401
Updated: 2026-06-02T08:43:37.445Z

ubehera · 2026-06-02T09:35:44Z

Latest-head broad gate — pnpm check:changed green on Linux

Ran the current-head broad changed gate (the latest-head verification flagged in the last review) on 43baac80c5, via crabbox local-container (Linux, ubuntu:26.04, Node 24), scoped to this PR's 5 changed files:

node scripts/crabbox-wrapper.mjs run --provider local-container -- \
  env CI=1 OPENCLAW_TEST_PROJECTS_PARALLEL=6 OPENCLAW_VITEST_MAX_WORKERS=1 \
  OPENCLAW_VITEST_NO_OUTPUT_TIMEOUT_MS=900000 pnpm check:changed

Result: exit=0, 0 failures (command 3m08s / total 5m36s). Changed-gate scope confirmed = exactly the 5 changed files (lanes=core, coreTests); no unrelated drift. Green steps:

typecheck core + typecheck core tests
lint core changed files — Found 0 warnings and 0 errors
dependency pin guard (464 specs, 0 violations), package patch guard
conflict markers, changelog attributions, plugin-sdk / extension wildcard re-export guards, duplicate-coverage, runtime import cycles, media / sidecar / webhook / pairing guards

Test execution (separate from check:changed): focused run-vitest on the touched surface is green — attempt.session-lock.test.ts 45/45 (including the 3 new publishOwnedPostMessageWrite trust-gate cases) plus the three session-tool-result-guard suites; 103 tests across 4 files. All GitHub checks on 43baac80c5 are green as well.

Proof note: this ran on a local-container Linux runner (Node 24, CI-matching), not the brokered runner. A maintainer can re-run the brokered broad gate for the canonical environment if desired.

htuyer121 · 2026-06-02T12:48:57Z

Timeline correction on the production corroboration attributed to me — my 30-event corpus is PRE-#87159 in our install, not "post-#87159 still firing."

Thanks for carrying the production data. One clarification I should make before it weighs on the merge, because the current PR body reads our corpus as the class "continues firing in real installs after #87159 landed" / "post-#87159 code" — that isn't accurate for our install:

Fix embedded session file ownership race #87159 merged upstream 05-27, but it didn't reach us until we upgraded to v2026.5.27 on 2026-05-29. All 30 of our EmbeddedAttemptSessionTakeoverError events were on the prior 2026.5.26 build (no Fix embedded session file ownership race #87159) — the last one 2026-05-27 11:03 UTC.
Since deploying Fix embedded session file ownership race #87159 (05-29) the signature is zero in our install — ~6 days now, still 20 distinct on-disk events total (all pre-upgrade), verified three ways against the raw cron ledgers. Former hot-spot jobs now run clean through to delivery, so the fence is being exercised and holding.

So our corpus documents the class while it was active pre-#87159; it isn't evidence that it persists after #87159. The "rate accelerating / spread widening" in my 05-28 comment was the tail of the pre-upgrade period and shouldn't be read as current.

What our data can't do is corroborate that the same-lane path this PR targets is still firing post-#87159 — once the cross-lane boundary is closed, our production window can't isolate a same-lane-only race. So I'd weight this PR on its own source-level proof (the 4-scenario harness — A trips pre-fix, B clean post-fix, C/D fail-closed — independently demonstrates the same-lane race and the trust gate's fail-closed behavior), rather than on our corpus as a post-#87159 signal. The 30 anonymized session .jsonl files (all pre-#87159) remain available if a diff-driven repro is still useful.

…t to publish path (openclaw#86572)

ubehera · 2026-06-02T14:54:04Z

Thanks @htuyer121 — appreciate the careful correction, and you're right. I've updated the PR body accordingly: the 30-event corpus is now described as pre-#87159 (build 2026.5.26), and the "continues firing after #87159" framing is gone. The "Production context" section now leads with your correction — zero events over ~6 days post-#87159, and the point that a same-lane-only race can't be isolated from production once #87159 closes the cross-lane boundary — and explicitly weights the PR on its source-level proof rather than the corpus, per your recommendation.

For clarity on what the PR now claims: defense-in-depth hardening of the same-lane publish boundary (fail-open unconditional fence refresh → fail-closed trust gate), not a fix for an actively-firing production class. The meaningful source-level scenario is the mixed-interleave case (an external mutation interleaved with the lane's own write being laundered into the fence), which the trust gate refuses to record.

The offer of the anonymized .jsonl files is still genuinely useful for a future diff-driven repro — thank you for that.

ubehera · 2026-06-02T14:54:06Z

@clawsweeper re-review

One cleanup commit plus a PR-body correction since the last pass on 43baac80c5. New head b06e2b72d0 (fast-forward, not a force-push — the reviewed feature commit 43baac80c5 is untouched).

1. Removed dead refreshAfterOwnedSessionWrite (method + interface entry) — src/agents/embedded-agent-runner/run/attempt.session-lock.ts

This is upstream main's method (introduced in 2bb00f6726 "fix(agents): fence embedded session writes", carried through the #85341 rename). Its only caller was onMessagePersisted, which this PR's core fix already rewires to publishOwnedPostMessageWrite(beforeWriteSnapshot) (attempt.ts). Superseding that single call site left the method with zero production references — rg -uu (including generated/ignored files) confirms no remaining source reference on this branch; the only hits are regenerated build artifacts and git internals. Removing it completes the refactor to one canonical path: the fail-open unconditional refresh is exactly the behavior this PR replaces, so leaving it defined would strand a dead fail-open path beside the new fail-closed gate. Not a plugin-sdk export (the type is internal; its only structural consumer, attempt-abort.ts, uses Pick<…, "releaseHeldLockForAbort">, unaffected).

2. Reworked the two tests that referenced it — attempt.session-lock.test.ts

Rewrote "keeps post-provider transcript writes owned after prompt stream returns" onto the production method publishOwnedPostMessageWrite(beforeWrite), capturing the pre-append fingerprint exactly as beforeMessagePersist does. This preserves its unique post-reacquireAfterPrompt coverage and is mutation-sensitive: if the publish were a no-op, acquireForCleanup's assertSessionFileFence sees current=F1 ≠ fence=F0, trips, and the test fails. Added a comment marking it as the deliberate post-reacquire path.
Deleted "refreshes the prompt fence after an owned session manager append" as redundant — it tested the removed method, and its production-API analog is the existing "accepts a same-lane write published with a trusted pre-append fingerprint" trust-gate case.

3. PR body corrected (no code impact)

Removed the inaccurate "@htuyer121's corpus shows the class continues firing after #87159" framing per their correction (corpus is pre-#87159; zero events post-#87159). Reframed the headline as fail-open → fail-closed publish-boundary hardening, and added a Fix shape vs. #86572 note (publish boundary, not the proposed ALS-scope hoist) plus a Known limitation note (the narrow pre-append window, owned by #87159's file-ownership boundary).

Verification on b06e2b72d0:

node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts → 44/44 (was 45; −1 from the redundant-test deletion).
oxlint exit 0; oxfmt --check clean.
crabbox local-container pnpm check:changed (Ubuntu 26.04, Node 24) → exit=0 (command 3m16s / total 5m40s): typecheck core + core tests clean, lint 0 warnings / 0 errors on the 5 changed files, 0 runtime import cycles, all guards (dependency pins, package patches, media/sidecar/webhook/pairing) pass.

Net −23 LOC (source −7, tests −16). No production behavior change vs the prior head — this commit removes a now-dead path that the fix had already superseded and refreshes its coverage onto the production seam.

clawsweeper · 2026-06-02T14:54:09Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26828060667
Updated: 2026-06-02T15:01:46.227Z

sparkli1022-lab · 2026-06-05T17:00:04Z

Coverage note for the publish-boundary fix shape.

A runtime-owned custom session entry can advance the same session file during the prompt-lock release window before the next message append.

Observed ordering from a packaged openclaw@2026.6.1 Docker gateway run:

thinking_level_change
custom entry model-snapshot
user message
subsequent fence check raised EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released

If the owned-write publish path is applied only to sessionManager.appendMessage, the later user-message append captures a pre-write fingerprint that already includes the unpublished custom entry. The fail-closed trust gate then refuses to publish the message append, so the fence still trips.

In this scenario, the publish-boundary coverage needs to include runtime-owned custom session entries as well as message entries, or those custom writes need an equivalent ownership/locking path so they cannot advance the session file unrecorded during the prompt-lock release window.

openclaw-barnacle Bot added agents Agent runtime and tooling size: S triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels May 25, 2026

ubehera mentioned this pull request May 25, 2026

fix(agents): file-scoped prompt-window guard for same-session embedded races #86067

Closed

ubehera mentioned this pull request May 25, 2026

fix(agents): drain prompt-emitted session events before fence reacquire (#85913) #86585

Closed

openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels May 25, 2026

clawsweeper Bot added P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. labels May 25, 2026

openclaw-barnacle Bot added size: M and removed size: S labels May 25, 2026

ubehera changed the title ~~fix(agents): refresh fence fingerprint before external-hook write lock (#86572)~~ fix(agents): publish pi's owned writes via onMessagePersisted (#86572) May 25, 2026

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 25, 2026

clawsweeper Bot added status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels May 25, 2026

ubehera changed the title ~~fix(agents): publish pi's owned writes via onMessagePersisted (#86572)~~ fix(agents): gate owned-write publish on pre-append fingerprint (#86572) May 25, 2026

tonga54 added a commit to arkimedes-dev/openclaw that referenced this pull request May 30, 2026

docs(arkimedes): note vendored openclaw#86584 session-takeover cherry…

63ed61a

…-pick

xuess mentioned this pull request May 30, 2026

[Bug]: isolated cron can still self-conflict on a dedicated cron agent with EmbeddedAttemptSessionTakeoverError #88369

Open

KanopiAndPetra mentioned this pull request Jun 1, 2026

EmbeddedAttemptSessionTakeoverError — 6-agent cron cluster, sessionKey binding pattern, cross-lane vs same-lane #89256

Closed

fix(agents): pre-append trust gate via beforeMessagePersist publish (…

43baac8

…squashed)

ubehera force-pushed the fix/session-takeover-refresh-before-hook-lock branch from d780c41 to 43baac8 Compare June 2, 2026 08:29

openclaw-barnacle Bot added size: M and removed size: S proof: sufficient ClawSweeper judged the real behavior proof convincing. labels Jun 2, 2026

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 2, 2026

chepurko mentioned this pull request Jun 2, 2026

Dreaming narrative generation always falls back; DREAMS.md never gets real entries #81465

Open

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 2, 2026

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 2, 2026

refactor(agents): drop dead refreshAfterOwnedSessionWrite; rewire tes…

b06e2b7

…t to publish path (openclaw#86572)

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 2, 2026

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 2, 2026

Uh oh!

Conversation

ubehera commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Fix shape (vs. #86572's proposal)

Rebase note (force-pushed, head 43baac80c5)

Production context — corrected

Real behavior proof

Compatibility / Risk

Related

Uh oh!

clawsweeper Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ubehera commented May 25, 2026

Uh oh!

clawsweeper Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clawsweeper Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Hatch command

Uh oh!

ubehera commented May 25, 2026

Uh oh!

ubehera commented May 25, 2026

Uh oh!

clawsweeper Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ubehera commented May 25, 2026

Uh oh!

ubehera commented May 25, 2026

Uh oh!

clawsweeper Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ubehera commented May 29, 2026

Uh oh!

clawsweeper Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ubehera commented Jun 2, 2026

Uh oh!

clawsweeper Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ubehera commented Jun 2, 2026

Uh oh!

htuyer121 commented Jun 2, 2026

Uh oh!

ubehera commented Jun 2, 2026

Uh oh!

ubehera commented Jun 2, 2026

Uh oh!

clawsweeper Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sparkli1022-lab commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ubehera commented May 25, 2026 •

edited

Loading

Rebase note (force-pushed, head `43baac80c5`)

clawsweeper Bot commented May 25, 2026 •

edited

Loading

clawsweeper Bot commented May 25, 2026 •

edited

Loading

clawsweeper Bot commented May 25, 2026 •

edited

Loading

clawsweeper Bot commented May 25, 2026 •

edited

Loading

clawsweeper Bot commented May 25, 2026 •

edited

Loading

clawsweeper Bot commented May 29, 2026 •

edited

Loading

clawsweeper Bot commented Jun 2, 2026 •

edited

Loading

clawsweeper Bot commented Jun 2, 2026 •

edited

Loading

sparkli1022-lab commented Jun 5, 2026 •

edited

Loading