fix(agents): gate owned-write publish on pre-append fingerprint (#86572)#86584
fix(agents): gate owned-write publish on pre-append fingerprint (#86572)#86584ubehera wants to merge 2 commits into
Conversation
|
Codex review: needs maintainer review before merge. Reviewed June 2, 2026, 11:01 AM ET / 15:01 UTC. Summary PR surface: Source +126, Tests +60. Total +186 across 5 files. Reproducibility: yes. source-level. Current main unconditionally refreshes the fence after persistence, and the PR's terminal harness plus checked-in tests model the F0 to external F1 to pi F2 mixed interleave. Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Land the fail-closed publish boundary only after agents/session owner acceptance, keeping the regression coverage and the linked canonical issue open until the PR merges. Do we have a high-confidence way to reproduce the issue? Yes, source-level. Current main unconditionally refreshes the fence after persistence, and the PR's terminal harness plus checked-in tests model the F0 to external F1 to pi F2 mixed interleave. Is this the best way to solve the issue? Yes, likely. Gating at the publish boundary is narrower than hoisting the ALS scope because it uses the session-manager append seam and avoids broadly classifying unrelated listener writes as owned. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 6c7644268f59. Label changesLabel justifications:
Evidence reviewedPR surface: Source +126, Tests +60. Total +186 across 5 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
@clawsweeper re-review PR body updated with harness output ( |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
ClawSweeper PR egg ✨ Hatched: 🌱 uncommon Gilded Proofling Hatch commandComment Hatchability rules:
Rarity: 🌱 uncommon. What is this egg doing here?
|
|
Heads-up: the The failing test is The same test, same line, same timeout fails on main CI run This PR touches Per AGENTS.md "If [a failing check is] unrelated on latest origin/main, say so with scoped proof" — flagging this so it doesn't become a maintainer-side speed bump. |
…gated) Replaces the prior refreshBeforeLock approach (91ba92c) per ClawSweeper review on PR openclaw#86584. The earlier design used a blanket refreshAfterOwnedSessionWrite() right before each external-hook lock acquisition; that approach could silently accept genuine external mutations because it advanced the fence without proving the new state was the lane's own write. New approach: route pi's _persist writes through the existing ownedSessionFileWrites machinery instead of blind fence refreshes. - Add EmbeddedAttemptSessionLockController.publishOwnedPostMessageWrite() (sync). Called from the onMessagePersisted callback right after pi's sessionManager.appendMessage -> _persist -> appendFileSync completes. Records the post-write fingerprint as an OWNED write in the process-shared ownedSessionFileWrites map so subsequent assertSessionFileFence calls accept the lane's own writes via the owned-write match path. - Trust-gate: publishOwnedPostMessageWrite only records the new state when the prior fenceFingerprint is in trustedSessionFileStates (set at releaseForPrompt time). External mutations that bypass the callback are never recorded as owned and continue to trip EmbeddedAttemptSessionTakeoverError correctly. - Revert installSessionExternalHookWriteLock's refreshBeforeLock param and the shared waitBeforeLock closure. Each installLockableFunction call goes back to waitBeforeLock: () => waitForSessionEventQueue(session) directly. - runEmbeddedAttempt's onMessagePersisted callback now calls sessionLockController.publishOwnedPostMessageWrite() instead of refreshAfterOwnedSessionWrite(). Tests (5 cases under "embedded attempt session lock lifecycle"): - "accepts pi-style writes published via publishOwnedPostMessageWrite before beforeToolCall fires" — positive case. - "trips takeover on a same-file external write that bypasses publishOwnedPostMessageWrite" — NEGATIVE case the bot explicitly required ("add a negative test for an external same-file append before a hook lock"). Verifies fail-closed invariant preserved. - "allows multiple prompt turns with pi-style writes published per turn" — multi-turn case, three iterations, no false takeover. - "does not classify a different session file's writes as owned by this controller" — cross-file isolation. - "does not hang when reacquireAfterPrompt rejects after a pi-style direct write" — abort/error path. 78 cases across 2 test files green; pnpm check:changed clean. Closes openclaw#86572.
|
Redesigned per ClawSweeper review on Change in approach:
Addresses bot guidance verbatim:
Proof: Three-scenario harness covers pre-fix bug, post-fix success, and the negative case. Verbatim output in the PR body. @clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
CI status: This is a timing-sensitive flake in the cron-service batched-due-jobs path (uses a mutable All other checks pass on |
|
Force-pushed The bot's exact concern, addressed:
Fix shape:
Mixed-interleaving negative test added: it("trips takeover on a mixed external-then-pi append (publish refuses to launder external mutation)", async () => {
await controller.releaseForPrompt(); // F0 trusted
await fs.appendFile(sessionFile, '...external...', "utf8"); // F0 -> F1
const beforeWrite = readSessionFileFingerprintSync(sessionFile);// F1
await fs.appendFile(sessionFile, '...pi-after-external...'); // F1 -> F2
controller.publishOwnedPostMessageWrite(beforeWrite); // F1 NOT trusted: skip
await expect(session.agent.beforeToolCall()).rejects.toBeInstanceOf(
EmbeddedAttemptSessionTakeoverError, // fail closed
);
});Proof harness updated to a 4-scenario harness; Scenario D is the mixed-interleave fail-closed verification. All 4 PASS. Verbatim stdout in the PR body. 80 tests across the session-lock surface green (+1 from the new mixed-interleave test). 46 cases across the session-tool-result-guard surfaces also green — no regressions from the new @clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
@clawsweeper re-review PR body has been refreshed since the last review pass:
Asking for fresh consideration of the rating now that the Real behavior proof gate is green. |
|
🦞👀 Command router queued. I will update this comment with the next step. Re-review progress:
|
* fix(agents): skip fallback for session coordination errors Preserve provider fallback metadata when session coordination errors are nested under provider failures. Co-authored-by: luyao618 <364939526@qq.com> (cherry picked from commit 6a5a135) * fix(agents): tolerate in-process session writes during prompt release (openclaw#84250) Merged via squash. Prepared head SHA: 33f88fe Co-authored-by: tianxiaochannel-oss88 <272340815+tianxiaochannel-oss88@users.noreply.github.com> Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com> Reviewed-by: @jalehman (cherry picked from commit 1b77145) * fix(agents): bound embedded compaction write locks Fixes the embedded attempt session write-lock watchdog so the fallback max hold time follows the resolved compaction timeout plus the existing lock grace window, instead of inheriting the full run timeout. Adds regression coverage for the helper and settled-compaction lock lifecycle, plus a changelog entry thanking @luoyanglang. Verification: - `pnpm test src/agents/session-write-lock.test.ts src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts` - `pnpm check:changed` via Blacksmith Testbox `tbx_01ks8b6vn8se5cg1dfn3te3g47` / https://github.com/openclaw/openclaw/actions/runs/26301988670 - Autoreview clean: `/Users/steipete/Projects/agent-scripts/skills/autoreview/scripts/autoreview --mode branch --base origin/main` - PR CI green on `79e8c5f1a637981d263c0268bf5666967ff4e778`: https://github.com/openclaw/openclaw/actions/runs/26302152844 and https://github.com/openclaw/openclaw/actions/runs/26302152798 Co-authored-by: luoyanglang <hanwanlonga@gmail.com> (cherry picked from commit 46de078) * fix(session-lock): enforce maxHoldMs in shouldReclaim during lock acquisition (openclaw#85764) * fix(session-lock): enforce maxHoldMs in shouldReclaim during lock acquisition - Adds optional maxHoldMs parameter to inspectLockPayload - Inspect now marks locks as stale when held longer than maxHoldMs - Passes maxHoldMs through inspectLockPayloadForSession - acquireSessionWriteLock's shouldReclaim callback now passes maxHoldMs This ensures that when a live process holds a lock for longer than maxHoldMs (default 5min), other processes can reclaim it during acquisition — matching the watchdog's existing enforcement. Previously shouldReclaim only used staleMs (30min default), meaning a lock held for 10+ minutes by a live PID would never be reclaimable, causing 60s timeout failures and gateway freezes. Closes openclaw#85762 * fix(session-lock): add dead-PID fast-path before retry loop Adds a fast-path check at the top of acquireSessionWriteLock: if the lock file's owner PID is dead, remove it immediately before entering the retry loop. This saves up to timeoutMs (60s) of futile waiting when the previous lock holder has died. The shouldReclaim callback already handles this case, but only iteratively through the retry loop. The fast-path eliminates that unnecessary delay. * fix(session-lock): enforce max hold during acquisition * fix(session-lock): revalidate max hold safely * fix(session-lock): honor holder max-hold policy * fix(session-lock): keep cleanup from reclaiming live holders * fix(session-lock): remove stale locks only when unchanged * fix(session-lock): skip self-held max-hold reclaim * fix(ci): refresh gateway protocol checks --------- Co-authored-by: njuboy11 <njuboy11@users.noreply.github.com> Co-authored-by: Peter Steinberger <steipete@gmail.com> (cherry picked from commit a1eb765) * fix(embedded-runner): preserve provider errors on cleanup takeover (openclaw#84321) Summary: - The PR preserves provider-facing embedded-runner prompt errors when cleanup detects session takeover, keeps the takeover signal fatal for fallback, and adds focused regressions. - PR surface: Source +52, Tests +92. Total +144 across 5 files. - Reproducibility: yes. Source inspection shows current main can let cleanup takeover replace a prior prompt/p ... rror and can normalize a provider-looking takeover wrapper before fallback sees it as coordination failure. Automerge notes: - PR branch already contained follow-up commit before automerge: fix(embedded-runner): preserve takeover during fallback - PR branch already contained follow-up commit before automerge: fix(clawsweeper): address review for automerge-openclaw-openclaw-8405… Validation: - ClawSweeper review passed for head 050c779. - Required merge gates passed before the squash merge. Prepared head SHA: 050c779 Review: openclaw#84321 (comment) Co-authored-by: abnershang <abner.shang@gmail.com> Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com> Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com> Approved-by: takhoffman Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com> (cherry picked from commit 7fbca96) * fix(agents): release embedded-attempt session lock on every exit path (openclaw#86427) * fix(agents): release embedded-attempt session lock on every exit path The embedded run controller acquires its session write lock eagerly at creation and released it only inside the post-run cleanup block. An exception thrown in post-prompt processing skipped that block, so the lock leaked to the live gateway process until the watchdog reclaimed it and later requests to the session failed with SessionWriteLockTimeoutError. Add an idempotent dispose() to the lock controller and call it from the run's outer finally so the eagerly-held lock is released on every exit path. Normal/aborted/timed-out runs still hand the lock to acquireForCleanup first, so dispose() is a no-op then (no double release). Fixes openclaw#86014 * fix: keep session lock teardown comment lean * docs(changelog): note embedded session lock fix --------- Co-authored-by: Peter Steinberger <steipete@gmail.com> (cherry picked from commit 32ddfc2) * fix(agents): fence yield abort lock release (cherry picked from commit 0fe7479) * fix(agents): memoize session lock owner args Memoize owner process argv lookups per PID during `cleanStaleLockFiles`, and yield between lock entries so startup cleanup does not monopolize the event loop while inspecting many session locks. This keeps lock classification semantics unchanged while avoiding repeated synchronous process-args reads for lock clusters owned by the same PID, especially the Windows PowerShell path. Fixes openclaw#86509. Verification: - `git diff --check origin/main...HEAD` - focused TSX harness against the current-main merge result: `session-lock memo regression harness passed` Thanks @openperf. Co-authored-by: openperf <16864032@qq.com> (cherry picked from commit c430fcd) * fix(diagnostics): recover orphaned session activity Recover idle queued sessions whose diagnostic activity retained stale ownerless model or tool calls by classifying them as recoverable session.stuck after the usual recovery gates. Yield the event loop before stale session-lock process inspection so sync process lookup cannot monopolize lock contention paths. Docs now describe the widened session.stuck telemetry contract for recoverable stale bookkeeping, including ownerless activity. Thanks @samuelsoaress. Refs openclaw#84903. Co-authored-by: samuelsoaress <samuelsoares177778@gmail.com> (cherry picked from commit 286964c) * [FORK][openclaw#86584] gate owned-write publish on pre-append fingerprint (fixes openclaw#86572) Carries unmerged upstream PR openclaw#86584 (HEAD d79a3b4) onto the boon 5.18 base as the same-lane EmbeddedAttemptSessionTakeoverError fence fix for long cron turns. Fails closed: an external mutation before pi's append fails the trust gate and still trips the fence (verified by the PR's 303-line test suite incl. the mixed-interleave negative test). Backfills base symbols openclaw#86584 assumes (introduced upstream between 5.18 and the PR base, not carried by the 9 merged race-fix picks): - session-lock.ts: MAX_BENIGN_SESSION_FENCE_{ADVANCE,REWRITE,REWRITE_RESULT}_BYTES, MAX_SAFE_FILE_OFFSET, TRANSCRIPT_ONLY_OPENCLAW_ASSISTANT_MODELS, SessionFileFenceSnapshot type, fenceSnapshot state var, ActiveWriteLockState type + activeWriteLock store fix (reuse nested writes via {active:true}), node:util + string-normalization imports. - transcript-append.ts: wrap appendSessionTranscriptMessage in runWithOwnedSessionTranscriptWriteLock so low-level appends acquire the owned-context lock. - test import fixes (appendSessionTranscriptMessage, withOwned/bindOwned, __testing). Drop when upstream merges openclaw#86584. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * [FORK][openclaw#86584] wire owned-transcript-write context + typecheck cleanup CRITICAL: wrap promptActiveSession in withOwnedSessionTranscriptWrites and bind onBlockReply/onBlockReplyFlush to the owned context in attempt.ts. Without this, pi's own transcript appends during a prompt are NOT recorded as owned, so the fence trips on them (the exact takeover the backport is meant to prevent). This wiring is an intermediate-base feature (between 5.18 and openclaw#84250's base) the merged picks didn't carry. Tests passed before only because they set the context manually. Also: add releaseHeldLockForAbort to the controller type; drop incidental non-fence suppressAssistantErrorPersistence passes; remove dead async benign-rewrite cluster (sessionFence{Advance,Rewrite}IsBenign + readAppendedSessionFileText + lineMatchesLinearTranscriptMigration + helpers) — our openclaw#84250-based assertSessionFileFence uses the sync owned-write path, so the async benign-detection variants are unreachable. tsgo core: 0 errors. 384 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * [FORK][openclaw#86584] address codex review: prefix-validate benign advance + preserve provider error Finding 2 (masking gap, P2): sessionFenceAdvanceIsBenignSync only validated the APPENDED bytes, so a writer that rewrote the existing prefix AND appended a benign delivery-mirror/gateway-injected line could be laundered as an owned advance — masking a genuine external takeover (silent message loss). Now fail closed unless the current prefix is byte-identical to the trusted readSessionFileFenceSnapshot text (readSessionFilePrefixSync); absent snapshot text => not benign. Finding 1 (provider-error masking, P2): wrappedStreamFn's finally let a reacquireAfterPrompt() takeover error mask the original provider error when the stream itself threw. Now only surface the reacquire error when the stream succeeded; otherwise preserve the original failure. tsgo core: 0 errors. 384 tests pass (benign-advance acceptance + external-mutation rejection both green). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(release): 2026.5.18-boon.1 — session-takeover hardening (boon fleet build) Version bump + CHANGELOG for the fork build. Also fixes a backport test-import gap: attempt.test.ts referenced `attemptTesting` (the __testing export) without importing it. Full project typecheck (tsgo -b tsconfig.projects.json): 0 errors. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(ci): no-unsafe-finally in wrappedStreamFn + drop collateral protocol/test churn - wrappedStreamFn: restructure provider-error-preservation without a throw inside finally (oxlint no-unsafe-finally). Same semantics: always reacquire; prefer the original stream error over a reacquire takeover error; surface reacquire error only when the stream succeeded. - Revert src/gateway/server-methods/agent.test.ts + GatewayModels.swift to the 5.18 baseline: the openclaw#85764 cherry-pick conflict-resolution had pulled in openclaw#85256-era internal-session-effect tests + protocol fields whose implementation isn't in this backport, breaking checks-node-agentic-gateway-methods + checks-fast-bundled-protocol. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix: remove vestigial onAssistantErrorMessagePersisted option decls Address cubic P2 review (PR #2): the option was declared on the guard and guard-wrapper option types but never forwarded or invoked, so any provided callback was silently ignored. The companion error-suppression feature (suppressAssistantErrorPersistence + the agent-runner/followup caller chain) is deliberately scoped OUT of this 5.18 backport, so the decls were dead plumbing left over from a cherry-pick. Remove them to keep the option surface honest; the load-bearing beforeMessagePersist fence checkpoint (openclaw#86572) is retained. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Yao <364939526@qq.com> Co-authored-by: xiaotian <tianxiaochannel@gmail.com> Co-authored-by: 狼哥 <hanwanlonga@gmail.com> Co-authored-by: njuboy <njuboy11@gmail.com> Co-authored-by: njuboy11 <njuboy11@users.noreply.github.com> Co-authored-by: Peter Steinberger <steipete@gmail.com> Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com> Co-authored-by: abnershang <abner.shang@gmail.com> Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com> Co-authored-by: Chunyue Wang <80630709+openperf@users.noreply.github.com> Co-authored-by: openperf <16864032@qq.com> Co-authored-by: Samuel Soares da Silva <samuelsoares177778@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d780c41 to
43baac8
Compare
|
@clawsweeper re-review Force-pushed Added (
Harness Scenario C (pure external advance) is intentionally not duplicated — it is already covered by the existing Verification on the rebased branch (
No source behavior changed in this push — the diff vs the prior head is the rebase plus the three test cases. |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
Latest-head broad gate — Ran the current-head broad changed gate (the latest-head verification flagged in the last review) on node scripts/crabbox-wrapper.mjs run --provider local-container -- \
env CI=1 OPENCLAW_TEST_PROJECTS_PARALLEL=6 OPENCLAW_VITEST_MAX_WORKERS=1 \
OPENCLAW_VITEST_NO_OUTPUT_TIMEOUT_MS=900000 pnpm check:changedResult:
Test execution (separate from Proof note: this ran on a local-container Linux runner (Node 24, CI-matching), not the brokered runner. A maintainer can re-run the brokered broad gate for the canonical environment if desired. |
|
Timeline correction on the production corroboration attributed to me — my 30-event corpus is PRE-#87159 in our install, not "post-#87159 still firing." Thanks for carrying the production data. One clarification I should make before it weighs on the merge, because the current PR body reads our corpus as the class "continues firing in real installs after #87159 landed" / "post-#87159 code" — that isn't accurate for our install:
So our corpus documents the class while it was active pre-#87159; it isn't evidence that it persists after #87159. The "rate accelerating / spread widening" in my 05-28 comment was the tail of the pre-upgrade period and shouldn't be read as current. What our data can't do is corroborate that the same-lane path this PR targets is still firing post-#87159 — once the cross-lane boundary is closed, our production window can't isolate a same-lane-only race. So I'd weight this PR on its own source-level proof (the 4-scenario harness — A trips pre-fix, B clean post-fix, C/D fail-closed — independently demonstrates the same-lane race and the trust gate's fail-closed behavior), rather than on our corpus as a post-#87159 signal. The 30 anonymized session |
…t to publish path (openclaw#86572)
|
Thanks @htuyer121 — appreciate the careful correction, and you're right. I've updated the PR body accordingly: the 30-event corpus is now described as pre-#87159 (build 2026.5.26), and the "continues firing after #87159" framing is gone. The "Production context" section now leads with your correction — zero events over ~6 days post-#87159, and the point that a same-lane-only race can't be isolated from production once #87159 closes the cross-lane boundary — and explicitly weights the PR on its source-level proof rather than the corpus, per your recommendation. For clarity on what the PR now claims: defense-in-depth hardening of the same-lane publish boundary (fail-open unconditional fence refresh → fail-closed trust gate), not a fix for an actively-firing production class. The meaningful source-level scenario is the mixed-interleave case (an external mutation interleaved with the lane's own write being laundered into the fence), which the trust gate refuses to record. The offer of the anonymized |
|
@clawsweeper re-review One cleanup commit plus a PR-body correction since the last pass on 1. Removed dead This is upstream 2. Reworked the two tests that referenced it —
3. PR body corrected (no code impact) Removed the inaccurate "@htuyer121's corpus shows the class continues firing after #87159" framing per their correction (corpus is pre-#87159; zero events post-#87159). Reframed the headline as fail-open → fail-closed publish-boundary hardening, and added a Fix shape vs. #86572 note (publish boundary, not the proposed ALS-scope hoist) plus a Known limitation note (the narrow pre-append window, owned by #87159's file-ownership boundary). Verification on
Net −23 LOC (source −7, tests −16). No production behavior change vs the prior head — this commit removes a now-dead path that the fix had already superseded and refreshes its coverage onto the production seam. |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
Coverage note for the publish-boundary fix shape. A runtime-owned custom session entry can advance the same session file during the prompt-lock release window before the next message append. Observed ordering from a packaged openclaw@2026.6.1 Docker gateway run:
If the owned-write publish path is applied only to sessionManager.appendMessage, the later user-message append captures a pre-write fingerprint that already includes the unpublished custom entry. The fail-closed trust gate then refuses to publish the message append, so the fence still trips. In this scenario, the publish-boundary coverage needs to include runtime-owned custom session entries as well as message entries, or those custom writes need an equivalent ownership/locking path so they cannot advance the session file unrecorded during the prompt-lock release window. |
Summary
Converts the embedded session-takeover fence's post-write handling from a fail-open unconditional fence refresh into a fail-closed, trust-gated owned-write publish.
On current
main, after a lane's own message persists,onMessagePersisted → refreshAfterOwnedSessionWrite()unconditionally re-reads the session file and trusts whatever is on disk as the new fence baseline. If an external mutation interleaves betweenreleaseForPrompt(fence = F0) and the lane's own append (F1 → F2), that external mutation is silently laundered into the fence and the takeover detector never trips.This PR captures the file fingerprint immediately before the lane's append (
beforeMessagePersist) and records the post-write state as owned only when that pre-append fingerprint was trusted (matches the active fence, is a benign transcript-only advance, or is a previously-trusted state). Otherwise it refuses to record — so an interleaved external mutation can no longer be laundered, and the fence still trips.Closes #86572. Complementary to #87159: that PR closes the cross-lane file-ownership race; this PR closes the same-lane publish boundary. The two enforce orthogonal invariants.
Fix shape (vs. #86572's proposal)
#86572 proposed hoisting the
withOwnedSessionTranscriptWritesALS scope to spanagent.prompt(). This PR instead gates at the publish boundary (onMessagePersisted): pi's session writes already funnel through the guardedsessionManager.appendMessageseam, so the trust decision belongs in the controller that owns the fence, not in the ALS scope. This keeps the trust decision at the ownership boundary and avoids classifying unrelated listener writes as owned.Rebase note (force-pushed, head
43baac80c5)Rebased onto current
upstream/main, picking up:installSessionExternalHookWriteLock/installSessionEventWriteLock, moved hook locking into the controllerProduction context — corrected
So there is no current production signal for the specific same-lane path this PR hardens. Per the reporter's own recommendation, this PR should be weighed on its source-level proof, not on that corpus. This is defense-in-depth hardening of the publish boundary — not a fix for an actively-firing production class.
Real behavior proof
Behavior addressed: Converts the post-owned-write fence update from fail-open (unconditional refresh that can launder an interleaved external mutation into the fence) to fail-closed (record-as-owned only when the pre-append fingerprint was trusted). The behavioral delta versus current
mainis the mixed-interleave case (Scenario D):mainlaunders the external mutation into the fence; this PR refuses to, so the takeover still trips. Clean same-lane writes (Scenario B) behave identically tomain.Real environment tested: Source-level runtime harness on rebased branch
43baac80c5: macOS 26.5, Node 26.0.0, branched off currentupstream/main(post-Fix embedded session file ownership race #87159 + fix(agents): move session write lock into owned session runtime #87409); loads the patchedEmbeddedAttemptSessionLockControllerfrom source via dynamic import and drives four runtime scenarios against real temp session JSONL files viafs.appendFileandcontroller.withSessionWriteLock.Exact steps or command run after this patch:
git fetch origin && git checkout fix/session-takeover-refresh-before-hook-lock(confirm at43baac80c5), thennode --import tsx /Users/umank/code/openclaw-tickets/proof/run-refresh-before-hook-lock-proof.mjs /Users/umank/code/openclaw. Checked-in regression:node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts.Evidence after fix: terminal capture from the rebased branch follows. Note: in the harness, Scenario A's "pre-fix" label means the controller is simply not informed of the write (neither
main'srefreshAfterOwnedSessionWritenor this PR'spublishOwnedPostMessageWriteis called) — it demonstrates the controller mechanics, not a literalmainreproduction. The behavioral delta overmainis Scenario D.Observed result after fix: Scenarios A/B/C demonstrate the gate mechanics; Scenario D is the behavioral improvement over
main—main's unconditional refresh would record the combined F2 (external + pi) state as the trusted fence and silently absorb the external mutation, whereas the trust gate sees the untrusted pre-append baseline (F1) and refuses to record, so the takeover fence still fires. The checked-in regression covers the publish / skip / mixed-interleave cases directly.What was not tested:
appendFileSync. A concurrent external writer appending benign-looking transcript JSONL inside that narrow window could still be absorbed. Preventing concurrent writers to the same session file is the responsibility of Fix embedded session file ownership race #87159's file-ownership boundary, not this publish-boundary gate.Compatibility / Risk
publishOwnedPostMessageWrite(beforeWrite)is an additive method onEmbeddedAttemptSessionLockController.beforeMessagePersistis an additive optional opt on the session guard. Backward compatible.Related
This PR is AI-assisted. Code authored with Claude.