Skip to content

fix(agents): gate owned-write publish on pre-append fingerprint (#86572)#86584

Open
ubehera wants to merge 2 commits into
openclaw:mainfrom
ubehera:fix/session-takeover-refresh-before-hook-lock
Open

fix(agents): gate owned-write publish on pre-append fingerprint (#86572)#86584
ubehera wants to merge 2 commits into
openclaw:mainfrom
ubehera:fix/session-takeover-refresh-before-hook-lock

Conversation

@ubehera

@ubehera ubehera commented May 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Converts the embedded session-takeover fence's post-write handling from a fail-open unconditional fence refresh into a fail-closed, trust-gated owned-write publish.

On current main, after a lane's own message persists, onMessagePersisted → refreshAfterOwnedSessionWrite() unconditionally re-reads the session file and trusts whatever is on disk as the new fence baseline. If an external mutation interleaves between releaseForPrompt (fence = F0) and the lane's own append (F1 → F2), that external mutation is silently laundered into the fence and the takeover detector never trips.

This PR captures the file fingerprint immediately before the lane's append (beforeMessagePersist) and records the post-write state as owned only when that pre-append fingerprint was trusted (matches the active fence, is a benign transcript-only advance, or is a previously-trusted state). Otherwise it refuses to record — so an interleaved external mutation can no longer be laundered, and the fence still trips.

Closes #86572. Complementary to #87159: that PR closes the cross-lane file-ownership race; this PR closes the same-lane publish boundary. The two enforce orthogonal invariants.

Fix shape (vs. #86572's proposal)

#86572 proposed hoisting the withOwnedSessionTranscriptWrites ALS scope to span agent.prompt(). This PR instead gates at the publish boundary (onMessagePersisted): pi's session writes already funnel through the guarded sessionManager.appendMessage seam, so the trust decision belongs in the controller that owns the fence, not in the ALS scope. This keeps the trust decision at the ownership boundary and avoids classifying unrelated listener writes as owned.

Rebase note (force-pushed, head 43baac80c5)

Rebased onto current upstream/main, picking up:

  • #87159 — canonical session-file ownership indexing
  • #87409 — removed installSessionExternalHookWriteLock / installSessionEventWriteLock, moved hook locking into the controller

Production context — corrected

Correction (see @htuyer121's comment): an earlier revision of this PR cited @htuyer121's EmbeddedAttemptSessionTakeoverError corpus as evidence the class "continues firing after #87159." That framing was inaccurate and the reporter has corrected it.

So there is no current production signal for the specific same-lane path this PR hardens. Per the reporter's own recommendation, this PR should be weighed on its source-level proof, not on that corpus. This is defense-in-depth hardening of the publish boundary — not a fix for an actively-firing production class.

Real behavior proof

  • Behavior addressed: Converts the post-owned-write fence update from fail-open (unconditional refresh that can launder an interleaved external mutation into the fence) to fail-closed (record-as-owned only when the pre-append fingerprint was trusted). The behavioral delta versus current main is the mixed-interleave case (Scenario D): main launders the external mutation into the fence; this PR refuses to, so the takeover still trips. Clean same-lane writes (Scenario B) behave identically to main.

  • Real environment tested: Source-level runtime harness on rebased branch 43baac80c5: macOS 26.5, Node 26.0.0, branched off current upstream/main (post-Fix embedded session file ownership race #87159 + fix(agents): move session write lock into owned session runtime #87409); loads the patched EmbeddedAttemptSessionLockController from source via dynamic import and drives four runtime scenarios against real temp session JSONL files via fs.appendFile and controller.withSessionWriteLock.

  • Exact steps or command run after this patch: git fetch origin && git checkout fix/session-takeover-refresh-before-hook-lock (confirm at 43baac80c5), then node --import tsx /Users/umank/code/openclaw-tickets/proof/run-refresh-before-hook-lock-proof.mjs /Users/umank/code/openclaw. Checked-in regression: node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts.

  • Evidence after fix: terminal capture from the rebased branch follows. Note: in the harness, Scenario A's "pre-fix" label means the controller is simply not informed of the write (neither main's refreshAfterOwnedSessionWrite nor this PR's publishOwnedPostMessageWrite is called) — it demonstrates the controller mechanics, not a literal main reproduction. The behavioral delta over main is Scenario D.

============================================================
PR #86584 — publish-on-onMessagePersisted proof (Issue #86572)
============================================================
source under test: src/agents/embedded-agent-runner/run/attempt.session-lock.ts
                   publishOwnedPostMessageWrite(beforeWrite)
                   gated on isTrustedSessionFileState(key, beforeWrite)

-------- Scenario A — pre-fix: pi-style write WITHOUT publish (trips takeover) --------
  [   0.13ms] setup      mode=pre-fix
  [   0.15ms] lifecycle  releaseForPrompt() — capture fence F0 + mark trusted
  [   0.47ms] pi-write   pi appendFileSync (F0 → F1) — no publish call
  [   0.60ms] hook       controller.withSessionWriteLock(beforeToolCallBody)
  [   1.62ms] hook       RESULT: TAKEOVER (EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: /var/folders/hw/b1qn975s06qg...)
  [   1.64ms] state      hasSessionTakeover() = true

-------- Scenario B — post-fix: pi-style write WITH publish (hook fires cleanly) --------
  [   0.05ms] lifecycle  releaseForPrompt() — capture fence F0 + mark trusted
  [   0.18ms] pi-write   pi appendFileSync (F0 → F1) with pre-write captured
  [   0.27ms] publish    publishOwnedPostMessageWrite(F0) — F0 is trusted, record F1 as owned
  [   0.34ms] hook       controller.withSessionWriteLock(beforeToolCallBody)
  [   0.46ms] hook       RESULT: hook fired successfully (hookOutcome.fired=true)
  [   0.47ms] state      hasSessionTakeover() = false

-------- Scenario C — negative: external write bypasses publish (still trips) --------
  [   0.12ms] pi-write   EXTERNAL appendFileSync (F0 → F1) — no publish
  [   0.34ms] hook       RESULT: TAKEOVER (EmbeddedAttemptSessionTakeoverError)
  [   0.35ms] state      hasSessionTakeover() = true

-------- Scenario D — mixed interleave: external THEN pi write + publish (still trips) --------
  [   0.12ms] external   EXTERNAL appendFileSync first (F0 → F1)
  [   0.20ms] pi-write   pi appendFileSync second (F1 → F2) — pre-write captured = F1
  [   0.28ms] publish    publishOwnedPostMessageWrite(F1) — F1 NOT trusted, skip (fail closed)
  [   0.50ms] hook       RESULT: TAKEOVER (EmbeddedAttemptSessionTakeoverError)
  [   0.51ms] state      hasSessionTakeover() = true

============================================================
Summary
============================================================
  Scenario A (no fix):       hook fired=false, threw=TAKEOVER
  Scenario B (with fix):     hook fired=true, threw=no
  Scenario C (negative):     hook fired=false, threw=TAKEOVER
  Scenario D (mixed):        hook fired=false, threw=TAKEOVER

RESULT: PASS — all four scenarios behaved as expected.
  • Observed result after fix: Scenarios A/B/C demonstrate the gate mechanics; Scenario D is the behavioral improvement over mainmain's unconditional refresh would record the combined F2 (external + pi) state as the trusted fence and silently absorb the external mutation, whereas the trust gate sees the untrusted pre-append baseline (F1) and refuses to record, so the takeover fence still fires. The checked-in regression covers the publish / skip / mixed-interleave cases directly.

  • What was not tested:

    1. A live runtime reproduction against @htuyer121's corpus on the rebased branch — that corpus is pre-Fix embedded session file ownership race #87159 (see corrected timeline above), so it cannot corroborate the post-Fix embedded session file ownership race #87159 same-lane path.
    2. Known limitation: the trust gate captures the pre-append fingerprint immediately before appendFileSync. A concurrent external writer appending benign-looking transcript JSONL inside that narrow window could still be absorbed. Preventing concurrent writers to the same session file is the responsibility of Fix embedded session file ownership race #87159's file-ownership boundary, not this publish-boundary gate.

Compatibility / Risk

  • publishOwnedPostMessageWrite(beforeWrite) is an additive method on EmbeddedAttemptSessionLockController.
  • beforeMessagePersist is an additive optional opt on the session guard. Backward compatible.
  • The trust gate is fail-closed: any baseline mismatch refuses to record. Cannot launder external mutations.

Related


This PR is AI-assisted. Code authored with Claude.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels May 25, 2026
@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed June 2, 2026, 11:01 AM ET / 15:01 UTC.

Summary
The branch replaces unconditional embedded session-fence refresh after message persistence with a pre-append fingerprint-gated owned-write publish and adds controller regression coverage.

PR surface: Source +126, Tests +60. Total +186 across 5 files.

Reproducibility: yes. source-level. Current main unconditionally refreshes the fence after persistence, and the PR's terminal harness plus checked-in tests model the F0 to external F1 to pi F2 mixed interleave.

Review metrics: 1 noteworthy metric.

  • Session trust boundary: 1 fail-open refresh replaced; 3 trust-gate regression cases added. The measured change is small but directly affects which session fingerprints are accepted as owned before merge.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Have an agents/session owner explicitly accept the fail-closed trust-boundary change.
  • Refresh exact-head checks if main moves again before landing.

Risk before merge

  • [P1] Merging changes which session-file fingerprints are trusted as owned during an embedded attempt; a wrong trust decision could hide an external transcript mutation or leave same-lane false takeover behavior unfixed.
  • [P1] The corrected production corpus is pre-Fix embedded session file ownership race #87159, so the post-Fix embedded session file ownership race #87159 same-lane value rests on source-level proof and checked-in regression coverage rather than live recurrence.

Maintainer options:

  1. Accept Boundary With Owner Signoff (recommended)
    Have an agents/session owner explicitly accept the fail-closed trust decision and land with the current head checks.
  2. Ask For Corpus Replay
    If maintainers want stronger proof, replay the offered anonymized JSONL corpus while noting it cannot isolate a post-Fix embedded session file ownership race #87159 same-lane-only failure.
  3. Pause If Ownership Fix Is Enough
    If maintainers decide Fix embedded session file ownership race #87159 fully owns the remaining failure class, pause or close this hardening PR and keep newer live reports canonical.

Next step before merge

  • No automated repair is indicated; the remaining action is maintainer acceptance of the session-state trust-boundary change.

Security
Cleared: No concrete security or supply-chain concern was found; the diff does not add dependencies, CI execution, secret handling, or broader permissions.

Review details

Best possible solution:

Land the fail-closed publish boundary only after agents/session owner acceptance, keeping the regression coverage and the linked canonical issue open until the PR merges.

Do we have a high-confidence way to reproduce the issue?

Yes, source-level. Current main unconditionally refreshes the fence after persistence, and the PR's terminal harness plus checked-in tests model the F0 to external F1 to pi F2 mixed interleave.

Is this the best way to solve the issue?

Yes, likely. Gating at the publish boundary is narrower than hoisting the ALS scope because it uses the session-manager append seam and avoids broadly classifying unrelated listener writes as owned.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 6c7644268f59.

Label changes

Label justifications:

  • P1: The PR targets an embedded-agent session-state failure mode that can interrupt scheduled or channel-driven agent workflows.
  • merge-risk: 🚨 session-state: The diff changes the trusted fingerprint boundary for session JSONL writes during embedded attempts.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body supplies after-fix terminal output from a source-level runtime harness with real temp session JSONL files, plus checked-in regression and changed-gate evidence in comments.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body supplies after-fix terminal output from a source-level runtime harness with real temp session JSONL files, plus checked-in regression and changed-gate evidence in comments.
Evidence reviewed

PR surface:

Source +126, Tests +60. Total +186 across 5 files.

View PR surface stats
Area Files Added Removed Net
Source 4 139 13 +126
Tests 1 78 18 +60
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 5 217 31 +186

What I checked:

Likely related people:

  • vincentkoc: Authored the embedded session write-fence commit that introduced the current fingerprint/takeover behavior this PR changes. (role: feature owner; confidence: high; commits: 2bb00f6726d4; files: src/agents/embedded-agent-runner/run/attempt.session-lock.ts)
  • joshavant: Authored the merged session-file ownership race fix that this PR explicitly complements. (role: adjacent feature owner; confidence: high; commits: 3349fe21bbde; files: src/agents/embedded-agent-runner/run/attempt.session-lock.ts, src/agents/embedded-agent-runner/run/attempt.ts)
  • steipete: Authored recent merged work moving the session write-lock seam into the owned session runtime and the broad agent-runtime internalization that renamed this path. (role: recent refactor author; confidence: high; commits: 5f68291f4f54, bb46b79d3c14; files: src/agents/embedded-agent-runner/run/attempt.session-lock.ts, src/agents/embedded-agent-runner/run/attempt.ts, src/agents/sessions/agent-session.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels May 25, 2026
@ubehera

ubehera commented May 25, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

PR body updated with harness output (run-refresh-before-hook-lock-proof.mjs) showing deterministic before/after on a real temp session JSONL: pre-fix throws EmbeddedAttemptSessionTakeoverError, post-fix hook fires cleanly with hasSessionTakeover()=false.

@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels May 25, 2026
@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

✨ Hatched: 🌱 uncommon Gilded Proofling

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🌱 uncommon.
Trait: polishes edge cases.
Image traits: location diff observatory; accessory release bell; palette pearl, teal, and neon green; mood curious; pose peeking out from the egg shell; shell frosted glass shell; lighting moonlit rim light; background soft code-shaped tiles.
Share on X: post this hatch
Copy: My PR egg hatched a 🌱 uncommon Gilded Proofling in ClawSweeper.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@ubehera

ubehera commented May 25, 2026

Copy link
Copy Markdown
Contributor Author

Heads-up: the checks-node-agentic-agents failure on this PR is a pre-existing flake on main, not caused by this PR.

The failing test is src/agents/model-catalog-visibility.test.ts:29 > resolveVisibleModelCatalog > limits visible catalog to provider wildcard entries after default discoveryError: Test timed out in 120000ms.

The same test, same line, same timeout fails on main CI run 26412030433 (push fix(security): audit Claude permission overrides under YOLO (#86557)) at 17:18Z — 13 minutes before this PR's run. The most recent successful main run (#86571 at 17:24Z) happens to pass it; the flake is timing-dependent.

This PR touches src/agents/pi-embedded-runner/run/attempt.session-lock.ts (+1 line) and the matching test file — completely separate code path from the model-catalog-visibility test. Local node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts exits 0.

Per AGENTS.md "If [a failing check is] unrelated on latest origin/main, say so with scoped proof" — flagging this so it doesn't become a maintainer-side speed bump.

@clawsweeper clawsweeper Bot added P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. labels May 25, 2026
ubehera added a commit to ubehera/openclaw that referenced this pull request May 25, 2026
…gated)

Replaces the prior refreshBeforeLock approach (91ba92c) per ClawSweeper
review on PR openclaw#86584. The earlier design used a blanket
refreshAfterOwnedSessionWrite() right before each external-hook lock
acquisition; that approach could silently accept genuine external mutations
because it advanced the fence without proving the new state was the lane's
own write.

New approach: route pi's _persist writes through the existing
ownedSessionFileWrites machinery instead of blind fence refreshes.

- Add EmbeddedAttemptSessionLockController.publishOwnedPostMessageWrite()
  (sync). Called from the onMessagePersisted callback right after pi's
  sessionManager.appendMessage -> _persist -> appendFileSync completes.
  Records the post-write fingerprint as an OWNED write in the
  process-shared ownedSessionFileWrites map so subsequent
  assertSessionFileFence calls accept the lane's own writes via the
  owned-write match path.

- Trust-gate: publishOwnedPostMessageWrite only records the new state
  when the prior fenceFingerprint is in trustedSessionFileStates (set at
  releaseForPrompt time). External mutations that bypass the callback are
  never recorded as owned and continue to trip
  EmbeddedAttemptSessionTakeoverError correctly.

- Revert installSessionExternalHookWriteLock's refreshBeforeLock param and
  the shared waitBeforeLock closure. Each installLockableFunction call goes
  back to waitBeforeLock: () => waitForSessionEventQueue(session) directly.

- runEmbeddedAttempt's onMessagePersisted callback now calls
  sessionLockController.publishOwnedPostMessageWrite() instead of
  refreshAfterOwnedSessionWrite().

Tests (5 cases under "embedded attempt session lock lifecycle"):

- "accepts pi-style writes published via publishOwnedPostMessageWrite
  before beforeToolCall fires" — positive case.
- "trips takeover on a same-file external write that bypasses
  publishOwnedPostMessageWrite" — NEGATIVE case the bot explicitly
  required ("add a negative test for an external same-file append before
  a hook lock"). Verifies fail-closed invariant preserved.
- "allows multiple prompt turns with pi-style writes published per turn"
  — multi-turn case, three iterations, no false takeover.
- "does not classify a different session file's writes as owned by this
  controller" — cross-file isolation.
- "does not hang when reacquireAfterPrompt rejects after a pi-style
  direct write" — abort/error path.

78 cases across 2 test files green; pnpm check:changed clean.

Closes openclaw#86572.
@ubehera ubehera changed the title fix(agents): refresh fence fingerprint before external-hook write lock (#86572) fix(agents): publish pi's owned writes via onMessagePersisted (#86572) May 25, 2026
@ubehera

ubehera commented May 25, 2026

Copy link
Copy Markdown
Contributor Author

Redesigned per ClawSweeper review on 91ba92c9ab (force-pushed to d824be0298).

Change in approach:

  • Old (91ba92c9ab): installSessionExternalHookWriteLock got a refreshBeforeLock param that called refreshAfterOwnedSessionWrite() before each hook acquired the lock. That blanket-refreshed the fence even for genuine external mutations, weakening takeover detection.
  • New (d824be0298): A new publishOwnedPostMessageWrite() method is wired into onMessagePersisted (the existing pi _persist callback point). It records the post-write fingerprint as owned in ownedSessionFileWrites, gated on isTrustedSessionFileState(sessionFileFenceKey, fenceFingerprint). External mutations that bypass the callback never get published; the fence still trips on them via assertSessionFileFence's existing owned-write match path.

Addresses bot guidance verbatim:

"Please only advance the fence for writes proven to be this controller's own pi/session-manager persistence, and add a negative test for an external same-file append before a hook lock."

  • Trust gate: The new method only records when the prior fenceFingerprint is in trustedSessionFileStates (set at releaseForPrompt time). Same-file external mutations cannot bypass this — they don't go through onMessagePersisted.
  • Negative test added: trips takeover on a same-file external write that bypasses publishOwnedPostMessageWrite — verifies the fail-closed invariant.

Proof: Three-scenario harness covers pre-fix bug, post-fix success, and the negative case. Verbatim output in the PR body.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 25, 2026
@clawsweeper clawsweeper Bot added status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels May 25, 2026
@ubehera

ubehera commented May 25, 2026

Copy link
Copy Markdown
Contributor Author

CI status: checks-node-core-runtime-cron-service fails on the unrelated test src/cron/service/timer.regression.test.ts:989records per-job start time and duration for batched due jobs. Assertion: expected 70 to be 50 // Object.is equality.

This is a timing-sensitive flake in the cron-service batched-due-jobs path (uses a mutable let now shared between two synchronous mock job runs). It has no dependency on the files this PR touches (src/agents/pi-embedded-runner/run/attempt.session-lock.ts, attempt.session-lock.test.ts, attempt.ts).

All other checks pass on d824be0298, including checks-node-agentic-agents (which covers the embedded-runner test surface this PR modifies).

@ubehera ubehera changed the title fix(agents): publish pi's owned writes via onMessagePersisted (#86572) fix(agents): gate owned-write publish on pre-append fingerprint (#86572) May 25, 2026
@ubehera

ubehera commented May 25, 2026

Copy link
Copy Markdown
Contributor Author

Force-pushed 584ba0e65b addressing ClawSweeper's correctness finding on d824be0298.

The bot's exact concern, addressed:

"This records the current fingerprint as owned whenever the previous fence fingerprint was trusted. If another lane/process appends to the same session file after releaseForPrompt() and then this lane appends its pi message before onMessagePersisted() runs, current contains both writes; recording that combined fingerprint makes the later hook-lock fence accept the external mutation as owned."

Fix shape:

  • publishOwnedPostMessageWrite() now takes a beforeWrite: SessionFileFingerprint | undefined argument and gates trust on the pre-append fingerprint, not on the current fenceFingerprint.
  • The pre-append fingerprint is captured by a new beforeMessagePersist hook in session-tool-result-guard, invoked before originalAppend(message). The snapshot flows back to the caller through onMessagePersisted's new context arg (typed unknown to keep the guard module-agnostic).
  • runEmbeddedAttempt wires beforeMessagePersist: () => readSessionFileFingerprintSync(params.sessionFile) and forwards the snapshot to publishOwnedPostMessageWrite(beforeWriteSnapshot).

Mixed-interleaving negative test added:

it("trips takeover on a mixed external-then-pi append (publish refuses to launder external mutation)", async () => {
  await controller.releaseForPrompt();                            // F0 trusted
  await fs.appendFile(sessionFile, '...external...', "utf8");     // F0 -> F1
  const beforeWrite = readSessionFileFingerprintSync(sessionFile);// F1
  await fs.appendFile(sessionFile, '...pi-after-external...');    // F1 -> F2
  controller.publishOwnedPostMessageWrite(beforeWrite);           // F1 NOT trusted: skip
  await expect(session.agent.beforeToolCall()).rejects.toBeInstanceOf(
    EmbeddedAttemptSessionTakeoverError,                          // fail closed
  );
});

Proof harness updated to a 4-scenario harness; Scenario D is the mixed-interleave fail-closed verification. All 4 PASS. Verbatim stdout in the PR body.

80 tests across the session-lock surface green (+1 from the new mixed-interleave test). 46 cases across the session-tool-result-guard surfaces also green — no regressions from the new beforeMessagePersist opt.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@ubehera

ubehera commented May 29, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

PR body has been refreshed since the last review pass:

Asking for fresh consideration of the rating now that the Real behavior proof gate is green.

@clawsweeper

clawsweeper Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. labels May 29, 2026
augusteo added a commit to getboon/openclaw that referenced this pull request May 29, 2026
* fix(agents): skip fallback for session coordination errors

Preserve provider fallback metadata when session coordination errors are nested under provider failures.

Co-authored-by: luyao618 <364939526@qq.com>
(cherry picked from commit 6a5a135)

* fix(agents): tolerate in-process session writes during prompt release (openclaw#84250)

Merged via squash.

Prepared head SHA: 33f88fe
Co-authored-by: tianxiaochannel-oss88 <272340815+tianxiaochannel-oss88@users.noreply.github.com>
Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
Reviewed-by: @jalehman

(cherry picked from commit 1b77145)

* fix(agents): bound embedded compaction write locks

Fixes the embedded attempt session write-lock watchdog so the fallback max hold time follows the resolved compaction timeout plus the existing lock grace window, instead of inheriting the full run timeout.

Adds regression coverage for the helper and settled-compaction lock lifecycle, plus a changelog entry thanking @luoyanglang.

Verification:
- `pnpm test src/agents/session-write-lock.test.ts src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts`
- `pnpm check:changed` via Blacksmith Testbox `tbx_01ks8b6vn8se5cg1dfn3te3g47` / https://github.com/openclaw/openclaw/actions/runs/26301988670
- Autoreview clean: `/Users/steipete/Projects/agent-scripts/skills/autoreview/scripts/autoreview --mode branch --base origin/main`
- PR CI green on `79e8c5f1a637981d263c0268bf5666967ff4e778`: https://github.com/openclaw/openclaw/actions/runs/26302152844 and https://github.com/openclaw/openclaw/actions/runs/26302152798

Co-authored-by: luoyanglang <hanwanlonga@gmail.com>
(cherry picked from commit 46de078)

* fix(session-lock): enforce maxHoldMs in shouldReclaim during lock acquisition (openclaw#85764)

* fix(session-lock): enforce maxHoldMs in shouldReclaim during lock acquisition

- Adds optional maxHoldMs parameter to inspectLockPayload
- Inspect now marks locks as stale when held longer than maxHoldMs
- Passes maxHoldMs through inspectLockPayloadForSession
- acquireSessionWriteLock's shouldReclaim callback now passes maxHoldMs

This ensures that when a live process holds a lock for longer than
maxHoldMs (default 5min), other processes can reclaim it during
acquisition — matching the watchdog's existing enforcement.

Previously shouldReclaim only used staleMs (30min default), meaning
a lock held for 10+ minutes by a live PID would never be reclaimable,
causing 60s timeout failures and gateway freezes.

Closes openclaw#85762

* fix(session-lock): add dead-PID fast-path before retry loop

Adds a fast-path check at the top of acquireSessionWriteLock:
if the lock file's owner PID is dead, remove it immediately
before entering the retry loop. This saves up to timeoutMs (60s)
of futile waiting when the previous lock holder has died.

The shouldReclaim callback already handles this case, but only
iteratively through the retry loop. The fast-path eliminates
that unnecessary delay.

* fix(session-lock): enforce max hold during acquisition

* fix(session-lock): revalidate max hold safely

* fix(session-lock): honor holder max-hold policy

* fix(session-lock): keep cleanup from reclaiming live holders

* fix(session-lock): remove stale locks only when unchanged

* fix(session-lock): skip self-held max-hold reclaim

* fix(ci): refresh gateway protocol checks

---------

Co-authored-by: njuboy11 <njuboy11@users.noreply.github.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
(cherry picked from commit a1eb765)

* fix(embedded-runner): preserve provider errors on cleanup takeover (openclaw#84321)

Summary:
- The PR preserves provider-facing embedded-runner prompt errors when cleanup detects session takeover, keeps the takeover signal fatal for fallback, and adds focused regressions.
- PR surface: Source +52, Tests +92. Total +144 across 5 files.
- Reproducibility: yes. Source inspection shows current main can let cleanup takeover replace a prior prompt/p ... rror and can normalize a provider-looking takeover wrapper before fallback sees it as coordination failure.

Automerge notes:
- PR branch already contained follow-up commit before automerge: fix(embedded-runner): preserve takeover during fallback
- PR branch already contained follow-up commit before automerge: fix(clawsweeper): address review for automerge-openclaw-openclaw-8405…

Validation:
- ClawSweeper review passed for head 050c779.
- Required merge gates passed before the squash merge.

Prepared head SHA: 050c779
Review: openclaw#84321 (comment)

Co-authored-by: abnershang <abner.shang@gmail.com>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: takhoffman
Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com>
(cherry picked from commit 7fbca96)

* fix(agents): release embedded-attempt session lock on every exit path (openclaw#86427)

* fix(agents): release embedded-attempt session lock on every exit path

The embedded run controller acquires its session write lock eagerly at
creation and released it only inside the post-run cleanup block. An
exception thrown in post-prompt processing skipped that block, so the lock
leaked to the live gateway process until the watchdog reclaimed it and
later requests to the session failed with SessionWriteLockTimeoutError.

Add an idempotent dispose() to the lock controller and call it from the
run's outer finally so the eagerly-held lock is released on every exit
path. Normal/aborted/timed-out runs still hand the lock to
acquireForCleanup first, so dispose() is a no-op then (no double release).

Fixes openclaw#86014

* fix: keep session lock teardown comment lean

* docs(changelog): note embedded session lock fix

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
(cherry picked from commit 32ddfc2)

* fix(agents): fence yield abort lock release

(cherry picked from commit 0fe7479)

* fix(agents): memoize session lock owner args

Memoize owner process argv lookups per PID during `cleanStaleLockFiles`, and yield between lock entries so startup cleanup does not monopolize the event loop while inspecting many session locks.

This keeps lock classification semantics unchanged while avoiding repeated synchronous process-args reads for lock clusters owned by the same PID, especially the Windows PowerShell path.

Fixes openclaw#86509.

Verification:
- `git diff --check origin/main...HEAD`
- focused TSX harness against the current-main merge result: `session-lock memo regression harness passed`

Thanks @openperf.

Co-authored-by: openperf <16864032@qq.com>
(cherry picked from commit c430fcd)

* fix(diagnostics): recover orphaned session activity

Recover idle queued sessions whose diagnostic activity retained stale ownerless model or tool calls by classifying them as recoverable session.stuck after the usual recovery gates. Yield the event loop before stale session-lock process inspection so sync process lookup cannot monopolize lock contention paths.

Docs now describe the widened session.stuck telemetry contract for recoverable stale bookkeeping, including ownerless activity. Thanks @samuelsoaress.

Refs openclaw#84903.

Co-authored-by: samuelsoaress <samuelsoares177778@gmail.com>
(cherry picked from commit 286964c)

* [FORK][openclaw#86584] gate owned-write publish on pre-append fingerprint (fixes openclaw#86572)

Carries unmerged upstream PR openclaw#86584 (HEAD d79a3b4) onto the boon 5.18 base
as the same-lane EmbeddedAttemptSessionTakeoverError fence fix for long cron
turns. Fails closed: an external mutation before pi's append fails the trust
gate and still trips the fence (verified by the PR's 303-line test suite incl.
the mixed-interleave negative test).

Backfills base symbols openclaw#86584 assumes (introduced upstream between 5.18 and the
PR base, not carried by the 9 merged race-fix picks):
- session-lock.ts: MAX_BENIGN_SESSION_FENCE_{ADVANCE,REWRITE,REWRITE_RESULT}_BYTES,
  MAX_SAFE_FILE_OFFSET, TRANSCRIPT_ONLY_OPENCLAW_ASSISTANT_MODELS,
  SessionFileFenceSnapshot type, fenceSnapshot state var, ActiveWriteLockState
  type + activeWriteLock store fix (reuse nested writes via {active:true}),
  node:util + string-normalization imports.
- transcript-append.ts: wrap appendSessionTranscriptMessage in
  runWithOwnedSessionTranscriptWriteLock so low-level appends acquire the
  owned-context lock.
- test import fixes (appendSessionTranscriptMessage, withOwned/bindOwned, __testing).

Drop when upstream merges openclaw#86584.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [FORK][openclaw#86584] wire owned-transcript-write context + typecheck cleanup

CRITICAL: wrap promptActiveSession in withOwnedSessionTranscriptWrites and bind
onBlockReply/onBlockReplyFlush to the owned context in attempt.ts. Without this,
pi's own transcript appends during a prompt are NOT recorded as owned, so the
fence trips on them (the exact takeover the backport is meant to prevent). This
wiring is an intermediate-base feature (between 5.18 and openclaw#84250's base) the merged
picks didn't carry. Tests passed before only because they set the context manually.

Also: add releaseHeldLockForAbort to the controller type; drop incidental non-fence
suppressAssistantErrorPersistence passes; remove dead async benign-rewrite cluster
(sessionFence{Advance,Rewrite}IsBenign + readAppendedSessionFileText +
lineMatchesLinearTranscriptMigration + helpers) — our openclaw#84250-based assertSessionFileFence
uses the sync owned-write path, so the async benign-detection variants are unreachable.
tsgo core: 0 errors. 384 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [FORK][openclaw#86584] address codex review: prefix-validate benign advance + preserve provider error

Finding 2 (masking gap, P2): sessionFenceAdvanceIsBenignSync only validated the
APPENDED bytes, so a writer that rewrote the existing prefix AND appended a benign
delivery-mirror/gateway-injected line could be laundered as an owned advance —
masking a genuine external takeover (silent message loss). Now fail closed unless
the current prefix is byte-identical to the trusted readSessionFileFenceSnapshot
text (readSessionFilePrefixSync); absent snapshot text => not benign.

Finding 1 (provider-error masking, P2): wrappedStreamFn's finally let a
reacquireAfterPrompt() takeover error mask the original provider error when the
stream itself threw. Now only surface the reacquire error when the stream
succeeded; otherwise preserve the original failure.

tsgo core: 0 errors. 384 tests pass (benign-advance acceptance + external-mutation
rejection both green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(release): 2026.5.18-boon.1 — session-takeover hardening (boon fleet build)

Version bump + CHANGELOG for the fork build. Also fixes a backport test-import
gap: attempt.test.ts referenced `attemptTesting` (the __testing export) without
importing it. Full project typecheck (tsgo -b tsconfig.projects.json): 0 errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(ci): no-unsafe-finally in wrappedStreamFn + drop collateral protocol/test churn

- wrappedStreamFn: restructure provider-error-preservation without a throw inside
  finally (oxlint no-unsafe-finally). Same semantics: always reacquire; prefer the
  original stream error over a reacquire takeover error; surface reacquire error
  only when the stream succeeded.
- Revert src/gateway/server-methods/agent.test.ts + GatewayModels.swift to the 5.18
  baseline: the openclaw#85764 cherry-pick conflict-resolution had pulled in openclaw#85256-era
  internal-session-effect tests + protocol fields whose implementation isn't in this
  backport, breaking checks-node-agentic-gateway-methods + checks-fast-bundled-protocol.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: remove vestigial onAssistantErrorMessagePersisted option decls

Address cubic P2 review (PR #2): the option was declared on the guard
and guard-wrapper option types but never forwarded or invoked, so any
provided callback was silently ignored. The companion error-suppression
feature (suppressAssistantErrorPersistence + the agent-runner/followup
caller chain) is deliberately scoped OUT of this 5.18 backport, so the
decls were dead plumbing left over from a cherry-pick. Remove them to
keep the option surface honest; the load-bearing beforeMessagePersist
fence checkpoint (openclaw#86572) is retained.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Yao <364939526@qq.com>
Co-authored-by: xiaotian <tianxiaochannel@gmail.com>
Co-authored-by: 狼哥 <hanwanlonga@gmail.com>
Co-authored-by: njuboy <njuboy11@gmail.com>
Co-authored-by: njuboy11 <njuboy11@users.noreply.github.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: abnershang <abner.shang@gmail.com>
Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com>
Co-authored-by: Chunyue Wang <80630709+openperf@users.noreply.github.com>
Co-authored-by: openperf <16864032@qq.com>
Co-authored-by: Samuel Soares da Silva <samuelsoares177778@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tonga54 added a commit to arkimedes-dev/openclaw that referenced this pull request May 30, 2026
@ubehera ubehera force-pushed the fix/session-takeover-refresh-before-hook-lock branch from d780c41 to 43baac8 Compare June 2, 2026 08:29
@openclaw-barnacle openclaw-barnacle Bot added size: M and removed size: S proof: sufficient ClawSweeper judged the real behavior proof convincing. labels Jun 2, 2026
@ubehera

ubehera commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

Force-pushed 43baac80c5 — rebased onto current upstream/main and added the checked-in regression coverage flagged in the last pass (the "0 new test cases added in the latest diff" metric / maintainer option "Ask For Checked-In Regression Coverage").

Added (src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts): a new describe("publishOwnedPostMessageWrite trust gate (#86572)") block with 3 cases that drive the publish path directly:

  1. skips publish → trips — a same-lane write that never calls publishOwnedPostMessageWrite still trips the fence (fail-closed default; publish is the required opt-in).
  2. publish + trusted baseline → accepts — pre-append fingerprint F0 is trusted, so the lane's own write is recorded as owned and the hook fires cleanly (hasSessionTakeover() === false).
  3. mixed interleave → still trips — an external write advances F0→F1 first, so the captured baseline (F1) was never trusted; publishOwnedPostMessageWrite refuses to record the combined F2 state and the takeover fence still fires. This is the same-lane/mixed-interleave scenario from the proof harness, now checked in.

Harness Scenario C (pure external advance) is intentionally not duplicated — it is already covered by the existing rejects post-prompt writes when another owner advances the session file test.

Verification on the rebased branch (43baac80c5):

  • node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts → 45/45 pass.
  • Full touched surface (attempt.session-lock + session-tool-result-guard{,.transcript-events,.tool-result-persist-hook}) → 113/113 pass.
  • oxlint exit 0; oxfmt --check clean.
  • Mutation-checked for teeth: test 2 fails if the owned-write recording is removed; test 3 fails if the fail-closed trust gate is removed — confirming each test guards the fix rather than restating current behavior.

No source behavior changed in this push — the diff vs the prior head is the rebase plus the three test cases.

@clawsweeper

clawsweeper Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 2, 2026
@ubehera

ubehera commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Latest-head broad gate — pnpm check:changed green on Linux

Ran the current-head broad changed gate (the latest-head verification flagged in the last review) on 43baac80c5, via crabbox local-container (Linux, ubuntu:26.04, Node 24), scoped to this PR's 5 changed files:

node scripts/crabbox-wrapper.mjs run --provider local-container -- \
  env CI=1 OPENCLAW_TEST_PROJECTS_PARALLEL=6 OPENCLAW_VITEST_MAX_WORKERS=1 \
  OPENCLAW_VITEST_NO_OUTPUT_TIMEOUT_MS=900000 pnpm check:changed

Result: exit=0, 0 failures (command 3m08s / total 5m36s). Changed-gate scope confirmed = exactly the 5 changed files (lanes=core, coreTests); no unrelated drift. Green steps:

  • typecheck core + typecheck core tests
  • lint core changed files — Found 0 warnings and 0 errors
  • dependency pin guard (464 specs, 0 violations), package patch guard
  • conflict markers, changelog attributions, plugin-sdk / extension wildcard re-export guards, duplicate-coverage, runtime import cycles, media / sidecar / webhook / pairing guards

Test execution (separate from check:changed): focused run-vitest on the touched surface is green — attempt.session-lock.test.ts 45/45 (including the 3 new publishOwnedPostMessageWrite trust-gate cases) plus the three session-tool-result-guard suites; 103 tests across 4 files. All GitHub checks on 43baac80c5 are green as well.

Proof note: this ran on a local-container Linux runner (Node 24, CI-matching), not the brokered runner. A maintainer can re-run the brokered broad gate for the canonical environment if desired.

@htuyer121

Copy link
Copy Markdown

Timeline correction on the production corroboration attributed to me — my 30-event corpus is PRE-#87159 in our install, not "post-#87159 still firing."

Thanks for carrying the production data. One clarification I should make before it weighs on the merge, because the current PR body reads our corpus as the class "continues firing in real installs after #87159 landed" / "post-#87159 code" — that isn't accurate for our install:

So our corpus documents the class while it was active pre-#87159; it isn't evidence that it persists after #87159. The "rate accelerating / spread widening" in my 05-28 comment was the tail of the pre-upgrade period and shouldn't be read as current.

What our data can't do is corroborate that the same-lane path this PR targets is still firing post-#87159 — once the cross-lane boundary is closed, our production window can't isolate a same-lane-only race. So I'd weight this PR on its own source-level proof (the 4-scenario harness — A trips pre-fix, B clean post-fix, C/D fail-closed — independently demonstrates the same-lane race and the trust gate's fail-closed behavior), rather than on our corpus as a post-#87159 signal. The 30 anonymized session .jsonl files (all pre-#87159) remain available if a diff-driven repro is still useful.

@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 2, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 2, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 2, 2026
@ubehera

ubehera commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @htuyer121 — appreciate the careful correction, and you're right. I've updated the PR body accordingly: the 30-event corpus is now described as pre-#87159 (build 2026.5.26), and the "continues firing after #87159" framing is gone. The "Production context" section now leads with your correction — zero events over ~6 days post-#87159, and the point that a same-lane-only race can't be isolated from production once #87159 closes the cross-lane boundary — and explicitly weights the PR on its source-level proof rather than the corpus, per your recommendation.

For clarity on what the PR now claims: defense-in-depth hardening of the same-lane publish boundary (fail-open unconditional fence refresh → fail-closed trust gate), not a fix for an actively-firing production class. The meaningful source-level scenario is the mixed-interleave case (an external mutation interleaved with the lane's own write being laundered into the fence), which the trust gate refuses to record.

The offer of the anonymized .jsonl files is still genuinely useful for a future diff-driven repro — thank you for that.

@ubehera

ubehera commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

One cleanup commit plus a PR-body correction since the last pass on 43baac80c5. New head b06e2b72d0 (fast-forward, not a force-push — the reviewed feature commit 43baac80c5 is untouched).

1. Removed dead refreshAfterOwnedSessionWrite (method + interface entry) — src/agents/embedded-agent-runner/run/attempt.session-lock.ts

This is upstream main's method (introduced in 2bb00f6726 "fix(agents): fence embedded session writes", carried through the #85341 rename). Its only caller was onMessagePersisted, which this PR's core fix already rewires to publishOwnedPostMessageWrite(beforeWriteSnapshot) (attempt.ts). Superseding that single call site left the method with zero production references — rg -uu (including generated/ignored files) confirms no remaining source reference on this branch; the only hits are regenerated build artifacts and git internals. Removing it completes the refactor to one canonical path: the fail-open unconditional refresh is exactly the behavior this PR replaces, so leaving it defined would strand a dead fail-open path beside the new fail-closed gate. Not a plugin-sdk export (the type is internal; its only structural consumer, attempt-abort.ts, uses Pick<…, "releaseHeldLockForAbort">, unaffected).

2. Reworked the two tests that referenced itattempt.session-lock.test.ts

  • Rewrote "keeps post-provider transcript writes owned after prompt stream returns" onto the production method publishOwnedPostMessageWrite(beforeWrite), capturing the pre-append fingerprint exactly as beforeMessagePersist does. This preserves its unique post-reacquireAfterPrompt coverage and is mutation-sensitive: if the publish were a no-op, acquireForCleanup's assertSessionFileFence sees current=F1 ≠ fence=F0, trips, and the test fails. Added a comment marking it as the deliberate post-reacquire path.
  • Deleted "refreshes the prompt fence after an owned session manager append" as redundant — it tested the removed method, and its production-API analog is the existing "accepts a same-lane write published with a trusted pre-append fingerprint" trust-gate case.

3. PR body corrected (no code impact)

Removed the inaccurate "@htuyer121's corpus shows the class continues firing after #87159" framing per their correction (corpus is pre-#87159; zero events post-#87159). Reframed the headline as fail-open → fail-closed publish-boundary hardening, and added a Fix shape vs. #86572 note (publish boundary, not the proposed ALS-scope hoist) plus a Known limitation note (the narrow pre-append window, owned by #87159's file-ownership boundary).

Verification on b06e2b72d0:

  • node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts44/44 (was 45; −1 from the redundant-test deletion).
  • oxlint exit 0; oxfmt --check clean.
  • crabbox local-container pnpm check:changed (Ubuntu 26.04, Node 24) → exit=0 (command 3m16s / total 5m40s): typecheck core + core tests clean, lint 0 warnings / 0 errors on the 5 changed files, 0 runtime import cycles, all guards (dependency pins, package patches, media/sidecar/webhook/pairing) pass.

Net −23 LOC (source −7, tests −16). No production behavior change vs the prior head — this commit removes a now-dead path that the fix had already superseded and refreshes its coverage onto the production seam.

@clawsweeper

clawsweeper Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 2, 2026
@sparkli1022-lab

sparkli1022-lab commented Jun 5, 2026

Copy link
Copy Markdown

Coverage note for the publish-boundary fix shape.

A runtime-owned custom session entry can advance the same session file during the prompt-lock release window before the next message append.

Observed ordering from a packaged openclaw@2026.6.1 Docker gateway run:

  • thinking_level_change
  • custom entry model-snapshot
  • user message
  • subsequent fence check raised EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released

If the owned-write publish path is applied only to sessionManager.appendMessage, the later user-message append captures a pre-write fingerprint that already includes the unpublished custom entry. The fail-closed trust gate then refuses to publish the message append, so the fence still trips.

In this scenario, the publish-boundary coverage needs to include runtime-owned custom session entries as well as message entries, or those custom writes need an equivalent ownership/locking path so they cannot advance the session file unrecorded during the prompt-lock release window.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: M status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Hoist withOwnedSessionTranscriptWrites ALS scope to span agent.prompt() to fix vanilla-openclaw same-lane fence trip

5 participants