fix(embedded-runner): preserve provider errors on cleanup takeover #84321

Merged
clawsweeper[bot] merged 3 commits into main from clawsweeper/automerge-openclaw-openclaw-84056
May 26, 2026

Conversation

@clawsweeper

@clawsweeper clawsweeper Bot commented May 19, 2026

Contributor

Makes #84056 merge-ready for the ClawSweeper automerge loop.
The edit pass should inspect the live PR diff, review comments, and failing checks; rebase if needed; keep the contributor branch credited; and stop only when validation is green or an external blocker is proven.

ClawSweeper 🐠 replacement reef notes:

  • Repair fallback: GitHub rejected the repair branch push because it updates workflow files and the ClawSweeper app token does not have the workflows permission.

Co-author credit kept:

fish notes: model gpt-5.5, reasoning high; reviewed against e7d9d8c.

@clawsweeper clawsweeper Bot added labels May 19, 2026:

  • agents: Agent runtime and tooling
  • size: S
  • clawsweeper:automerge: Maintainer opted this PR into bounded ClawSweeper-reviewed automerge
  • proof: supplied: External PR includes structured after-fix real behavior proof.
  • proof: sufficient: ClawSweeper judged the real behavior proof convincing.
  • P2: Normal backlog priority with limited blast radius.
  • rating: 🐚 platinum hermit: Good normal PR readiness with ordinary maintainer review expected.
  • merge-risk: 🚨 auth-provider: 🚨 May break OAuth, tokens, provider routing, model choice, or credentials.
  • merge-risk: 🚨 session-state: 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR.
  • clawsweeper: Tracked by ClawSweeper automation
@clawsweeper

clawsweeper Bot commented May 19, 2026

Contributor Author

Codex review: passed. Reviewed May 25, 2026, 11:05 PM ET / 03:05 UTC.

Summary
The PR preserves provider-facing embedded-runner prompt errors when cleanup detects session takeover, keeps the takeover signal fatal for fallback, and adds focused regressions.

PR surface: Source +52, Tests +92. Total +144 across 5 files.

Reproducibility: yes. Source inspection shows that current main can let a cleanup takeover replace a prior prompt/provider error, and can normalize a provider-looking takeover wrapper before fallback recognizes it as a coordination failure.

Review metrics: none identified.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.
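
The "weaker of the two" rule can be sketched as a minimum over an ordered rank list. This is only an illustration: the ordering is taken from the rank legend in this review, and the names `RANKS` and `overallRank` are hypothetical, not ClawSweeper's code.

```typescript
// Ranks ordered weakest to strongest, following the rank legend in this review.
// "off-meta tidepool" is left out because it means the rating does not apply.
const RANKS = [
  "unranked krab",
  "silver shellfish",
  "gold shrimp",
  "platinum hermit",
  "diamond lobster",
  "challenger crab",
] as const;

type Rank = (typeof RANKS)[number];

// Hypothetical helper: the overall rank is whichever input sits lower in RANKS.
function overallRank(proof: Rank, patchQuality: Rank): Rank {
  const weakerIndex = Math.min(RANKS.indexOf(proof), RANKS.indexOf(patchQuality));
  return RANKS[weakerIndex];
}
```

With the values reported here (proof 🦞 diamond lobster, patch quality 🐚 platinum hermit), this sketch yields platinum hermit, matching the Overall line.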

Rank-up moves:

  • none

Risk before merge

  • Merging intentionally makes takeover-marked provider errors abort model fallback, so configured fallback models will not be tried when the visible provider message also looks failoverable.
  • The proof is a runtime classifier probe plus focused embedded-runner/fallback harness coverage, not a forced live provider outage with an actual cleanup-race log.

Maintainer options:

  1. Accept the fallback/session takeover contract (recommended)
    Merge after required checks if maintainers agree that a local cleanup takeover should abort model fallback even when the provider-facing message looks failoverable.
  2. Pause for live cleanup-race proof
    Ask for a redacted runtime log or terminal artifact from an actual provider failure followed by cleanup takeover if boundary-level proof is not enough for this session-state path.

Next step before merge
No ClawSweeper repair lane is needed; the remaining path is exact-head merge gating and maintainer acceptance of the fallback/session-state risk.

Security
Cleared: The diff touches TypeScript agent runtime code and tests only; I found no concrete dependency, workflow, secret-handling, package, or supply-chain regression.

Review details

Best possible solution:

Land the narrow error-precedence and fallback-abort fix after exact-head merge gates if maintainers accept boundary-level proof for this internal session takeover race.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection shows that current main can let a cleanup takeover replace a prior prompt/provider error, and can normalize a provider-looking takeover wrapper before fallback recognizes it as a coordination failure.

Is this the best way to solve the issue?

Yes. The patch is narrow: it preserves the provider-facing message while carrying the takeover identity/cause so fallback stops, and it covers cleanup-only takeover separately.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 0d23c3b4e133.

Label changes

Label changes:

  • remove status: 👀 ready for maintainer look: Current PR status label is status: 🚀 automerge armed.

Label justifications:

  • P2: This is a focused bug fix for an agent runtime fallback edge case with limited but real provider/session impact.
  • merge-risk: 🚨 auth-provider: The PR changes when provider/model fallback aborts instead of trying configured fallback candidates.
  • merge-risk: 🚨 session-state: The PR changes cleanup session-takeover precedence and preserves takeover state through the thrown error path.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 🚀 automerge armed: This PR is in ClawSweeper's automerge lane.
  • proof: sufficient: Contributor real behavior proof is sufficient (terminal). The source PR provides structured after-fix terminal output from a runtime fallback-classifier probe plus focused validation, and this replacement keeps that proof trail.

Evidence reviewed

PR surface:

Source +52, Tests +92. Total +144 across 5 files.

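| Area | Files | Added | Removed | Net |
| --- | --- | --- | --- | --- |
| Source | 3 | 56 | 4 | +52 |
| Tests | 2 | 92 | 0 | +92 |
| Docs | 0 | 0 | 0 | 0 |
| Config | 0 | 0 | 0 | 0 |
| Generated | 0 | 0 | 0 | 0 |
| Other | 0 | 0 | 0 | 0 |
| Total | 5 | 148 | 4 | +144 |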

What I checked:

  • Current main cleanup precedence still drops the provider error: Current main emits and rejects cleanupError ahead of promptError, so a cleanup-time session takeover can replace the earlier provider/prompt failure before the caller sees it. (src/agents/pi-embedded-runner/run/attempt.ts:5176, 0d23c3b4e133)
  • Current main fallback can normalize takeover-marked provider-looking errors: runFallbackCandidate currently normalizes failoverable-looking errors before a local coordination check, so a takeover wrapper with a rate-limit-looking message can become provider failover. (src/agents/model-fallback.ts:257, 0d23c3b4e133)
  • PR preserves the prompt error while carrying cleanup takeover: The PR synthesizes or captures cleanup takeover, emits diagnostics with the prompt error when preserving it, and throws a wrapper whose visible message is the provider error while the cause remains EmbeddedAttemptSessionTakeoverError. (src/agents/pi-embedded-runner/run/attempt.ts:5199, 050c779cfa61)
  • PR checks coordination before failover normalization: The patch adds an early isNonProviderRuntimeCoordinationError check before coerceToFailoverError in runFallbackCandidate, preventing fallback from consuming another model after local session takeover. (src/agents/model-fallback.ts:261, 050c779cfa61)
  • Focused regression coverage is present: The PR adds embedded-runner tests for prompt-error preservation and cleanup-only takeover, plus model-fallback coverage proving a takeover-carrying provider error aborts after one attempt. (src/agents/pi-embedded-runner/run/attempt.spawn-workspace.context-engine.test.ts:1317, 050c779cfa61)
  • Source proof and replacement context were checked: The superseded source PR contains structured after-fix terminal proof for the runtime fallback classifier plus focused Vitest, tsgo, check:changed, and diff-check commands; this bot replacement preserves that credited work. (3240d6764653)
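
The two behaviors checked above (a wrapper that shows the provider message while carrying the takeover as its cause, and a coordination check that runs before failover handling) can be sketched roughly as follows. Only the names `EmbeddedAttemptSessionTakeoverError` and `isNonProviderRuntimeCoordinationError` come from the review; every body, plus `wrapPromptErrorWithTakeover`, `looksFailoverable`, and `runWithFallback`, is a hypothetical, simplified stand-in and not the repository's implementation (the real fallback path is presumably asynchronous).

```typescript
// Stand-in for the takeover error class named in the review (body is illustrative).
class EmbeddedAttemptSessionTakeoverError extends Error {
  constructor(message = "session was taken over during cleanup") {
    super(message);
    this.name = "EmbeddedAttemptSessionTakeoverError";
    Object.setPrototypeOf(this, new.target.prototype); // keep instanceof reliable on older targets
  }
}

// Hypothetical helper: keep the provider-facing message visible and carry
// the takeover as `cause`, so callers can still detect it.
function wrapPromptErrorWithTakeover(
  promptError: Error,
  takeover: EmbeddedAttemptSessionTakeoverError,
): Error {
  const wrapped = new Error(promptError.message);
  wrapped.name = promptError.name;
  Object.defineProperty(wrapped, "cause", { value: takeover, enumerable: false });
  return wrapped;
}

// Walks a bounded cause chain looking for a local coordination failure.
function isNonProviderRuntimeCoordinationError(err: unknown): boolean {
  let current: unknown = err;
  for (let depth = 0; depth < 8 && current instanceof Error; depth++) {
    if (current instanceof EmbeddedAttemptSessionTakeoverError) return true;
    current = (current as { cause?: unknown }).cause;
  }
  return false;
}

// Placeholder for provider failover classification (assumption: message-based).
function looksFailoverable(err: unknown): boolean {
  return err instanceof Error && /rate limit|overloaded/i.test(err.message);
}

// Simplified, synchronous fallback loop. The coordination check runs first,
// so a takeover-marked error never consumes another candidate.
function runWithFallback(candidates: Array<() => string>): string {
  let lastError: unknown = new Error("no fallback candidates configured");
  for (const run of candidates) {
    try {
      return run();
    } catch (err) {
      if (isNonProviderRuntimeCoordinationError(err)) throw err;
      if (!looksFailoverable(err)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}
```

In this sketch, an error whose message looks like a rate limit but whose cause is a takeover stops the loop after the first candidate, which is the contract the review asks maintainers to accept.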

Likely related people:

  • vincentkoc: Recent commits in the embedded-runner/session-fencing area and fallback/provider-resolution path make this a strong routing candidate for the session takeover and fallback boundary. (role: recent area contributor; confidence: high; commits: 2bb00f6726d4, a122d804dda8, 3c8d101f5a85; files: src/agents/pi-embedded-runner/run/attempt.ts, src/agents/pi-embedded-runner/run/attempt.session-lock.ts, src/agents/model-fallback.ts)
  • steipete: Recent history for model fallback and failover classification includes multiple steipete-authored changes to fallback selection and failover signal behavior. (role: fallback and failover area contributor; confidence: medium; commits: f4ba9553c029, 8c49121ec881, 936c02e22c98; files: src/agents/model-fallback.ts, src/agents/failover-error.ts)
  • jalehman: Recent embedded session write-lock work in the same takeover/fence area includes jalehman as a co-author or reviewer, which makes them useful for session-state contract review. (role: adjacent session-fencing reviewer/contributor; confidence: medium; commits: 1b77145687ca, cff5244a5b25; files: src/agents/pi-embedded-runner/run/attempt.ts, src/agents/pi-embedded-runner/run/attempt.session-lock.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@openclaw-barnacle openclaw-barnacle Bot removed the proof: supplied External PR includes structured after-fix real behavior proof. label May 19, 2026
@clawsweeper clawsweeper Bot added status: 🚀 automerge armed This PR is in ClawSweeper's automerge lane. clawsweeper:human-review Needs maintainer review before ClawSweeper can continue and removed status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 19, 2026
@clawsweeper

clawsweeper Bot commented May 19, 2026

Contributor Author

🦞✅
ClawSweeper is pausing this repair loop for human review.

Source: clawsweeper[bot]
Reason: No repair job is needed; the remaining action is normal automerge or maintainer handling after exact-head checks and risk acceptance. Cleared: The diff touches TypeScript agent runtime code, tests, and changelog only; I found no concrete dependency, CI, secret-handling, package, or supply-chain regression. (sha=e7d9d8cafeb2c5040e220bae5a0054a7623a0adf)

Why human review is needed:
This item has security-sensitive risk. ClawSweeper is pausing instead of making an autonomous change that could affect trust, credentials, permissions, or exposure.

Recommended next action:
Have a maintainer review the security-sensitive detail and provide an explicit safe path before asking ClawSweeper to continue.

I added clawsweeper:human-review and left the final call with a maintainer.

@Takhoffman

Contributor

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 25, 2026

Contributor Author

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper

clawsweeper Bot commented May 25, 2026

Contributor Author

ClawSweeper PR egg

✨ Hatched: 🥚 common Gilded Patch Peep

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🥚 common.
Trait: sniffs out flaky tests.
Image traits: location status garden; accessory commit compass; palette plum, gold, and soft gray; mood bright-eyed; pose pointing at a small proof artifact; shell starlit enamel shell; lighting subtle sparkle highlights; background small review tokens.
Copy: My PR egg hatched a 🥚 common Gilded Patch Peep in ClawSweeper.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@Takhoffman

Contributor

@clawsweeper automerge

@clawsweeper

clawsweeper Bot commented May 26, 2026

Contributor Author

🦞✅
ClawSweeper merged this PR after the passing review.

Source: clawsweeper[bot]
Feedback: structured ClawSweeper verdict: pass (sha=050c779cfa613efc14f6bc7713fcaedde27b0f7c)
Merge status: merged by ClawSweeper automerge
Merged at: 2026-05-26T03:09:27Z
Merge commit: 7fbca96a0cda

What merged:

  • The PR preserves provider-facing embedded-runner prompt errors when cleanup detects session takeover, keeps the takeover signal fatal for fallback, and adds focused regressions.
  • PR surface: Source +52, Tests +92. Total +144 across 5 files.
  • Reproducibility: yes. Source inspection shows current main can let cleanup takeover replace a prior prompt/p ... rror and can normalize a provider-looking takeover wrapper before fallback sees it as coordination failure.

Automerge notes:

  • PR branch already contained follow-up commit before automerge: fix(embedded-runner): preserve takeover during fallback
  • PR branch already contained follow-up commit before automerge: fix(clawsweeper): address review for automerge-openclaw-openclaw-8405…

The automerge loop is complete.

Automerge progress:

  • 2026-05-26 02:43:31 UTC review queued e7d9d8cafeb2 (queued)
  • 2026-05-26 03:09:15 UTC review passed 050c779cfa61 (structured ClawSweeper verdict: pass (sha=050c779cfa613efc14f6bc7713fcaedde27b0...)
  • 2026-05-26 03:09:29 UTC merged 050c779cfa61 (merged by ClawSweeper automerge)

@clawsweeper clawsweeper Bot removed the clawsweeper:human-review Needs maintainer review before ClawSweeper can continue label May 26, 2026
@clawsweeper clawsweeper Bot force-pushed the clawsweeper/automerge-openclaw-openclaw-84056 branch from e7d9d8c to 050c779 Compare May 26, 2026 02:58
@clawsweeper clawsweeper Bot added proof: supplied External PR includes structured after-fix real behavior proof. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 26, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: supplied External PR includes structured after-fix real behavior proof. label May 26, 2026
@clawsweeper clawsweeper Bot removed the status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. label May 26, 2026
@clawsweeper clawsweeper Bot merged commit 7fbca96 into main May 26, 2026
115 of 120 checks passed
@clawsweeper clawsweeper Bot deleted the clawsweeper/automerge-openclaw-openclaw-84056 branch May 26, 2026 03:09
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 26, 2026
…penclaw#84321)

Summary:
- The PR preserves provider-facing embedded-runner prompt errors when cleanup detects session takeover, keeps the takeover signal fatal for fallback, and adds focused regressions.
- PR surface: Source +52, Tests +92. Total +144 across 5 files.
- Reproducibility: yes. Source inspection shows current main can let cleanup takeover replace a prior prompt/p ... rror and can normalize a provider-looking takeover wrapper before fallback sees it as coordination failure.

Automerge notes:
- PR branch already contained follow-up commit before automerge: fix(embedded-runner): preserve takeover during fallback
- PR branch already contained follow-up commit before automerge: fix(clawsweeper): address review for automerge-openclaw-openclaw-8405…

Validation:
- ClawSweeper review passed for head 050c779.
- Required merge gates passed before the squash merge.

Prepared head SHA: 050c779
Review: openclaw#84321 (comment)

Co-authored-by: abnershang <abner.shang@gmail.com>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: takhoffman
Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com>
augusteo added a commit to getboon/openclaw that referenced this pull request May 29, 2026
* fix(agents): skip fallback for session coordination errors

Preserve provider fallback metadata when session coordination errors are nested under provider failures.

Co-authored-by: luyao618 <364939526@qq.com>
(cherry picked from commit 6a5a135)

* fix(agents): tolerate in-process session writes during prompt release (openclaw#84250)

Merged via squash.

Prepared head SHA: 33f88fe
Co-authored-by: tianxiaochannel-oss88 <272340815+tianxiaochannel-oss88@users.noreply.github.com>
Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
Reviewed-by: @jalehman

(cherry picked from commit 1b77145)

* fix(agents): bound embedded compaction write locks

Fixes the embedded attempt session write-lock watchdog so the fallback max hold time follows the resolved compaction timeout plus the existing lock grace window, instead of inheriting the full run timeout.

Adds regression coverage for the helper and settled-compaction lock lifecycle, plus a changelog entry thanking @luoyanglang.

Verification:
- `pnpm test src/agents/session-write-lock.test.ts src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts`
- `pnpm check:changed` via Blacksmith Testbox `tbx_01ks8b6vn8se5cg1dfn3te3g47` / https://github.com/openclaw/openclaw/actions/runs/26301988670
- Autoreview clean: `/Users/steipete/Projects/agent-scripts/skills/autoreview/scripts/autoreview --mode branch --base origin/main`
- PR CI green on `79e8c5f1a637981d263c0268bf5666967ff4e778`: https://github.com/openclaw/openclaw/actions/runs/26302152844 and https://github.com/openclaw/openclaw/actions/runs/26302152798

Co-authored-by: luoyanglang <hanwanlonga@gmail.com>
(cherry picked from commit 46de078)

* fix(session-lock): enforce maxHoldMs in shouldReclaim during lock acquisition (openclaw#85764)

* fix(session-lock): enforce maxHoldMs in shouldReclaim during lock acquisition

- Adds optional maxHoldMs parameter to inspectLockPayload
- Inspect now marks locks as stale when held longer than maxHoldMs
- Passes maxHoldMs through inspectLockPayloadForSession
- acquireSessionWriteLock's shouldReclaim callback now passes maxHoldMs

This ensures that when a live process holds a lock for longer than
maxHoldMs (default 5min), other processes can reclaim it during
acquisition — matching the watchdog's existing enforcement.

Previously shouldReclaim only used staleMs (30min default), meaning
a lock held for 10+ minutes by a live PID would never be reclaimable,
causing 60s timeout failures and gateway freezes.

Closes openclaw#85762
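
A rough sketch of the staleness rule described in this commit, assuming a hypothetical payload shape and options object. The function name `inspectLockPayload` and the `maxHoldMs`/`staleMs` names come from the commit message; the signature, the `createdAtMs` field, the injected `isPidAlive` probe, and the reason strings are illustrative assumptions.

```typescript
interface LockPayload {
  pid: number;
  createdAtMs: number; // assumed field: when the holder acquired the lock
}

interface InspectOptions {
  nowMs: number;
  staleMs: number; // age-based staleness (the commit cites a 30 min default)
  maxHoldMs?: number; // optional hold limit (the commit cites a 5 min default)
  isPidAlive: (pid: number) => boolean; // injected so the sketch stays runtime-agnostic
}

// Marks a lock stale when its owner is dead, when it is older than staleMs,
// or, if maxHoldMs is supplied, when it has been held longer than maxHoldMs.
function inspectLockPayload(
  payload: LockPayload,
  opts: InspectOptions,
): { stale: boolean; reasons: string[] } {
  const heldMs = opts.nowMs - payload.createdAtMs;
  const reasons: string[] = [];
  if (!opts.isPidAlive(payload.pid)) reasons.push("dead-pid");
  if (heldMs > opts.staleMs) reasons.push("too-old");
  if (opts.maxHoldMs !== undefined && heldMs > opts.maxHoldMs) {
    reasons.push("max-hold-exceeded");
  }
  return { stale: reasons.length > 0, reasons };
}
```

Under these assumptions, a lock held for 10 minutes by a live PID is not stale when only `staleMs` is checked, but becomes reclaimable once `maxHoldMs` is passed through.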

* fix(session-lock): add dead-PID fast-path before retry loop

Adds a fast-path check at the top of acquireSessionWriteLock:
if the lock file's owner PID is dead, remove it immediately
before entering the retry loop. This saves up to timeoutMs (60s)
of futile waiting when the previous lock holder has died.

The shouldReclaim callback already handles this case, but only
iteratively through the retry loop. The fast-path eliminates
that unnecessary delay.
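
The fast-path idea might look like the sketch below. The `LockStore` interface, the in-memory store, and the `acquireLock` signature are hypothetical and only illustrate the ordering (dead-owner check first, retry loop second); the real `acquireSessionWriteLock` works on lock files and waits between attempts. On Node.js, a liveness probe is commonly built on `process.kill(pid, 0)`; here it is injected as `isPidAlive`.

```typescript
interface LockRecord {
  pid: number;
}

// Hypothetical abstraction over the lock file.
interface LockStore {
  read(): LockRecord | undefined;
  remove(): void;
  tryCreate(pid: number): boolean;
}

// In-memory stand-in for a lock file, used only for illustration.
function createMemoryStore(initial?: LockRecord): LockStore {
  let current = initial;
  return {
    read: () => current,
    remove: () => {
      current = undefined;
    },
    tryCreate: (pid) => {
      if (current) return false;
      current = { pid };
      return true;
    },
  };
}

function acquireLock(
  store: LockStore,
  selfPid: number,
  isPidAlive: (pid: number) => boolean,
  maxAttempts = 3,
): { acquired: boolean; attempts: number } {
  // Fast path: clear a lock whose owner is already dead before any waiting.
  const existing = store.read();
  if (existing && !isPidAlive(existing.pid)) store.remove();

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (store.tryCreate(selfPid)) return { acquired: true, attempts: attempt };
    // A real loop would back off here and consult a reclaim callback.
  }
  return { acquired: false, attempts: maxAttempts };
}
```

This sketch ignores the race between reading and removing the lock, which a later commit in this series ("remove stale locks only when unchanged") appears to address.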

* fix(session-lock): enforce max hold during acquisition

* fix(session-lock): revalidate max hold safely

* fix(session-lock): honor holder max-hold policy

* fix(session-lock): keep cleanup from reclaiming live holders

* fix(session-lock): remove stale locks only when unchanged

* fix(session-lock): skip self-held max-hold reclaim

* fix(ci): refresh gateway protocol checks

---------

Co-authored-by: njuboy11 <njuboy11@users.noreply.github.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
(cherry picked from commit a1eb765)

* fix(embedded-runner): preserve provider errors on cleanup takeover (openclaw#84321)

Summary:
- The PR preserves provider-facing embedded-runner prompt errors when cleanup detects session takeover, keeps the takeover signal fatal for fallback, and adds focused regressions.
- PR surface: Source +52, Tests +92. Total +144 across 5 files.
- Reproducibility: yes. Source inspection shows current main can let cleanup takeover replace a prior prompt/p ... rror and can normalize a provider-looking takeover wrapper before fallback sees it as coordination failure.

Automerge notes:
- PR branch already contained follow-up commit before automerge: fix(embedded-runner): preserve takeover during fallback
- PR branch already contained follow-up commit before automerge: fix(clawsweeper): address review for automerge-openclaw-openclaw-8405…

Validation:
- ClawSweeper review passed for head 050c779.
- Required merge gates passed before the squash merge.

Prepared head SHA: 050c779
Review: openclaw#84321 (comment)

Co-authored-by: abnershang <abner.shang@gmail.com>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: takhoffman
Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com>
(cherry picked from commit 7fbca96)

* fix(agents): release embedded-attempt session lock on every exit path (openclaw#86427)

* fix(agents): release embedded-attempt session lock on every exit path

The embedded run controller acquires its session write lock eagerly at
creation and released it only inside the post-run cleanup block. An
exception thrown in post-prompt processing skipped that block, so the lock
leaked to the live gateway process until the watchdog reclaimed it and
later requests to the session failed with SessionWriteLockTimeoutError.

Add an idempotent dispose() to the lock controller and call it from the
run's outer finally so the eagerly-held lock is released on every exit
path. Normal/aborted/timed-out runs still hand the lock to
acquireForCleanup first, so dispose() is a no-op then (no double release).

Fixes openclaw#86014
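
A minimal sketch of the idempotent-dispose pattern described above, under assumed shapes. `dispose` and `acquireForCleanup` are named in the commit message; the `SessionLock` interface, `createLockController`, and `runAttempt` are hypothetical and much simpler than the embedded run controller.

```typescript
interface SessionLock {
  release(): void;
}

// Hypothetical controller: holds the eagerly acquired lock until it is either
// handed to the cleanup path or disposed.
function createLockController(lock: SessionLock) {
  let held: SessionLock | undefined = lock;
  return {
    // Cleanup takes ownership; afterwards dispose() has nothing left to release.
    acquireForCleanup(): SessionLock | undefined {
      const handedOver = held;
      held = undefined;
      return handedOver;
    },
    // Idempotent: releases at most once, and only if the lock is still held.
    dispose(): void {
      const toRelease = held;
      held = undefined;
      toRelease?.release();
    },
  };
}

function runAttempt(lock: SessionLock, body: () => void): void {
  const controller = createLockController(lock);
  try {
    body(); // may throw during post-prompt processing
    controller.acquireForCleanup()?.release(); // normal cleanup path
  } finally {
    controller.dispose(); // no-op when cleanup already took the lock
  }
}
```

Because the outer `finally` always runs, the lock is released exactly once whether `body` returns or throws.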

* fix: keep session lock teardown comment lean

* docs(changelog): note embedded session lock fix

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
(cherry picked from commit 32ddfc2)

* fix(agents): fence yield abort lock release

(cherry picked from commit 0fe7479)

* fix(agents): memoize session lock owner args

Memoize owner process argv lookups per PID during `cleanStaleLockFiles`, and yield between lock entries so startup cleanup does not monopolize the event loop while inspecting many session locks.

This keeps lock classification semantics unchanged while avoiding repeated synchronous process-args reads for lock clusters owned by the same PID, especially the Windows PowerShell path.

Fixes openclaw#86509.
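
The memoize-and-yield approach can be sketched as below. All names here (`LockEntry`, `ArgsReader`, `memoizeByPid`, `inspectLockOwners`) are hypothetical, and the argv reader is injected rather than reading real process arguments; the sketch only shows one lookup per PID plus a yield between entries.

```typescript
interface LockEntry {
  path: string;
  pid: number;
}

type ArgsReader = (pid: number) => string[] | undefined;

// Caches the (potentially slow, synchronous) argv lookup per PID.
function memoizeByPid(read: ArgsReader): ArgsReader {
  const cache = new Map<number, string[] | undefined>();
  return (pid) => {
    if (!cache.has(pid)) cache.set(pid, read(pid));
    return cache.get(pid);
  };
}

// Lets other queued work run between lock entries.
function yieldToEventLoop(): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, 0));
}

async function inspectLockOwners(
  locks: LockEntry[],
  read: ArgsReader,
): Promise<Array<{ path: string; args: string[] | undefined }>> {
  const readOnce = memoizeByPid(read);
  const results: Array<{ path: string; args: string[] | undefined }> = [];
  for (const lock of locks) {
    results.push({ path: lock.path, args: readOnce(lock.pid) });
    await yieldToEventLoop();
  }
  return results;
}
```

For a cluster of locks owned by the same PID, the underlying reader is called once instead of once per lock file.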

Verification:
- `git diff --check origin/main...HEAD`
- focused TSX harness against the current-main merge result: `session-lock memo regression harness passed`

Thanks @openperf.

Co-authored-by: openperf <16864032@qq.com>
(cherry picked from commit c430fcd)

* fix(diagnostics): recover orphaned session activity

Recover idle queued sessions whose diagnostic activity retained stale ownerless model or tool calls by classifying them as recoverable session.stuck after the usual recovery gates. Yield the event loop before stale session-lock process inspection so sync process lookup cannot monopolize lock contention paths.

Docs now describe the widened session.stuck telemetry contract for recoverable stale bookkeeping, including ownerless activity. Thanks @samuelsoaress.

Refs openclaw#84903.

Co-authored-by: samuelsoaress <samuelsoares177778@gmail.com>
(cherry picked from commit 286964c)

* [FORK][openclaw#86584] gate owned-write publish on pre-append fingerprint (fixes openclaw#86572)

Carries unmerged upstream PR openclaw#86584 (HEAD d79a3b4) onto the boon 5.18 base
as the same-lane EmbeddedAttemptSessionTakeoverError fence fix for long cron
turns. Fails closed: an external mutation before pi's append fails the trust
gate and still trips the fence (verified by the PR's 303-line test suite incl.
the mixed-interleave negative test).

Backfills base symbols openclaw#86584 assumes (introduced upstream between 5.18 and the
PR base, not carried by the 9 merged race-fix picks):
- session-lock.ts: MAX_BENIGN_SESSION_FENCE_{ADVANCE,REWRITE,REWRITE_RESULT}_BYTES,
  MAX_SAFE_FILE_OFFSET, TRANSCRIPT_ONLY_OPENCLAW_ASSISTANT_MODELS,
  SessionFileFenceSnapshot type, fenceSnapshot state var, ActiveWriteLockState
  type + activeWriteLock store fix (reuse nested writes via {active:true}),
  node:util + string-normalization imports.
- transcript-append.ts: wrap appendSessionTranscriptMessage in
  runWithOwnedSessionTranscriptWriteLock so low-level appends acquire the
  owned-context lock.
- test import fixes (appendSessionTranscriptMessage, withOwned/bindOwned, __testing).

Drop when upstream merges openclaw#86584.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [FORK][openclaw#86584] wire owned-transcript-write context + typecheck cleanup

CRITICAL: wrap promptActiveSession in withOwnedSessionTranscriptWrites and bind
onBlockReply/onBlockReplyFlush to the owned context in attempt.ts. Without this,
pi's own transcript appends during a prompt are NOT recorded as owned, so the
fence trips on them (the exact takeover the backport is meant to prevent). This
wiring is an intermediate-base feature (between 5.18 and openclaw#84250's base) the merged
picks didn't carry. Tests passed before only because they set the context manually.

Also: add releaseHeldLockForAbort to the controller type; drop incidental non-fence
suppressAssistantErrorPersistence passes; remove dead async benign-rewrite cluster
(sessionFence{Advance,Rewrite}IsBenign + readAppendedSessionFileText +
lineMatchesLinearTranscriptMigration + helpers) — our openclaw#84250-based assertSessionFileFence
uses the sync owned-write path, so the async benign-detection variants are unreachable.
tsgo core: 0 errors. 384 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [FORK][openclaw#86584] address codex review: prefix-validate benign advance + preserve provider error

Finding 2 (masking gap, P2): sessionFenceAdvanceIsBenignSync only validated the
APPENDED bytes, so a writer that rewrote the existing prefix AND appended a benign
delivery-mirror/gateway-injected line could be laundered as an owned advance —
masking a genuine external takeover (silent message loss). Now fail closed unless
the current prefix is byte-identical to the trusted readSessionFileFenceSnapshot
text (readSessionFilePrefixSync); absent snapshot text => not benign.
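
A sketch of the fail-closed prefix check, assuming the snapshot and current file contents are available as strings decoded the same way (so string equality stands in for byte equality). `advanceIsBenign` and its parameters are hypothetical names, not the fork's `sessionFenceAdvanceIsBenignSync`.

```typescript
// Returns true only when the file grew by appending to an unchanged prefix
// AND the appended text passes the caller's benign check.
function advanceIsBenign(
  snapshotText: string | undefined,
  currentText: string,
  appendedIsBenign: (appended: string) => boolean,
): boolean {
  if (snapshotText === undefined) return false; // no trusted snapshot: not benign
  if (currentText.length < snapshotText.length) return false; // file shrank
  if (currentText.slice(0, snapshotText.length) !== snapshotText) return false; // prefix was rewritten
  return appendedIsBenign(currentText.slice(snapshotText.length));
}
```

Validating the prefix before the appended lines means a writer that rewrites earlier content cannot be accepted just because its final line looks benign.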

Finding 1 (provider-error masking, P2): wrappedStreamFn's finally let a
reacquireAfterPrompt() takeover error mask the original provider error when the
stream itself threw. Now only surface the reacquire error when the stream
succeeded; otherwise preserve the original failure.

tsgo core: 0 errors. 384 tests pass (benign-advance acceptance + external-mutation
rejection both green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(release): 2026.5.18-boon.1 — session-takeover hardening (boon fleet build)

Version bump + CHANGELOG for the fork build. Also fixes a backport test-import
gap: attempt.test.ts referenced `attemptTesting` (the __testing export) without
importing it. Full project typecheck (tsgo -b tsconfig.projects.json): 0 errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(ci): no-unsafe-finally in wrappedStreamFn + drop collateral protocol/test churn

- wrappedStreamFn: restructure provider-error-preservation without a throw inside
  finally (oxlint no-unsafe-finally). Same semantics: always reacquire; prefer the
  original stream error over a reacquire takeover error; surface reacquire error
  only when the stream succeeded.
- Revert src/gateway/server-methods/agent.test.ts + GatewayModels.swift to the 5.18
  baseline: the openclaw#85764 cherry-pick conflict-resolution had pulled in openclaw#85256-era
  internal-session-effect tests + protocol fields whose implementation isn't in this
  backport, breaking checks-node-agentic-gateway-methods + checks-fast-bundled-protocol.
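
The error-precedence rule from the first bullet (always reacquire, prefer the original stream error, never throw from inside `finally`) can be sketched like this. `runStreamThenReacquire` and its parameters are hypothetical; only `reacquireAfterPrompt` echoes a name used in this commit series, and the actual `wrappedStreamFn` may differ.

```typescript
async function runStreamThenReacquire<T>(
  stream: () => Promise<T>,
  reacquireAfterPrompt: () => Promise<void>,
): Promise<T> {
  let value: T | undefined;
  let streamError: unknown;
  let streamFailed = false;
  try {
    value = await stream();
  } catch (error) {
    streamFailed = true;
    streamError = error;
  }

  // Always reacquire, but capture any failure instead of throwing from a finally block.
  let reacquireError: unknown;
  let reacquireFailed = false;
  try {
    await reacquireAfterPrompt();
  } catch (error) {
    reacquireFailed = true;
    reacquireError = error;
  }

  if (streamFailed) throw streamError; // the original provider failure wins
  if (reacquireFailed) throw reacquireError; // surfaced only when the stream succeeded
  return value as T;
}
```

Recording both outcomes and deciding afterwards keeps the same semantics as the earlier `finally`-based version while staying within the `no-unsafe-finally` lint rule.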

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: remove vestigial onAssistantErrorMessagePersisted option decls

Address cubic P2 review (PR #2): the option was declared on the guard
and guard-wrapper option types but never forwarded or invoked, so any
provided callback was silently ignored. The companion error-suppression
feature (suppressAssistantErrorPersistence + the agent-runner/followup
caller chain) is deliberately scoped OUT of this 5.18 backport, so the
decls were dead plumbing left over from a cherry-pick. Remove them to
keep the option surface honest; the load-bearing beforeMessagePersist
fence checkpoint (openclaw#86572) is retained.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Yao <364939526@qq.com>
Co-authored-by: xiaotian <tianxiaochannel@gmail.com>
Co-authored-by: 狼哥 <hanwanlonga@gmail.com>
Co-authored-by: njuboy <njuboy11@gmail.com>
Co-authored-by: njuboy11 <njuboy11@users.noreply.github.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: abnershang <abner.shang@gmail.com>
Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com>
Co-authored-by: Chunyue Wang <80630709+openperf@users.noreply.github.com>
Co-authored-by: openperf <16864032@qq.com>
Co-authored-by: Samuel Soares da Silva <samuelsoares177778@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
…penclaw#84321)

Summary:
- The PR preserves provider-facing embedded-runner prompt errors when cleanup detects session takeover, keeps the takeover signal fatal for fallback, and adds focused regressions.
- PR surface: Source +52, Tests +92. Total +144 across 5 files.
- Reproducibility: yes. Source inspection shows current main can let cleanup takeover replace a prior prompt/p ... rror and can normalize a provider-looking takeover wrapper before fallback sees it as coordination failure.

Automerge notes:
- PR branch already contained follow-up commit before automerge: fix(embedded-runner): preserve takeover during fallback
- PR branch already contained follow-up commit before automerge: fix(clawsweeper): address review for automerge-openclaw-openclaw-8405…

Validation:
- ClawSweeper review passed for head 050c779.
- Required merge gates passed before the squash merge.

Prepared head SHA: 050c779
Review: openclaw#84321 (comment)

Co-authored-by: abnershang <abner.shang@gmail.com>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: takhoffman
Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com>
SYU8384 pushed a commit to SYU8384/openclaw that referenced this pull request Jun 3, 2026
…penclaw#84321)

sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
…penclaw#84321)

Labels

agents Agent runtime and tooling clawsweeper:automerge Maintainer opted this PR into bounded ClawSweeper-reviewed automerge clawsweeper Tracked by ClawSweeper automation merge-risk: 🚨 auth-provider 🚨 May break OAuth, tokens, provider routing, model choice, or credentials. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P2 Normal backlog priority with limited blast radius. proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: S status: 🚀 automerge armed This PR is in ClawSweeper's automerge lane.
