Skip to content

2026.5.18-boon.1 — session-takeover hardening#2

Merged
augusteo merged 15 commits into
boonfrom
boon-takeover-hardening-518
May 29, 2026
Merged

2026.5.18-boon.1 — session-takeover hardening#2
augusteo merged 15 commits into
boonfrom
boon-takeover-hardening-518

Conversation

@augusteo

@augusteo augusteo commented May 29, 2026

Copy link
Copy Markdown

Fork fleet build atop upstream v2026.5.18 to stop the recurring EmbeddedAttemptSessionTakeoverError → SIGABRT crash class on the Mac Minis (gandalf hit it 5×/4 days).

What

Fork hardening (internal codex review)

  • Benign session-fence advance now fails closed unless the fenced prefix is byte-identical to the trusted snapshot — prevents a prefix-rewrite + benign-append from masking a real takeover (silent message loss).
  • A reacquire takeover error no longer masks the original provider error.

Validation

Deploy

Pinned npm i -g install on each Mac Mini; soak watching crash count and silent message loss; rollback = npm i -g openclaw@2026.5.18.

🤖 Generated with Claude Code


Summary by cubic

Forked build 2026.5.18-boon.1 hardens session takeover handling to stop the recurring EmbeddedAttemptSessionTakeoverError → SIGABRT on Mac Minis and reduces silent message loss. Backports key session-lock fixes and adds a stricter fence with owned-write wiring.

  • Bug Fixes
    • Backported 9 upstream session-lock race fixes post-5.18; carried fix(agents): gate owned-write publish on pre-append fingerprint (#86572) openclaw/openclaw#86584 (pre-append fingerprint gate) with owned transcript write wiring; skipped Fix embedded session file ownership race openclaw/openclaw#87159 due to risk.
    • Tightened the session fence: benign advances now require a byte-identical prefix to the trusted snapshot, blocking prefix-rewrite + benign-append laundering.
    • Preserved original provider errors when cleanup reacquire detects takeover; takeover still aborts fallback.
    • Improved write-lock lifecycle: release on all exits, bound post-prompt holds to compaction timeout, reclaim live holders that exceed maxHoldMs, add dead-PID fast path, and reduce contention with event-loop yields and PID-args memoization.
    • Model fallback now aborts on local coordination errors (embedded takeover, write-lock timeout) instead of trying every model.
    • Diagnostics: classify orphaned stale model/tool activity as recoverable session.stuck and release queued lanes; added openclaw.session.turn.created metric.
    • Gateway: reject public internal session-effect controls (sessionEffects, suppressPromptPersistence); updated Swift protocol to include these fields for backend-only use.

Written for commit 31d78b6. Summary will update on new commits.

Review in cubic

luyao618 and others added 13 commits May 28, 2026 20:37
Preserve provider fallback metadata when session coordination errors are nested under provider failures.

Co-authored-by: luyao618 <364939526@qq.com>
(cherry picked from commit 6a5a135)
…openclaw#84250)

Merged via squash.

Prepared head SHA: 33f88fe
Co-authored-by: tianxiaochannel-oss88 <272340815+tianxiaochannel-oss88@users.noreply.github.com>
Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
Reviewed-by: @jalehman

(cherry picked from commit 1b77145)
Fixes the embedded attempt session write-lock watchdog so the fallback max hold time follows the resolved compaction timeout plus the existing lock grace window, instead of inheriting the full run timeout.

Adds regression coverage for the helper and settled-compaction lock lifecycle, plus a changelog entry thanking @luoyanglang.

Verification:
- `pnpm test src/agents/session-write-lock.test.ts src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts`
- `pnpm check:changed` via Blacksmith Testbox `tbx_01ks8b6vn8se5cg1dfn3te3g47` / https://github.com/openclaw/openclaw/actions/runs/26301988670
- Autoreview clean: `/Users/steipete/Projects/agent-scripts/skills/autoreview/scripts/autoreview --mode branch --base origin/main`
- PR CI green on `79e8c5f1a637981d263c0268bf5666967ff4e778`: https://github.com/openclaw/openclaw/actions/runs/26302152844 and https://github.com/openclaw/openclaw/actions/runs/26302152798

Co-authored-by: luoyanglang <hanwanlonga@gmail.com>
(cherry picked from commit 46de078)
…uisition (openclaw#85764)

* fix(session-lock): enforce maxHoldMs in shouldReclaim during lock acquisition

- Adds optional maxHoldMs parameter to inspectLockPayload
- Inspect now marks locks as stale when held longer than maxHoldMs
- Passes maxHoldMs through inspectLockPayloadForSession
- acquireSessionWriteLock's shouldReclaim callback now passes maxHoldMs

This ensures that when a live process holds a lock for longer than
maxHoldMs (default 5min), other processes can reclaim it during
acquisition — matching the watchdog's existing enforcement.

Previously shouldReclaim only used staleMs (30min default), meaning
a lock held for 10+ minutes by a live PID would never be reclaimable,
causing 60s timeout failures and gateway freezes.

Closes openclaw#85762

* fix(session-lock): add dead-PID fast-path before retry loop

Adds a fast-path check at the top of acquireSessionWriteLock:
if the lock file's owner PID is dead, remove it immediately
before entering the retry loop. This saves up to timeoutMs (60s)
of futile waiting when the previous lock holder has died.

The shouldReclaim callback already handles this case, but only
iteratively through the retry loop. The fast-path eliminates
that unnecessary delay.

* fix(session-lock): enforce max hold during acquisition

* fix(session-lock): revalidate max hold safely

* fix(session-lock): honor holder max-hold policy

* fix(session-lock): keep cleanup from reclaiming live holders

* fix(session-lock): remove stale locks only when unchanged

* fix(session-lock): skip self-held max-hold reclaim

* fix(ci): refresh gateway protocol checks

---------

Co-authored-by: njuboy11 <njuboy11@users.noreply.github.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
(cherry picked from commit a1eb765)
…penclaw#84321)

Summary:
- The PR preserves provider-facing embedded-runner prompt errors when cleanup detects session takeover, keeps the takeover signal fatal for fallback, and adds focused regressions.
- PR surface: Source +52, Tests +92. Total +144 across 5 files.
- Reproducibility: yes. Source inspection shows current main can let cleanup takeover replace a prior prompt/p ... rror and can normalize a provider-looking takeover wrapper before fallback sees it as coordination failure.

Automerge notes:
- PR branch already contained follow-up commit before automerge: fix(embedded-runner): preserve takeover during fallback
- PR branch already contained follow-up commit before automerge: fix(clawsweeper): address review for automerge-openclaw-openclaw-8405…

Validation:
- ClawSweeper review passed for head 050c779.
- Required merge gates passed before the squash merge.

Prepared head SHA: 050c779
Review: openclaw#84321 (comment)

Co-authored-by: abnershang <abner.shang@gmail.com>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: takhoffman
Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com>
(cherry picked from commit 7fbca96)
…openclaw#86427)

* fix(agents): release embedded-attempt session lock on every exit path

The embedded run controller acquires its session write lock eagerly at
creation and released it only inside the post-run cleanup block. An
exception thrown in post-prompt processing skipped that block, so the lock
leaked to the live gateway process until the watchdog reclaimed it and
later requests to the session failed with SessionWriteLockTimeoutError.

Add an idempotent dispose() to the lock controller and call it from the
run's outer finally so the eagerly-held lock is released on every exit
path. Normal/aborted/timed-out runs still hand the lock to
acquireForCleanup first, so dispose() is a no-op then (no double release).

Fixes openclaw#86014

* fix: keep session lock teardown comment lean

* docs(changelog): note embedded session lock fix

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
(cherry picked from commit 32ddfc2)
Memoize owner process argv lookups per PID during `cleanStaleLockFiles`, and yield between lock entries so startup cleanup does not monopolize the event loop while inspecting many session locks.

This keeps lock classification semantics unchanged while avoiding repeated synchronous process-args reads for lock clusters owned by the same PID, especially the Windows PowerShell path.

Fixes openclaw#86509.

Verification:
- `git diff --check origin/main...HEAD`
- focused TSX harness against the current-main merge result: `session-lock memo regression harness passed`

Thanks @openperf.

Co-authored-by: openperf <16864032@qq.com>
(cherry picked from commit c430fcd)
Recover idle queued sessions whose diagnostic activity retained stale ownerless model or tool calls by classifying them as recoverable session.stuck after the usual recovery gates. Yield the event loop before stale session-lock process inspection so sync process lookup cannot monopolize lock contention paths.

Docs now describe the widened session.stuck telemetry contract for recoverable stale bookkeeping, including ownerless activity. Thanks @samuelsoaress.

Refs openclaw#84903.

Co-authored-by: samuelsoaress <samuelsoares177778@gmail.com>
(cherry picked from commit 286964c)
…rint (fixes openclaw#86572)

Carries unmerged upstream PR openclaw#86584 (HEAD d79a3b4) onto the boon 5.18 base
as the same-lane EmbeddedAttemptSessionTakeoverError fence fix for long cron
turns. Fails closed: an external mutation before pi's append fails the trust
gate and still trips the fence (verified by the PR's 303-line test suite incl.
the mixed-interleave negative test).

Backfills base symbols openclaw#86584 assumes (introduced upstream between 5.18 and the
PR base, not carried by the 9 merged race-fix picks):
- session-lock.ts: MAX_BENIGN_SESSION_FENCE_{ADVANCE,REWRITE,REWRITE_RESULT}_BYTES,
  MAX_SAFE_FILE_OFFSET, TRANSCRIPT_ONLY_OPENCLAW_ASSISTANT_MODELS,
  SessionFileFenceSnapshot type, fenceSnapshot state var, ActiveWriteLockState
  type + activeWriteLock store fix (reuse nested writes via {active:true}),
  node:util + string-normalization imports.
- transcript-append.ts: wrap appendSessionTranscriptMessage in
  runWithOwnedSessionTranscriptWriteLock so low-level appends acquire the
  owned-context lock.
- test import fixes (appendSessionTranscriptMessage, withOwned/bindOwned, __testing).

Drop when upstream merges openclaw#86584.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…k cleanup

CRITICAL: wrap promptActiveSession in withOwnedSessionTranscriptWrites and bind
onBlockReply/onBlockReplyFlush to the owned context in attempt.ts. Without this,
pi's own transcript appends during a prompt are NOT recorded as owned, so the
fence trips on them (the exact takeover the backport is meant to prevent). This
wiring is an intermediate-base feature (between 5.18 and openclaw#84250's base) the merged
picks didn't carry. Tests passed before only because they set the context manually.

Also: add releaseHeldLockForAbort to the controller type; drop incidental non-fence
suppressAssistantErrorPersistence passes; remove dead async benign-rewrite cluster
(sessionFence{Advance,Rewrite}IsBenign + readAppendedSessionFileText +
lineMatchesLinearTranscriptMigration + helpers) — our openclaw#84250-based assertSessionFileFence
uses the sync owned-write path, so the async benign-detection variants are unreachable.
tsgo core: 0 errors. 384 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dvance + preserve provider error

Finding 2 (masking gap, P2): sessionFenceAdvanceIsBenignSync only validated the
APPENDED bytes, so a writer that rewrote the existing prefix AND appended a benign
delivery-mirror/gateway-injected line could be laundered as an owned advance —
masking a genuine external takeover (silent message loss). Now fail closed unless
the current prefix is byte-identical to the trusted readSessionFileFenceSnapshot
text (readSessionFilePrefixSync); absent snapshot text => not benign.

Finding 1 (provider-error masking, P2): wrappedStreamFn's finally let a
reacquireAfterPrompt() takeover error mask the original provider error when the
stream itself threw. Now only surface the reacquire error when the stream
succeeded; otherwise preserve the original failure.

tsgo core: 0 errors. 384 tests pass (benign-advance acceptance + external-mutation
rejection both green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…leet build)

Version bump + CHANGELOG for the fork build. Also fixes a backport test-import
gap: attempt.test.ts referenced `attemptTesting` (the __testing export) without
importing it. Full project typecheck (tsgo -b tsconfig.projects.json): 0 errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 31d78b6a1b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread apps/shared/OpenClawKit/Sources/OpenClawProtocol/GatewayModels.swift Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found across 29 files

Reply with feedback, questions, or to request a fix.

Fix all with cubic | Re-trigger cubic

Comment thread apps/shared/OpenClawKit/Sources/OpenClawProtocol/GatewayModels.swift Outdated
Comment thread src/agents/session-tool-result-guard-wrapper.ts
Comment thread src/agents/session-tool-result-guard.ts Outdated
Comment thread src/agents/pi-embedded-runner/run/attempt.session-lock.ts
augusteo and others added 2 commits May 29, 2026 08:11
…col/test churn

- wrappedStreamFn: restructure provider-error-preservation without a throw inside
  finally (oxlint no-unsafe-finally). Same semantics: always reacquire; prefer the
  original stream error over a reacquire takeover error; surface reacquire error
  only when the stream succeeded.
- Revert src/gateway/server-methods/agent.test.ts + GatewayModels.swift to the 5.18
  baseline: the openclaw#85764 cherry-pick conflict-resolution had pulled in openclaw#85256-era
  internal-session-effect tests + protocol fields whose implementation isn't in this
  backport, breaking checks-node-agentic-gateway-methods + checks-fast-bundled-protocol.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address cubic P2 review (PR #2): the option was declared on the guard
and guard-wrapper option types but never forwarded or invoked, so any
provided callback was silently ignored. The companion error-suppression
feature (suppressAssistantErrorPersistence + the agent-runner/followup
caller chain) is deliberately scoped OUT of this 5.18 backport, so the
decls were dead plumbing left over from a cherry-pick. Remove them to
keep the option surface honest; the load-bearing beforeMessagePersist
fence checkpoint (openclaw#86572) is retained.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@augusteo augusteo merged commit a50a6b2 into boon May 29, 2026
74 of 83 checks passed
@augusteo augusteo deleted the boon-takeover-hardening-518 branch May 29, 2026 17:01
insomnius added a commit that referenced this pull request Jun 3, 2026
Bump version to capture PR #4 (bundled Sentry monitor) on top of
boon.1 (session-takeover hardening, PR #2). Base = stock 2026.5.18.

ENG-12565.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants