fix(agents): release yield abort session lock by anyech · Pull Request #86455 · openclaw/openclaw

anyech · 2026-05-25T11:29:19Z

Summary

Release the embedded attempt session write lock before the sessions_yield abort cleanup path waits for session events and rewrites the yielded-parent transcript artifacts.

The sessions_yield abort path already has a bounded settle wait. After that bounded wait, the parent run can still hold the coarse embedded-attempt transcript lock while it drains session events and then performs the yield cleanup write. If a subagent completion callback tries to write back to the yielded parent at that point, it can contend on the same parent transcript lock and time out with SessionWriteLockTimeoutError.

This change adds a narrow defensive release for that abort-cleanup path so the subsequent yield cleanup write reacquires a short lock through the existing withSessionWriteLock path instead of keeping the coarse attempt lock held across the wait/drain boundary.

What changed

Add releaseHeldLockForAbort() to EmbeddedAttemptSessionLockController.
- Idempotent if no lock is currently held.
- Releases only the currently held embedded-attempt lock.
- Does not acquire a new lock and does not introduce any unbounded wait.
Call releaseHeldLockForAbort() in the sessions_yield abort cleanup path after the existing bounded settle wait and before waitForSessionEvents(...) / yield artifact cleanup writes.
Add a focused regression test proving abort cleanup can release the coarse lock and then reacquire a short cleanup lock for the yield transcript rewrite.

Why this is separate from #85716

#85716 overlaps the same failure cluster, but its sessions_yield change waits for the original settle promise after timeout. That may help one path, but it can also become unbounded if the settle promise is the stuck operation.

This PR keeps the existing bounded wait behavior and fixes the narrower lock-lifecycle hazard: do not hold the embedded attempt's coarse parent transcript lock while draining/writing the yield-abort cleanup path.

It intentionally does not implement a durable completion inbox, longwork ownership API, or broader subagent reporter architecture. Those are separate design tracks.

Validation

Validation was run in a disposable source checkout. No live Gateway, watched config, production runtime state, or live channel was used.

Passed:

git diff --check

Passed:

OPENCLAW_TEST_FAST=1 OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs run --config test/vitest/vitest.agents-pi-embedded.config.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts

Result:

Test Files  1 passed (1)
Tests       33 passed (33)

Passed:

node scripts/run-oxlint.mjs \
  src/agents/pi-embedded-runner/run/attempt.session-lock.ts \
  src/agents/pi-embedded-runner/run/attempt.ts \
  src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts

Result:

Found 0 warnings and 0 errors.

Real behavior proof

Behavior or issue addressed: sessions_yield abort cleanup should not keep the parent transcript's coarse embedded-attempt lock held while later session-event waits and yield cleanup writes run. Holding that lock can cause child completion callbacks targeting the yielded parent to time out on the parent .jsonl.lock.
Real environment tested: local OpenClaw checkout at PR head 7463aac49197, running a runtime helper against the actual createEmbeddedAttemptSessionLockController() implementation with a temporary session transcript. No production Gateway, watched config, or live channel was used.
Exact steps or command run after this patch:

PROOF_HEAD=$(git rev-parse --short=12 HEAD) pnpm exec tsx /tmp/openclaw-pr86455-proof.mts

Evidence after fix: copied terminal output from the runtime helper, outside Vitest. It imports the real lock controller, creates a temporary session transcript, releases the held abort lock, then performs the yield cleanup write through withSessionWriteLock().

OpenClaw PR #86455 runtime helper proof
head=7463aac49197
scenario=sessions_yield abort cleanup releases coarse lock before cleanup write
event=acquire:coarse-attempt:session
event=release:coarse-attempt
event=acquire:yield-cleanup:session
event=write:yield-cleanup
event=release:yield-cleanup
result=PASS

Observed result after fix: abort cleanup releases the coarse embedded-attempt session lock before cleanup work, then the cleanup write reacquires and releases a separate short session lock. That removes the lock contention window where child completion callbacks could time out on the parent session lock.
What was not tested: no live Gateway or live channel reproduction was run. The runtime-helper proof exercises the real lock lifecycle; the focused Vitest regression and CI are supplemental.

Risk notes

releaseHeldLockForAbort() intentionally does less than releaseForPrompt(): it does not refresh the session-file fence or mark a normal prompt-release transition. That is deliberate because this path is abort cleanup, and the subsequent transcript rewrite already goes through withSessionWriteLock() and the existing reacquire/fence path.
If maintainers prefer not to expose another controller method, this can be refactored into a private helper shared by the existing release paths. I kept the method explicit to keep the abort-cleanup semantics readable.

Maintainer proof refresh

Behavior addressed: sessions_yield abort cleanup releases the coarse embedded attempt session lock before waiting for session events and reacquiring the short cleanup write lock.
Real environment tested: local Node 24.15.0 focused embedded-attempt session-lock tests.
Exact steps or command run after this patch: CI=1 OPENCLAW_VITEST_NO_OUTPUT_TIMEOUT_MS=120000 fnm exec --using v24.15.0 -- node scripts/run-vitest.mjs run src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts --reporter=dot --pool=forks --no-file-parallelism
Evidence after fix: 2 files, 70 tests passed.
Observed result after fix: abort cleanup no longer reuses the held coarse attempt lock, and transcript mutation still happens under a reacquired session write lock.
What was not tested: a live multi-agent sessions_yield abort race; the focused lock lifecycle regression covers the implicated ordering.

clawsweeper · 2026-05-25T11:32:09Z

Codex review: needs maintainer review before merge. Reviewed May 26, 2026, 9:18 PM ET / 01:18 UTC.

Summary
The PR adds an abort-only embedded attempt lock release before sessions_yield abort cleanup drains session events and rewrites yield transcript artifacts, plus focused session-lock tests.

PR surface: Source +9, Tests +41. Total +50 across 3 files.

Reproducibility: yes. for the focused lock lifecycle: current main keeps the reacquired coarse lock through the abort cleanup drain/write path, and the PR proof exercises the real controller release/reacquire sequence. I did not run a live multi-agent Gateway race.

Review metrics: none identified.

Merge readiness
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

The PR intentionally releases the parent session write lock earlier in the sessions_yield abort path; focused fence/reacquire tests cover the intended lifecycle, but no live multi-agent Gateway race proof was run.

Maintainer options:

Accept the focused lifecycle proof (recommended)
Maintainers can land after required checks if they accept the controller-level proof and regression tests as sufficient for this abort-lock race.
Request live race proof first
Maintainers can ask for a live multi-agent sessions_yield abort/completion run if the remaining race concern needs transport-level evidence before merge.

Next step before merge
No automated repair is indicated because there are no discrete findings; the next action is maintainer review of the intentional session-state lifecycle change and live-proof threshold.

Security
Cleared: No security or supply-chain concern found; the diff only touches internal TypeScript session-lock/runtime code and colocated tests.

Review details

Best possible solution:

Land the narrow abort-path lock release after maintainer acceptance of the session-state lifecycle tradeoff and current checks, or request one live multi-agent race proof if that evidence is required before merge.

Do we have a high-confidence way to reproduce the issue?

Yes for the focused lock lifecycle: current main keeps the reacquired coarse lock through the abort cleanup drain/write path, and the PR proof exercises the real controller release/reacquire sequence. I did not run a live multi-agent Gateway race.

Is this the best way to solve the issue?

Yes, the patch is the narrowest maintainable shape I found: it reuses the existing fence-preserving release logic and lets cleanup writes reacquire via the established session write lock path. The safer alternative is not a code change but additional live proof before landing.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against d353dc128fab.

Label changes

Label changes:

add rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
remove rating: 🐚 platinum hermit: Current PR rating is rating: 🦞 diamond lobster, so this older rating label is no longer current.

Label justifications:

P1: The PR targets a reported subagent/session workflow failure where a held parent transcript lock can cause completion callback timeouts and missed delivery.
merge-risk: 🚨 session-state: The patch changes when an embedded attempt releases and reacquires the parent session transcript lock during abort cleanup, which can affect session ordering and takeover handling.
rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body supplies terminal output from a runtime helper using the real lock controller and a temporary transcript, which is sufficient for this non-visual lock-lifecycle change despite no live Gateway proof.
proof: sufficient: Contributor real behavior proof is sufficient. The PR body supplies terminal output from a runtime helper using the real lock controller and a temporary transcript, which is sufficient for this non-visual lock-lifecycle change despite no live Gateway proof.

Evidence reviewed

PR surface:

Source +9, Tests +41. Total +50 across 3 files.

View PR surface stats

Area	Files	Added	Removed	Net
Source	2	25	16	+9
Tests	1	41	0	+41
Docs	0	0	0	0
Config	0	0	0	0
Generated	0	0	0	0
Other	0	0	0	0
Total	3	66	16	+50

What I checked:

Current-main behavior: Current main waits for sessions_yield abort settle, then drains session events and rewrites yield artifacts without releasing the reacquired embedded-attempt lock first. (src/agents/pi-embedded-runner/run/attempt.ts:4423, d353dc128fab)
PR implementation: The PR inserts releaseHeldLockForAbort() after the bounded settle wait and before event drain plus yield cleanup writes, so cleanup later reacquires through withSessionWriteLock(). (src/agents/pi-embedded-runner/run/attempt.ts:4428, 73bd1b1f46f9)
Lock controller implementation: The new abort release shares the existing fence-preserving release path with releaseForPrompt(), keeping the session-file fence active before releasing the held coarse lock. (src/agents/pi-embedded-runner/run/attempt.session-lock.ts:752, 73bd1b1f46f9)
Regression coverage: The added tests prove abort cleanup releases the coarse lock before a short cleanup write and that the takeover fence remains active after the abort release. (src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts:110, 73bd1b1f46f9)
Proof supplied: The PR body includes copied terminal output from a runtime helper that imports the real lock controller, releases the abort-held lock, and then performs the cleanup write through withSessionWriteLock(); it also reports focused tests passing under Node 24.15.0. (73bd1b1f46f9)
Static diff check: The PR diff has no whitespace errors in the inspected range against current main. (73bd1b1f46f9)

Likely related people:

steipete: Peter Steinberger split embedded attempt helpers and is the heaviest recent contributor in the central runner files; he also authored the current PR head commit that fences this abort release. (role: recent area contributor; confidence: high; commits: 0d9e4f20d5f6, 73bd1b1f46f9; files: src/agents/pi-embedded-runner/run/attempt.ts, src/agents/pi-embedded-runner/run/attempt.sessions-yield.ts, src/agents/pi-embedded-runner/run/attempt.session-lock.ts)
jriff: Jacob Riff introduced the original sessions_yield cooperative turn-ending runner behavior in the feature commit that this PR now adjusts. (role: introduced behavior; confidence: medium; commits: 3fa91cd69d5d; files: src/agents/pi-embedded-runner/run/attempt.ts)
jalehman: Josh Lehman reviewed/coauthored the original sessions_yield feature and later worked on context-engine transcript maintenance adjacent to the session transcript lifecycle. (role: reviewer and adjacent owner; confidence: medium; commits: 3fa91cd69d5d, 751d5b7849ca; files: src/agents/pi-embedded-runner/run/attempt.ts)
vincentkoc: Vincent Koc made recent embedded runner changes involving lock-sensitive continuation behavior in the same central attempt file. (role: recent adjacent contributor; confidence: medium; commits: 9ba97ceaed89; files: src/agents/pi-embedded-runner/run/attempt.ts)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

clawsweeper · 2026-05-25T11:38:49Z

ClawSweeper PR egg

✨ Hatched: 🥚 common Cosmic Crabkin

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

Merged PRs are hatchable.
Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🥚 common.
Trait: watches the merge queue.
Image traits: location green-check meadow; accessory CI status badge; palette moss green and polished brass; mood determined; pose leaning over a miniature review desk; shell smooth pearl shell; lighting moonlit rim light; background smooth stones and checkmarks.
Share on X: post this hatch
Copy: My PR egg hatched a 🥚 common Cosmic Crabkin in ClawSweeper.

What is this egg doing here?

Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

martingarramon

Core approach is sound. releaseHeldLockForAbort() is correctly scoped — idempotent, doesn't reacquire, keeps the abort-cleanup path within the existing withSessionWriteLock() reacquire flow. The focused test suite (33/33 passing) covers the narrow lock-lifecycle hazard being addressed.

Two CI notes for the author:

check-dependencies failure is pre-existing main-churn. The flagged file ui/src/ui/control-ui-chunking.ts has nothing to do with this PR — it's a UI-subsystem dead file already present in the base branch. You can confirm this with git log -- ui/src/ui/control-ui-chunking.ts: the last commit touching it precedes this branch. No change needed on your side; this will resolve when the base branch is cleaned up or CI skips it.

Real behavior proof field name mismatch. The check runner is rejecting the steps field. Your PR body uses:

Exact steps run after this patch:

but the accepted list requires one of:

Exact steps or command run after this patch
Exact steps or command run after the patch
Exact steps or command run after fix
Steps run after the patch

Add or command run after to your existing label (e.g. **Exact steps or command run after this patch:**) and the proof gate should clear.

Release the embedded attempt session lock before sessions_yield abort cleanup waits for session events and rewrites yielded-parent artifacts. This keeps the existing bounded settle wait while preventing child completion callbacks from contending on the coarse parent transcript lock. Adds focused session-lock lifecycle coverage.

steipete · 2026-05-27T01:32:43Z

Maintainer proof refresh for head fe35122.

Behavior addressed: sessions_yield abort cleanup releases the coarse embedded attempt session lock before draining session events and before the short cleanup transcript write reacquires the session lock.
Real environment tested: local OpenClaw checkout, focused lock-controller tests and the failing runtime-config cache shard from CI.
Exact steps or command run after this patch:

node scripts/run-vitest.mjs run --config test/vitest/vitest.runtime-config.config.ts src/config/sessions.cache.test.ts --reporter=dot
node scripts/run-oxlint.mjs src/config/sessions/store-cache.ts src/config/sessions.cache.test.ts src/agents/pi-embedded-runner/run/attempt.session-lock.ts src/agents/pi-embedded-runner/run/attempt.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts
git diff --check
.agents/skills/autoreview/scripts/autoreview --mode local

Evidence after fix: sessions.cache.test.ts passed, 27 tests; oxlint passed; git diff --check passed; autoreview reported no accepted/actionable findings.
Observed result after fix: the abort release uses the same fence-preserving release helper as prompt release, and the CI-red cache shard no longer reparses a serialized store when the caller already passes the in-memory store object.
What was not tested: no live multi-agent Gateway race; the focused lifecycle regression and local cache-shard repro cover the changed code paths.

openclaw-barnacle Bot added agents Agent runtime and tooling size: XS triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 25, 2026

martingarramon reviewed May 25, 2026

View reviewed changes

steipete force-pushed the fix/sessions-yield-abort-lock-release branch from b9004b2 to 7463aac Compare May 25, 2026 12:32

anyech force-pushed the fix/sessions-yield-abort-lock-release branch from 7463aac to ccd48bb Compare May 25, 2026 21:16

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 25, 2026

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 25, 2026

steipete force-pushed the fix/sessions-yield-abort-lock-release branch from ccd48bb to 7625471 Compare May 25, 2026 23:25

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 25, 2026

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 25, 2026

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 25, 2026

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 25, 2026

steipete self-assigned this May 27, 2026

fix(agents): fence yield abort lock release

73bd1b1

steipete force-pushed the fix/sessions-yield-abort-lock-release branch from 7625471 to 73bd1b1 Compare May 27, 2026 01:10

openclaw-barnacle Bot added size: S and removed size: XS labels May 27, 2026

clawsweeper Bot added rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. labels May 27, 2026

fix(config): avoid reparsing cached session stores

fe35122

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 27, 2026

steipete merged commit 0fe7479 into openclaw:main May 27, 2026
100 of 102 checks passed

Haderach-Ram mentioned this pull request May 27, 2026

Ecosystem Digest — 2026-05-27 Haderach-Ram/openclaw-radar#20

Open

augusteo mentioned this pull request May 29, 2026

2026.5.18-boon.1 — session-takeover hardening getboon/openclaw#2

Merged

clawsweeper Bot mentioned this pull request Jun 11, 2026

Multi-agent orchestration is unstable: concurrent agents add/config overwrites, session-lock failures, and detached child work #43367

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(agents): release yield abort session lock#86455

fix(agents): release yield abort session lock#86455
steipete merged 3 commits into
openclaw:mainfrom
anyech:fix/sessions-yield-abort-lock-release

anyech commented May 25, 2026 •

edited by steipete

Loading

Uh oh!

clawsweeper Bot commented May 25, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented May 25, 2026 •

edited

Loading

Uh oh!

martingarramon left a comment

Uh oh!

steipete commented May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

anyech commented May 25, 2026 • edited by steipete Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Why this is separate from #85716

Validation

Real behavior proof

Risk notes

Maintainer proof refresh

Uh oh!

clawsweeper Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clawsweeper Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Hatch command

Uh oh!

martingarramon left a comment

Choose a reason for hiding this comment

Uh oh!

steipete commented May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

anyech commented May 25, 2026 •

edited by steipete

Loading

clawsweeper Bot commented May 25, 2026 •

edited

Loading

clawsweeper Bot commented May 25, 2026 •

edited

Loading