Skip to content

fix(agents): release yield abort session lock#86455

Merged
steipete merged 3 commits into
openclaw:mainfrom
anyech:fix/sessions-yield-abort-lock-release
May 27, 2026
Merged

fix(agents): release yield abort session lock#86455
steipete merged 3 commits into
openclaw:mainfrom
anyech:fix/sessions-yield-abort-lock-release

Conversation

@anyech

@anyech anyech commented May 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #85953.

Release the embedded attempt session write lock before the sessions_yield abort cleanup path waits for session events and rewrites the yielded-parent transcript artifacts.

The sessions_yield abort path already has a bounded settle wait. After that bounded wait, the parent run can still hold the coarse embedded-attempt transcript lock while it drains session events and then performs the yield cleanup write. If a subagent completion callback tries to write back to the yielded parent at that point, it can contend on the same parent transcript lock and time out with SessionWriteLockTimeoutError.

This change adds a narrow defensive release for that abort-cleanup path so the subsequent yield cleanup write reacquires a short lock through the existing withSessionWriteLock path instead of keeping the coarse attempt lock held across the wait/drain boundary.

What changed

  • Add releaseHeldLockForAbort() to EmbeddedAttemptSessionLockController.
    • Idempotent if no lock is currently held.
    • Releases only the currently held embedded-attempt lock.
    • Does not acquire a new lock and does not introduce any unbounded wait.
  • Call releaseHeldLockForAbort() in the sessions_yield abort cleanup path after the existing bounded settle wait and before waitForSessionEvents(...) / yield artifact cleanup writes.
  • Add a focused regression test proving abort cleanup can release the coarse lock and then reacquire a short cleanup lock for the yield transcript rewrite.

Why this is separate from #85716

#85716 overlaps the same failure cluster, but its sessions_yield change waits for the original settle promise after timeout. That may help one path, but it can also become unbounded if the settle promise is the stuck operation.

This PR keeps the existing bounded wait behavior and fixes the narrower lock-lifecycle hazard: do not hold the embedded attempt's coarse parent transcript lock while draining/writing the yield-abort cleanup path.

It intentionally does not implement a durable completion inbox, longwork ownership API, or broader subagent reporter architecture. Those are separate design tracks.

Validation

Validation was run in a disposable source checkout. No live Gateway, watched config, production runtime state, or live channel was used.

Passed:

git diff --check

Passed:

OPENCLAW_TEST_FAST=1 OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs run --config test/vitest/vitest.agents-pi-embedded.config.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts

Result:

Test Files  1 passed (1)
Tests       33 passed (33)

Passed:

node scripts/run-oxlint.mjs \
  src/agents/pi-embedded-runner/run/attempt.session-lock.ts \
  src/agents/pi-embedded-runner/run/attempt.ts \
  src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts

Result:

Found 0 warnings and 0 errors.

Real behavior proof

  • Behavior or issue addressed: sessions_yield abort cleanup should not keep the parent transcript's coarse embedded-attempt lock held while later session-event waits and yield cleanup writes run. Holding that lock can cause child completion callbacks targeting the yielded parent to time out on the parent .jsonl.lock.
  • Real environment tested: local OpenClaw checkout at PR head 7463aac49197, running a runtime helper against the actual createEmbeddedAttemptSessionLockController() implementation with a temporary session transcript. No production Gateway, watched config, or live channel was used.
  • Exact steps or command run after this patch:
PROOF_HEAD=$(git rev-parse --short=12 HEAD) pnpm exec tsx /tmp/openclaw-pr86455-proof.mts
  • Evidence after fix: copied terminal output from the runtime helper, outside Vitest. It imports the real lock controller, creates a temporary session transcript, releases the held abort lock, then performs the yield cleanup write through withSessionWriteLock().
OpenClaw PR #86455 runtime helper proof
head=7463aac49197
scenario=sessions_yield abort cleanup releases coarse lock before cleanup write
event=acquire:coarse-attempt:session
event=release:coarse-attempt
event=acquire:yield-cleanup:session
event=write:yield-cleanup
event=release:yield-cleanup
result=PASS
  • Observed result after fix: abort cleanup releases the coarse embedded-attempt session lock before cleanup work, then the cleanup write reacquires and releases a separate short session lock. That removes the lock contention window where child completion callbacks could time out on the parent session lock.
  • What was not tested: no live Gateway or live channel reproduction was run. The runtime-helper proof exercises the real lock lifecycle; the focused Vitest regression and CI are supplemental.

Risk notes

  • releaseHeldLockForAbort() intentionally does less than releaseForPrompt(): it does not refresh the session-file fence or mark a normal prompt-release transition. That is deliberate because this path is abort cleanup, and the subsequent transcript rewrite already goes through withSessionWriteLock() and the existing reacquire/fence path.
  • If maintainers prefer not to expose another controller method, this can be refactored into a private helper shared by the existing release paths. I kept the method explicit to keep the abort-cleanup semantics readable.

Maintainer proof refresh

Behavior addressed: sessions_yield abort cleanup releases the coarse embedded attempt session lock before waiting for session events and reacquiring the short cleanup write lock.
Real environment tested: local Node 24.15.0 focused embedded-attempt session-lock tests.
Exact steps or command run after this patch: CI=1 OPENCLAW_VITEST_NO_OUTPUT_TIMEOUT_MS=120000 fnm exec --using v24.15.0 -- node scripts/run-vitest.mjs run src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts --reporter=dot --pool=forks --no-file-parallelism
Evidence after fix: 2 files, 70 tests passed.
Observed result after fix: abort cleanup no longer reuses the held coarse attempt lock, and transcript mutation still happens under a reacquired session write lock.
What was not tested: a live multi-agent sessions_yield abort race; the focused lock lifecycle regression covers the implicated ordering.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: XS triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 25, 2026
@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed May 26, 2026, 9:18 PM ET / 01:18 UTC.

Summary
The PR adds an abort-only embedded attempt lock release before sessions_yield abort cleanup drains session events and rewrites yield transcript artifacts, plus focused session-lock tests.

PR surface: Source +9, Tests +41. Total +50 across 3 files.

Reproducibility: yes. for the focused lock lifecycle: current main keeps the reacquired coarse lock through the abort cleanup drain/write path, and the PR proof exercises the real controller release/reacquire sequence. I did not run a live multi-agent Gateway race.

Review metrics: none identified.

Merge readiness
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

  • The PR intentionally releases the parent session write lock earlier in the sessions_yield abort path; focused fence/reacquire tests cover the intended lifecycle, but no live multi-agent Gateway race proof was run.

Maintainer options:

  1. Accept the focused lifecycle proof (recommended)
    Maintainers can land after required checks if they accept the controller-level proof and regression tests as sufficient for this abort-lock race.
  2. Request live race proof first
    Maintainers can ask for a live multi-agent sessions_yield abort/completion run if the remaining race concern needs transport-level evidence before merge.

Next step before merge
No automated repair is indicated because there are no discrete findings; the next action is maintainer review of the intentional session-state lifecycle change and live-proof threshold.

Security
Cleared: No security or supply-chain concern found; the diff only touches internal TypeScript session-lock/runtime code and colocated tests.

Review details

Best possible solution:

Land the narrow abort-path lock release after maintainer acceptance of the session-state lifecycle tradeoff and current checks, or request one live multi-agent race proof if that evidence is required before merge.

Do we have a high-confidence way to reproduce the issue?

Yes for the focused lock lifecycle: current main keeps the reacquired coarse lock through the abort cleanup drain/write path, and the PR proof exercises the real controller release/reacquire sequence. I did not run a live multi-agent Gateway race.

Is this the best way to solve the issue?

Yes, the patch is the narrowest maintainable shape I found: it reuses the existing fence-preserving release logic and lets cleanup writes reacquire via the established session write lock path. The safer alternative is not a code change but additional live proof before landing.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against d353dc128fab.

Label changes

Label changes:

  • add rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • remove rating: 🐚 platinum hermit: Current PR rating is rating: 🦞 diamond lobster, so this older rating label is no longer current.

Label justifications:

  • P1: The PR targets a reported subagent/session workflow failure where a held parent transcript lock can cause completion callback timeouts and missed delivery.
  • merge-risk: 🚨 session-state: The patch changes when an embedded attempt releases and reacquires the parent session transcript lock during abort cleanup, which can affect session ordering and takeover handling.
  • rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body supplies terminal output from a runtime helper using the real lock controller and a temporary transcript, which is sufficient for this non-visual lock-lifecycle change despite no live Gateway proof.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body supplies terminal output from a runtime helper using the real lock controller and a temporary transcript, which is sufficient for this non-visual lock-lifecycle change despite no live Gateway proof.
Evidence reviewed

PR surface:

Source +9, Tests +41. Total +50 across 3 files.

View PR surface stats
Area Files Added Removed Net
Source 2 25 16 +9
Tests 1 41 0 +41
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 3 66 16 +50

What I checked:

Likely related people:

  • steipete: Peter Steinberger split embedded attempt helpers and is the heaviest recent contributor in the central runner files; he also authored the current PR head commit that fences this abort release. (role: recent area contributor; confidence: high; commits: 0d9e4f20d5f6, 73bd1b1f46f9; files: src/agents/pi-embedded-runner/run/attempt.ts, src/agents/pi-embedded-runner/run/attempt.sessions-yield.ts, src/agents/pi-embedded-runner/run/attempt.session-lock.ts)
  • jriff: Jacob Riff introduced the original sessions_yield cooperative turn-ending runner behavior in the feature commit that this PR now adjusts. (role: introduced behavior; confidence: medium; commits: 3fa91cd69d5d; files: src/agents/pi-embedded-runner/run/attempt.ts)
  • jalehman: Josh Lehman reviewed/coauthored the original sessions_yield feature and later worked on context-engine transcript maintenance adjacent to the session transcript lifecycle. (role: reviewer and adjacent owner; confidence: medium; commits: 3fa91cd69d5d, 751d5b7849ca; files: src/agents/pi-embedded-runner/run/attempt.ts)
  • vincentkoc: Vincent Koc made recent embedded runner changes involving lock-sensitive continuation behavior in the same central attempt file. (role: recent adjacent contributor; confidence: medium; commits: 9ba97ceaed89; files: src/agents/pi-embedded-runner/run/attempt.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. labels May 25, 2026
@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

✨ Hatched: 🥚 common Cosmic Crabkin

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🥚 common.
Trait: watches the merge queue.
Image traits: location green-check meadow; accessory CI status badge; palette moss green and polished brass; mood determined; pose leaning over a miniature review desk; shell smooth pearl shell; lighting moonlit rim light; background smooth stones and checkmarks.
Share on X: post this hatch
Copy: My PR egg hatched a 🥚 common Cosmic Crabkin in ClawSweeper.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@martingarramon martingarramon left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Core approach is sound. releaseHeldLockForAbort() is correctly scoped — idempotent, doesn't reacquire, keeps the abort-cleanup path within the existing withSessionWriteLock() reacquire flow. The focused test suite (33/33 passing) covers the narrow lock-lifecycle hazard being addressed.

Two CI notes for the author:

check-dependencies failure is pre-existing main-churn. The flagged file ui/src/ui/control-ui-chunking.ts has nothing to do with this PR — it's a UI-subsystem dead file already present in the base branch. You can confirm this with git log -- ui/src/ui/control-ui-chunking.ts: the last commit touching it precedes this branch. No change needed on your side; this will resolve when the base branch is cleaned up or CI skips it.

Real behavior proof field name mismatch. The check runner is rejecting the steps field. Your PR body uses:

Exact steps run after this patch:

but the accepted list requires one of:

  • Exact steps or command run after this patch
  • Exact steps or command run after the patch
  • Exact steps or command run after fix
  • Steps run after the patch

Add or command run after to your existing label (e.g. **Exact steps or command run after this patch:**) and the proof gate should clear.

@steipete steipete force-pushed the fix/sessions-yield-abort-lock-release branch from b9004b2 to 7463aac Compare May 25, 2026 12:32
@openclaw-barnacle openclaw-barnacle Bot added triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels May 25, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 25, 2026
@anyech anyech force-pushed the fix/sessions-yield-abort-lock-release branch from 7463aac to ccd48bb Compare May 25, 2026 21:16
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 25, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 25, 2026
@steipete steipete force-pushed the fix/sessions-yield-abort-lock-release branch from ccd48bb to 7625471 Compare May 25, 2026 23:25
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 25, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 25, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 25, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 25, 2026
@steipete steipete self-assigned this May 27, 2026
Release the embedded attempt session lock before sessions_yield abort cleanup waits for session events and rewrites yielded-parent artifacts.

This keeps the existing bounded settle wait while preventing child completion callbacks from contending on the coarse parent transcript lock.

Adds focused session-lock lifecycle coverage.
@steipete steipete force-pushed the fix/sessions-yield-abort-lock-release branch from 7625471 to 73bd1b1 Compare May 27, 2026 01:10
@clawsweeper clawsweeper Bot added rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. labels May 27, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 27, 2026
@steipete

Copy link
Copy Markdown
Contributor

Maintainer proof refresh for head fe35122.

Behavior addressed: sessions_yield abort cleanup releases the coarse embedded attempt session lock before draining session events and before the short cleanup transcript write reacquires the session lock.
Real environment tested: local OpenClaw checkout, focused lock-controller tests and the failing runtime-config cache shard from CI.
Exact steps or command run after this patch:

node scripts/run-vitest.mjs run --config test/vitest/vitest.runtime-config.config.ts src/config/sessions.cache.test.ts --reporter=dot
node scripts/run-oxlint.mjs src/config/sessions/store-cache.ts src/config/sessions.cache.test.ts src/agents/pi-embedded-runner/run/attempt.session-lock.ts src/agents/pi-embedded-runner/run/attempt.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts
git diff --check
.agents/skills/autoreview/scripts/autoreview --mode local

Evidence after fix: sessions.cache.test.ts passed, 27 tests; oxlint passed; git diff --check passed; autoreview reported no accepted/actionable findings.
Observed result after fix: the abort release uses the same fence-preserving release helper as prompt release, and the CI-red cache shard no longer reparses a serialized store when the caller already passes the in-memory store object.
What was not tested: no live multi-agent Gateway race; the focused lifecycle regression and local cache-shard repro cover the changed code paths.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. size: S status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: sessions_yield can leave parent session transcript lock held, causing subagent completion callback timeout

3 participants