fix(agents): release yield abort session lock#86455
Conversation
|
Codex review: needs maintainer review before merge. Reviewed May 26, 2026, 9:18 PM ET / 01:18 UTC. Summary PR surface: Source +9, Tests +41. Total +50 across 3 files. Reproducibility: yes. for the focused lock lifecycle: current main keeps the reacquired coarse lock through the abort cleanup drain/write path, and the PR proof exercises the real controller release/reacquire sequence. I did not run a live multi-agent Gateway race. Review metrics: none identified. Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land the narrow abort-path lock release after maintainer acceptance of the session-state lifecycle tradeoff and current checks, or request one live multi-agent race proof if that evidence is required before merge. Do we have a high-confidence way to reproduce the issue? Yes for the focused lock lifecycle: current main keeps the reacquired coarse lock through the abort cleanup drain/write path, and the PR proof exercises the real controller release/reacquire sequence. I did not run a live multi-agent Gateway race. Is this the best way to solve the issue? Yes, the patch is the narrowest maintainable shape I found: it reuses the existing fence-preserving release logic and lets cleanup writes reacquire via the established session write lock path. The safer alternative is not a code change but additional live proof before landing. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against d353dc128fab. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +9, Tests +41. Total +50 across 3 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
ClawSweeper PR egg ✨ Hatched: 🥚 common Cosmic Crabkin Hatch commandComment Hatchability rules:
Rarity: 🥚 common. What is this egg doing here?
|
martingarramon
left a comment
There was a problem hiding this comment.
Core approach is sound. releaseHeldLockForAbort() is correctly scoped — idempotent, doesn't reacquire, keeps the abort-cleanup path within the existing withSessionWriteLock() reacquire flow. The focused test suite (33/33 passing) covers the narrow lock-lifecycle hazard being addressed.
Two CI notes for the author:
check-dependencies failure is pre-existing main-churn. The flagged file ui/src/ui/control-ui-chunking.ts has nothing to do with this PR — it's a UI-subsystem dead file already present in the base branch. You can confirm this with git log -- ui/src/ui/control-ui-chunking.ts: the last commit touching it precedes this branch. No change needed on your side; this will resolve when the base branch is cleaned up or CI skips it.
Real behavior proof field name mismatch. The check runner is rejecting the steps field. Your PR body uses:
Exact steps run after this patch:
but the accepted list requires one of:
Exact steps or command run after this patchExact steps or command run after the patchExact steps or command run after fixSteps run after the patch
Add or command run after to your existing label (e.g. **Exact steps or command run after this patch:**) and the proof gate should clear.
b9004b2 to
7463aac
Compare
7463aac to
ccd48bb
Compare
ccd48bb to
7625471
Compare
Release the embedded attempt session lock before sessions_yield abort cleanup waits for session events and rewrites yielded-parent artifacts. This keeps the existing bounded settle wait while preventing child completion callbacks from contending on the coarse parent transcript lock. Adds focused session-lock lifecycle coverage.
7625471 to
73bd1b1
Compare
|
Maintainer proof refresh for head fe35122. Behavior addressed: sessions_yield abort cleanup releases the coarse embedded attempt session lock before draining session events and before the short cleanup transcript write reacquires the session lock. Evidence after fix: sessions.cache.test.ts passed, 27 tests; oxlint passed; git diff --check passed; autoreview reported no accepted/actionable findings. |
Summary
Fixes #85953.
Release the embedded attempt session write lock before the
sessions_yieldabort cleanup path waits for session events and rewrites the yielded-parent transcript artifacts.The
sessions_yieldabort path already has a bounded settle wait. After that bounded wait, the parent run can still hold the coarse embedded-attempt transcript lock while it drains session events and then performs the yield cleanup write. If a subagent completion callback tries to write back to the yielded parent at that point, it can contend on the same parent transcript lock and time out withSessionWriteLockTimeoutError.This change adds a narrow defensive release for that abort-cleanup path so the subsequent yield cleanup write reacquires a short lock through the existing
withSessionWriteLockpath instead of keeping the coarse attempt lock held across the wait/drain boundary.What changed
releaseHeldLockForAbort()toEmbeddedAttemptSessionLockController.releaseHeldLockForAbort()in thesessions_yieldabort cleanup path after the existing bounded settle wait and beforewaitForSessionEvents(...)/ yield artifact cleanup writes.Why this is separate from #85716
#85716 overlaps the same failure cluster, but its
sessions_yieldchange waits for the original settle promise after timeout. That may help one path, but it can also become unbounded if the settle promise is the stuck operation.This PR keeps the existing bounded wait behavior and fixes the narrower lock-lifecycle hazard: do not hold the embedded attempt's coarse parent transcript lock while draining/writing the yield-abort cleanup path.
It intentionally does not implement a durable completion inbox, longwork ownership API, or broader subagent reporter architecture. Those are separate design tracks.
Validation
Validation was run in a disposable source checkout. No live Gateway, watched config, production runtime state, or live channel was used.
Passed:
Passed:
Result:
Passed:
Result:
Real behavior proof
sessions_yieldabort cleanup should not keep the parent transcript's coarse embedded-attempt lock held while later session-event waits and yield cleanup writes run. Holding that lock can cause child completion callbacks targeting the yielded parent to time out on the parent.jsonl.lock.7463aac49197, running a runtime helper against the actualcreateEmbeddedAttemptSessionLockController()implementation with a temporary session transcript. No production Gateway, watched config, or live channel was used.withSessionWriteLock().Risk notes
releaseHeldLockForAbort()intentionally does less thanreleaseForPrompt(): it does not refresh the session-file fence or mark a normal prompt-release transition. That is deliberate because this path is abort cleanup, and the subsequent transcript rewrite already goes throughwithSessionWriteLock()and the existing reacquire/fence path.Maintainer proof refresh
Behavior addressed:
sessions_yieldabort cleanup releases the coarse embedded attempt session lock before waiting for session events and reacquiring the short cleanup write lock.Real environment tested: local Node 24.15.0 focused embedded-attempt session-lock tests.
Exact steps or command run after this patch:
CI=1 OPENCLAW_VITEST_NO_OUTPUT_TIMEOUT_MS=120000 fnm exec --using v24.15.0 -- node scripts/run-vitest.mjs run src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts --reporter=dot --pool=forks --no-file-parallelismEvidence after fix: 2 files, 70 tests passed.
Observed result after fix: abort cleanup no longer reuses the held coarse attempt lock, and transcript mutation still happens under a reacquired session write lock.
What was not tested: a live multi-agent
sessions_yieldabort race; the focused lock lifecycle regression covers the implicated ordering.