fix(agents): release session lock on manual abort#88623
Conversation
|
Codex review: needs real behavior proof before merge. Reviewed May 31, 2026, 9:52 AM ET / 13:52 UTC. Summary PR surface: Source +20, Tests +44. Total +64 across 3 files. Reproducibility: yes. from source inspection: current main's Review metrics: none identified. Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Proof guidance:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Land one canonical abort/session-lock fix that releases the retained lock on every abort, keeps timeout abandonment timeout-only, and includes real abort-and-retry proof. Do we have a high-confidence way to reproduce the issue? Yes from source inspection: current main's Is this the best way to solve the issue? Partially: moving lock release outside the timeout-only branch is the right fix shape, and the helper matches the scoped test guidance. This branch should still gain real abort-and-retry proof or defer to the sibling run-level patch before merge. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 7dea2837565e. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +20, Tests +44. Total +64 across 3 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
bc9d9eb to
d88376e
Compare
70acd6d to
56fa542
Compare
…unsettled retained writes (#6) A turn timeout aborts the run, but releaseHeldLockForAbort delegated to the graceful fence release, whose waitForRetainedLockIdle is unbounded. A retained transcript write pinned behind a hung provider stream (one that ignores abort, e.g. a stalled streamGenerateContent) therefore held the session .jsonl lock until the maxHoldMs watchdog (5-30 min), and every turn in between failed with SessionWriteLockTimeoutError. releaseHeldLockForAbort now races the graceful path against the abort settle bound (OPENCLAW_EMBEDDED_ABORT_SETTLE_TIMEOUT_MS). When the bound expires the underlying file lock is force-released and the controller poisoned via takeoverDetected, so the aborted run cannot perform torn writes after losing ownership. Retained writes that settle within the bound keep the existing graceful fence semantics. Prod incident: tulgey#225 (membrane 2026-06-01..03). Completes the lock-leak family of openclaw#87278 / openclaw#88623 / openclaw#89811, which release on settled aborts but not on aborts a hung provider call never lets settle. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Verification
node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt-abort.test.ts src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts src/agents/embedded-agent-runner/run/attempt.subscription-cleanup.test.ts test/scripts/plugin-prerelease-test-plan.test.tspnpm check:test-typespnpm exec oxfmt --check --threads=1 src/agents/embedded-agent-runner/run/attempt.ts src/agents/embedded-agent-runner/run/attempt-abort.ts src/agents/embedded-agent-runner/run/attempt-abort.test.tsgit diff --check.agents/skills/autoreview/scripts/autoreview --mode branch --base origin/mainRefs #88600
Real behavior proof
Behavior addressed: manual embedded-attempt aborts release the session lock retained while the attempt owns the session, preventing later turns from getting stuck behind the abandoned lock.
Real environment tested: local source checkout on Node 24-era toolchain, plus GitHub PR CI on the branch head.
Exact steps or command run after this patch: focused Vitest for abort/session-lock cleanup,
pnpm check:test-types, oxfmt check,git diff --check, and branch autoreview.Evidence after fix:
attempt-abort.test.tsproves manual abort invokesreleaseHeldLockForAbort()and logs release failures without throwing; existing session-lock/subscription cleanup tests still pass.Observed result after fix: manual abort cleanup path now schedules the lock release independently from timeout-only abandonment marking.
What was not tested: live multi-agent queue reproduction against a real stuck local session lock.