Skip to content

fix(agents): release session lock on manual abort#88623

Merged
steipete merged 1 commit into
mainfrom
fix/manual-abort-session-lock
May 31, 2026
Merged

fix(agents): release session lock on manual abort#88623
steipete merged 1 commit into
mainfrom
fix/manual-abort-session-lock

Conversation

@steipete

@steipete steipete commented May 31, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Release the embedded attempt session lock on manual aborts through the same best-effort abort cleanup path used by timeout aborts.
  • Keep timeout abandonment marking scoped to timeout aborts, so manual aborts only release the retained lock.
  • Add focused regression coverage for manual abort lock release and release failures.

Verification

  • node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt-abort.test.ts src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts src/agents/embedded-agent-runner/run/attempt.subscription-cleanup.test.ts test/scripts/plugin-prerelease-test-plan.test.ts
  • pnpm check:test-types
  • pnpm exec oxfmt --check --threads=1 src/agents/embedded-agent-runner/run/attempt.ts src/agents/embedded-agent-runner/run/attempt-abort.ts src/agents/embedded-agent-runner/run/attempt-abort.test.ts
  • git diff --check
  • .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main

Refs #88600

Real behavior proof

Behavior addressed: manual embedded-attempt aborts release the session lock retained while the attempt owns the session, preventing later turns from getting stuck behind the abandoned lock.
Real environment tested: local source checkout on Node 24-era toolchain, plus GitHub PR CI on the branch head.
Exact steps or command run after this patch: focused Vitest for abort/session-lock cleanup, pnpm check:test-types, oxfmt check, git diff --check, and branch autoreview.
Evidence after fix: attempt-abort.test.ts proves manual abort invokes releaseHeldLockForAbort() and logs release failures without throwing; existing session-lock/subscription cleanup tests still pass.
Observed result after fix: manual abort cleanup path now schedules the lock release independently from timeout-only abandonment marking.
What was not tested: live multi-agent queue reproduction against a real stuck local session lock.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S maintainer Maintainer-authored PR labels May 31, 2026
@clawsweeper

clawsweeper Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed May 31, 2026, 9:52 AM ET / 13:52 UTC.

Summary
The branch extracts abort session-lock release into a helper, calls it for both manual and timeout embedded attempt aborts, and adds focused helper tests.

PR surface: Source +20, Tests +44. Total +64 across 3 files.

Reproducibility: yes. from source inspection: current main's abortRun releases the retained lock only in the timeout branch, while manual abort paths call it without timeout. I did not run a live stuck-lock reproduction during this read-only review.

Review metrics: none identified.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🐚 platinum hermit
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P2] Add redacted real behavior proof showing manual stop followed by immediate retry without waiting for stale lock timeout.
  • [P1] Ask maintainers to choose this PR or Release session lock on user abort #88622 as the canonical fix before landing.

Proof guidance:

  • [P1] Needs real behavior proof before merge: The PR body provides focused tests and CI-style checks, and explicitly says no live multi-agent queue reproduction was run; add redacted terminal/log/live output or a short manual-stop-then-retry recording before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • [P1] The branch still has mock-only proof: the PR body lists focused tests/CI and explicitly says no live multi-agent queue reproduction was run.
  • [P1] Release session lock on user abort #88622 is an open parallel fix for the same linked behavior with stronger run-level proof, so maintainers need one canonical landing path.
  • [P2] Abort lock-release ordering is session-state sensitive; the final patch must keep timeout abandoned-run marking timeout-only while releasing the retained lock for every abort.

Maintainer options:

  1. Choose the canonical proof-backed patch (recommended)
    Land either this PR after adding real abort-and-retry proof or the sibling proof-positive PR, not both, while preserving timeout-only abandonment marking.
  2. Accept helper-level proof intentionally
    Maintainers may land this helper-based patch with explicit acceptance that the real stuck-lock user path was not demonstrated on this branch.
  3. Close the duplicate after one lands
    If the sibling patch lands first, close this PR as superseded rather than carrying a second abort-lock implementation.

Next step before merge

  • [P1] Protected maintainer labeling, missing real behavior proof, and the open sibling PR make this a maintainer canonical-landing choice rather than an automated repair lane.

Security
Cleared: The diff only changes embedded-runner abort cleanup code and tests; it does not touch dependencies, workflows, secrets, permissions, or package execution surfaces.

Review details

Best possible solution:

Land one canonical abort/session-lock fix that releases the retained lock on every abort, keeps timeout abandonment timeout-only, and includes real abort-and-retry proof.

Do we have a high-confidence way to reproduce the issue?

Yes from source inspection: current main's abortRun releases the retained lock only in the timeout branch, while manual abort paths call it without timeout. I did not run a live stuck-lock reproduction during this read-only review.

Is this the best way to solve the issue?

Partially: moving lock release outside the timeout-only branch is the right fix shape, and the helper matches the scoped test guidance. This branch should still gain real abort-and-retry proof or defer to the sibling run-level patch before merge.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 7dea2837565e.

Label changes

Label changes:

  • add rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🐚 platinum hermit.
  • remove rating: 🦪 silver shellfish: Current PR rating is rating: 🧂 unranked krab, so this older rating label is no longer current.

Label justifications:

  • P1: The linked bug can leave agent turns blocked behind a leaked session lock after manual abort, which is an urgent broken agent workflow.
  • merge-risk: 🚨 session-state: The patch changes abort-time retained session-lock release ordering, which can affect whether later turns see stale, released, or mis-owned session state.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🐚 platinum hermit.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body provides focused tests and CI-style checks, and explicitly says no live multi-agent queue reproduction was run; add redacted terminal/log/live output or a short manual-stop-then-retry recording before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
Evidence reviewed

PR surface:

Source +20, Tests +44. Total +64 across 3 files.

View PR surface stats
Area Files Added Removed Net
Source 2 25 5 +20
Tests 1 44 0 +44
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 3 69 5 +64

What I checked:

Likely related people:

  • brokemac79: git blame attributes the current abortRun block and releaseHeldLockForAbort controller method to the stale-session runner lifecycle commit. (role: introduced behavior; confidence: high; commits: e8c7c933f8c1; files: src/agents/embedded-agent-runner/run/attempt.ts, src/agents/embedded-agent-runner/run/attempt.session-lock.ts)
  • Peter Steinberger: Recent current-main commits touched attempt.ts and adjacent subagent queue behavior, making him a likely routing candidate for the embedded-runner/session-state surface. (role: recent area contributor; confidence: medium; commits: f24a13879095, 1e54e908e2e4; files: src/agents/embedded-agent-runner/run/attempt.ts, src/agents/subagent-delivery-state.ts, src/agents/subagent-registry.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. P1 High-priority user-facing bug, regression, or broken workflow. labels May 31, 2026
@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. merge-risk: 🚨 auth-provider 🚨 May break OAuth, tokens, provider routing, model choice, or credentials. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 31, 2026
@steipete steipete force-pushed the fix/manual-abort-session-lock branch from bc9d9eb to d88376e Compare May 31, 2026 13:29
@clawsweeper clawsweeper Bot removed the merge-risk: 🚨 auth-provider 🚨 May break OAuth, tokens, provider routing, model choice, or credentials. label May 31, 2026
@steipete steipete force-pushed the fix/manual-abort-session-lock branch from 70acd6d to 56fa542 Compare May 31, 2026 13:46
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. labels May 31, 2026
@steipete steipete merged commit 95890fe into main May 31, 2026
158 of 160 checks passed
@steipete steipete deleted the fix/manual-abort-session-lock branch May 31, 2026 13:53
matin added a commit to matin/openclaw that referenced this pull request Jun 3, 2026
…unsettled retained writes (#6)

A turn timeout aborts the run, but releaseHeldLockForAbort delegated to the
graceful fence release, whose waitForRetainedLockIdle is unbounded. A retained
transcript write pinned behind a hung provider stream (one that ignores abort,
e.g. a stalled streamGenerateContent) therefore held the session .jsonl lock
until the maxHoldMs watchdog (5-30 min), and every turn in between failed with
SessionWriteLockTimeoutError.

releaseHeldLockForAbort now races the graceful path against the abort settle
bound (OPENCLAW_EMBEDDED_ABORT_SETTLE_TIMEOUT_MS). When the bound expires the
underlying file lock is force-released and the controller poisoned via
takeoverDetected, so the aborted run cannot perform torn writes after losing
ownership. Retained writes that settle within the bound keep the existing
graceful fence semantics.

Prod incident: tulgey#225 (membrane 2026-06-01..03). Completes the lock-leak
family of openclaw#87278 / openclaw#88623 / openclaw#89811, which release on settled aborts but not on
aborts a hung provider call never lets settle.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling maintainer Maintainer-authored PR merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: S status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant