Release session lock on user abort by rohitjavvadi · Pull Request #88622 · openclaw/openclaw

rohitjavvadi · 2026-05-31T12:57:44Z

Summary

release the retained embedded-session write lock for manual/user aborts, not only timeout aborts
keep timeout abandoned-run tracking restricted to timeout aborts
add a regression test that proves the retained lock is released before cleanup reacquires a lock after user abort

Motivation / Root Cause

abortRun(false) aborted compaction and the active session, but the retained session write lock release lived inside the timeout-only branch. A manual abort could leave the original attempt's lock held until teardown, so cleanup or the next turn could block behind the same run's stale held lock.

User-Visible Behavior

Manual stop/interrupt of an embedded run should release the session write lock promptly, allowing cleanup and follow-up work for the same session to proceed without waiting for stale lock timeout recovery.

Real Behavior Proof

Behavior addressed: manual/user abort of an embedded run now releases the retained session write lock before cleanup reacquires a lock for the same session.

Real environment tested: local macOS checkout using the real session-write-lock implementation and a temp persisted session file on disk.

Exact steps or command run after this patch:

node --import tsx /tmp/openclaw-user-abort-lock-proof.mjs
node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts src/agents/embedded-agent-runner/runs.test.ts src/auto-reply/reply/reply-run-registry.test.ts src/auto-reply/reply/abort.test.ts
node scripts/run-oxlint.mjs src/agents/embedded-agent-runner/run/attempt.ts src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts
git diff --check

Evidence after fix:

persistedSessionFile=/var/folders/yb/3wkthcjd5vddlnhd480pj8yr0000gn/T/openclaw-user-abort-lock-0H5QQs/agent/session.json
lockPath=/var/folders/yb/3wkthcjd5vddlnhd480pj8yr0000gn/T/openclaw-user-abort-lock-0H5QQs/agent/session.json.lock
beforeAbortRelease=blocked(SessionWriteLockTimeoutError)
abortRelease=releaseHeldLockForAbort completed
afterAbortRelease=cleanupLockAcquired
lockFileExistsAfterCleanup=false

Observed result after fix: before the abort-release call, a cleanup-style lock acquire against the same persisted session file timed out on the live lock; after releaseHeldLockForAbort, cleanup acquired and released the lock, and the lock file was gone. Focused Vitest passed 5 files / 102 tests; targeted lint passed; git diff --check passed.

What was not tested: no live gateway/manual-stop smoke or channel delivery smoke was run.

Compatibility and Risks

No config, API, persistence format, provider, or channel contract changes. The risk is limited to abort/session-lock lifecycle ordering; the release helper is already used for timeout aborts and yield-abort cleanup, and timeout abandoned-run marking remains timeout-only.

clawsweeper · 2026-05-31T13:06:16Z

Codex review: needs maintainer review before merge. Reviewed May 31, 2026, 9:33 AM ET / 13:33 UTC.

Summary
The PR moves retained session lock release out of the timeout-only abort branch and adds a focused user-abort lock ordering regression test.

PR surface: Source -2, Tests +76. Total +74 across 2 files.

Reproducibility: yes. Current main shows the release call guarded by the timeout branch while manual aborts call abortRun(false), matching the reported lock-leak path; I did not run a live gateway stop/retry repro in this read-only review.

Review metrics: none identified.

Merge readiness
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

[P1] A sibling open PR at fix(agents): release session lock on manual abort #88623 targets the same bug, so maintainers should land one canonical branch rather than both.
[P1] No live gateway manual-stop followed by immediate retry smoke was supplied; the provided proof is strong at the session-lock and cleanup level.

Maintainer options:

Decide the mitigation before merge
Land one canonical narrow fix that releases the retained lock for all abort kinds while keeping abandoned-run tracking timeout-only.
Pause or close
Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

Maintainers should choose the canonical landing branch between this focused PR and the sibling open PR; no automated repair is needed.

Security
Cleared: The diff only changes embedded-runner TypeScript and a focused test; it does not touch dependencies, workflows, package metadata, secrets, or code-download surfaces.

Review details

Best possible solution:

Land one canonical narrow fix that releases the retained lock for all abort kinds while keeping abandoned-run tracking timeout-only.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main shows the release call guarded by the timeout branch while manual aborts call abortRun(false), matching the reported lock-leak path; I did not run a live gateway stop/retry repro in this read-only review.

Is this the best way to solve the issue?

Yes. Moving the release call out of the timeout-only branch while keeping abandoned-run marking inside the timeout branch is the narrowest maintainable fix; maintainers just need to choose this branch or the sibling branch.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 9518d1f27ccd.

Label changes

Label changes:

add proof: sufficient: Contributor real behavior proof is sufficient. The PR body supplies after-fix terminal proof using the real session-write-lock implementation and a temp persisted session file, with cleanup acquiring after abort release.

Label justifications:

P1: The linked bug can leave an agent workflow blocked after a manual stop until session-lock timeout or watchdog recovery.
rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body supplies after-fix terminal proof using the real session-write-lock implementation and a temp persisted session file, with cleanup acquiring after abort release.
proof: sufficient: Contributor real behavior proof is sufficient. The PR body supplies after-fix terminal proof using the real session-write-lock implementation and a temp persisted session file, with cleanup acquiring after abort release.

Evidence reviewed

PR surface:

Source -2, Tests +76. Total +74 across 2 files.

View PR surface stats

Area	Files	Added	Removed	Net
Source	1	3	5	-2
Tests	1	76	0	+76
Docs	0	0	0	0
Config	0	0	0	0
Generated	0	0	0	0
Other	0	0	0	0
Total	2	79	5	+74

What I checked:

Repository policy read: Read the full root policy plus scoped agent runner guides; the scoped guide supports full-runner coverage when the behavior requires embedded-runner cross-component proof. (AGENTS.md:1, 9518d1f27ccd)
Current main root cause: On current main, releaseHeldLockForAbort() is still nested inside if (isTimeout && queueHandleForAbandonment), so a non-timeout user abort does not schedule the retained lock release from this abort path. (src/agents/embedded-agent-runner/run/attempt.ts:2979, 9518d1f27ccd)
PR runtime change: The PR head moves sessionLockController.releaseHeldLockForAbort() immediately after abortActiveSession() while leaving abandoned-run marking restricted to timeout aborts. (src/agents/embedded-agent-runner/run/attempt.ts:2976, eeff9c42fed7)
Session-lock contract: releaseHeldLockForAbort() delegates to the fenced held-lock release path, and cleanup reacquires through acquireForCleanup(), which is the lock-ordering path the PR is fixing. (src/agents/embedded-agent-runner/run/attempt.session-lock.ts:896, 9518d1f27ccd)
Regression coverage: The new test aborts a running embedded prompt and asserts the retained lock release happens before cleanup's second acquire. (src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts:64, eeff9c42fed7)
Contributor real proof: The PR body includes after-fix terminal proof with the real session-write-lock implementation and a temp persisted session file, showing cleanup blocked before abort release and acquired after release. (eeff9c42fed7)

Likely related people:

steipete: Current blame for the abort/session-lock code and session-lock controller points to Peter Steinberger's recent embedded-runner refactor, and the sibling fix PR also changes this path. (role: recent area contributor and sibling-fix author; confidence: high; commits: 9f5c981f9fff, d88376ea6786; files: src/agents/embedded-agent-runner/run/attempt.ts, src/agents/embedded-agent-runner/run/attempt.session-lock.ts)
openperf: Merged PR fix(agents): release session write lock on timeout abort (#86816) #87278 added the timeout-only releaseHeldLockForAbort() call that this PR generalizes to manual aborts. (role: introduced timeout-abort release behavior; confidence: medium; commits: 65fb56513fb2; files: src/agents/embedded-agent-runner/run/attempt.ts, src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

rohitjavvadi · 2026-05-31T13:10:58Z

Rebased this PR onto current origin/main (17c8602a9c); refreshed head is 49fc82ca37. The diff remains limited to:

src/agents/embedded-agent-runner/run/attempt.ts
src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts

Verification after the rebase:

node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts src/agents/embedded-agent-runner/runs.test.ts src/auto-reply/reply/reply-run-registry.test.ts src/auto-reply/reply/abort.test.ts
# 5 files passed, 102 tests passed

git diff --check origin/main...HEAD -- src/agents/embedded-agent-runner/run/attempt.ts src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts
# passed

I also tried the targeted lint wrapper:

node scripts/run-oxlint.mjs src/agents/embedded-agent-runner/run/attempt.ts src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts

That currently stops before linting the touched files on an unrelated plugin-SDK boundary DTS error in src/infra/device-auth-store.ts; this PR does not touch that file (git diff --exit-code origin/main...HEAD -- src/infra/device-auth-store.ts passed).

@clawsweeper re-review

clawsweeper · 2026-05-31T13:11:01Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26713558998
Updated: 2026-05-31T13:18:00.545Z

rohitjavvadi · 2026-05-31T13:26:52Z

Rebased again onto current origin/main (9518d1f27c); refreshed head is eeff9c42fe. The only previous red check on the prior head was check-additional-runtime-topology-architecture, failing in src/cron/run-log/sqlite-store.ts Kysely guardrails, which this PR does not touch. The updated main contains the upstream fix, and the architecture lane now passes locally.

Current diff remains limited to:

src/agents/embedded-agent-runner/run/attempt.ts
src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts

Verification after this rebase:

node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts src/agents/embedded-agent-runner/runs.test.ts src/auto-reply/reply/reply-run-registry.test.ts src/auto-reply/reply/abort.test.ts
# 5 files passed, 102 tests passed

pnpm check:architecture
# passed; Kysely guardrails OK

node scripts/run-oxlint.mjs src/agents/embedded-agent-runner/run/attempt.ts src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts
# passed

git diff --check origin/main...HEAD -- src/agents/embedded-agent-runner/run/attempt.ts src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts
# passed

This branch remains the narrow two-file session-lock fix with no device-auth/parser side change.

@clawsweeper re-review

clawsweeper · 2026-05-31T13:26:55Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26713905675
Updated: 2026-05-31T13:33:41.555Z

steipete · 2026-05-31T13:55:05Z

Thanks @rohitjavvadi. This is now covered on main by #88623 / 95890fe.

I landed the same fix shape there: manual/user aborts release the retained embedded-session lock through the abort lock-release path, while timeout abandoned-run tracking stays timeout-only. I also kept the regression coverage focused around the shared abort cleanup helper and verified full PR CI on 56fa542.

Closing this as superseded by the landed PR so we do not keep two copies of the same session-lock change open.

rohitjavvadi marked this pull request as ready for review May 31, 2026 13:04

clawsweeper Bot mentioned this pull request May 31, 2026

fix(agents): release session lock on manual abort #88623

Merged

rohitjavvadi force-pushed the fix/agents-release-lock-on-user-abort branch from 29cb280 to 49fc82c Compare May 31, 2026 13:10

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 31, 2026

fix: release session lock on user abort

eeff9c4

rohitjavvadi force-pushed the fix/agents-release-lock-on-user-abort branch from 49fc82c to eeff9c4 Compare May 31, 2026 13:26

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 31, 2026

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 31, 2026

steipete closed this May 31, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Release session lock on user abort#88622

Release session lock on user abort#88622
rohitjavvadi wants to merge 1 commit into
openclaw:mainfrom
rohitjavvadi:fix/agents-release-lock-on-user-abort

rohitjavvadi commented May 31, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented May 31, 2026 •

edited

Loading

Uh oh!

rohitjavvadi commented May 31, 2026

Uh oh!

clawsweeper Bot commented May 31, 2026 •

edited

Loading

Uh oh!

rohitjavvadi commented May 31, 2026

Uh oh!

clawsweeper Bot commented May 31, 2026 •

edited

Loading

Uh oh!

steipete commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

rohitjavvadi commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Root Cause

User-Visible Behavior

Real Behavior Proof

Compatibility and Risks

Uh oh!

clawsweeper Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rohitjavvadi commented May 31, 2026

Uh oh!

clawsweeper Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rohitjavvadi commented May 31, 2026

Uh oh!

clawsweeper Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

steipete commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rohitjavvadi commented May 31, 2026 •

edited

Loading

clawsweeper Bot commented May 31, 2026 •

edited

Loading

clawsweeper Bot commented May 31, 2026 •

edited

Loading

clawsweeper Bot commented May 31, 2026 •

edited

Loading