Skip to content

Release session lock on user abort#88622

Closed
rohitjavvadi wants to merge 1 commit into
openclaw:mainfrom
rohitjavvadi:fix/agents-release-lock-on-user-abort
Closed

Release session lock on user abort#88622
rohitjavvadi wants to merge 1 commit into
openclaw:mainfrom
rohitjavvadi:fix/agents-release-lock-on-user-abort

Conversation

@rohitjavvadi

@rohitjavvadi rohitjavvadi commented May 31, 2026

Copy link
Copy Markdown
Contributor

Summary

  • release the retained embedded-session write lock for manual/user aborts, not only timeout aborts
  • keep timeout abandoned-run tracking restricted to timeout aborts
  • add a regression test that proves the retained lock is released before cleanup reacquires a lock after user abort

Fixes #88600

Motivation / Root Cause

abortRun(false) aborted compaction and the active session, but the retained session write lock release lived inside the timeout-only branch. A manual abort could leave the original attempt's lock held until teardown, so cleanup or the next turn could block behind the same run's stale held lock.

User-Visible Behavior

Manual stop/interrupt of an embedded run should release the session write lock promptly, allowing cleanup and follow-up work for the same session to proceed without waiting for stale lock timeout recovery.

Real Behavior Proof

Behavior addressed: manual/user abort of an embedded run now releases the retained session write lock before cleanup reacquires a lock for the same session.

Real environment tested: local macOS checkout using the real session-write-lock implementation and a temp persisted session file on disk.

Exact steps or command run after this patch:

node --import tsx /tmp/openclaw-user-abort-lock-proof.mjs
node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts src/agents/embedded-agent-runner/runs.test.ts src/auto-reply/reply/reply-run-registry.test.ts src/auto-reply/reply/abort.test.ts
node scripts/run-oxlint.mjs src/agents/embedded-agent-runner/run/attempt.ts src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts
git diff --check

Evidence after fix:

persistedSessionFile=/var/folders/yb/3wkthcjd5vddlnhd480pj8yr0000gn/T/openclaw-user-abort-lock-0H5QQs/agent/session.json
lockPath=/var/folders/yb/3wkthcjd5vddlnhd480pj8yr0000gn/T/openclaw-user-abort-lock-0H5QQs/agent/session.json.lock
beforeAbortRelease=blocked(SessionWriteLockTimeoutError)
abortRelease=releaseHeldLockForAbort completed
afterAbortRelease=cleanupLockAcquired
lockFileExistsAfterCleanup=false

Observed result after fix: before the abort-release call, a cleanup-style lock acquire against the same persisted session file timed out on the live lock; after releaseHeldLockForAbort, cleanup acquired and released the lock, and the lock file was gone. Focused Vitest passed 5 files / 102 tests; targeted lint passed; git diff --check passed.

What was not tested: no live gateway/manual-stop smoke or channel delivery smoke was run.

Compatibility and Risks

No config, API, persistence format, provider, or channel contract changes. The risk is limited to abort/session-lock lifecycle ordering; the release helper is already used for timeout aborts and yield-abort cleanup, and timeout abandoned-run marking remains timeout-only.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels May 31, 2026
@rohitjavvadi rohitjavvadi marked this pull request as ready for review May 31, 2026 13:04
@clawsweeper

clawsweeper Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed May 31, 2026, 9:33 AM ET / 13:33 UTC.

Summary
The PR moves retained session lock release out of the timeout-only abort branch and adds a focused user-abort lock ordering regression test.

PR surface: Source -2, Tests +76. Total +74 across 2 files.

Reproducibility: yes. Current main shows the release call guarded by the timeout branch while manual aborts call abortRun(false), matching the reported lock-leak path; I did not run a live gateway stop/retry repro in this read-only review.

Review metrics: none identified.

Merge readiness
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

  • [P1] A sibling open PR at fix(agents): release session lock on manual abort #88623 targets the same bug, so maintainers should land one canonical branch rather than both.
  • [P1] No live gateway manual-stop followed by immediate retry smoke was supplied; the provided proof is strong at the session-lock and cleanup level.

Maintainer options:

  1. Decide the mitigation before merge
    Land one canonical narrow fix that releases the retained lock for all abort kinds while keeping abandoned-run tracking timeout-only.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

  • Maintainers should choose the canonical landing branch between this focused PR and the sibling open PR; no automated repair is needed.

Security
Cleared: The diff only changes embedded-runner TypeScript and a focused test; it does not touch dependencies, workflows, package metadata, secrets, or code-download surfaces.

Review details

Best possible solution:

Land one canonical narrow fix that releases the retained lock for all abort kinds while keeping abandoned-run tracking timeout-only.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main shows the release call guarded by the timeout branch while manual aborts call abortRun(false), matching the reported lock-leak path; I did not run a live gateway stop/retry repro in this read-only review.

Is this the best way to solve the issue?

Yes. Moving the release call out of the timeout-only branch while keeping abandoned-run marking inside the timeout branch is the narrowest maintainable fix; maintainers just need to choose this branch or the sibling branch.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 9518d1f27ccd.

Label changes

Label changes:

  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body supplies after-fix terminal proof using the real session-write-lock implementation and a temp persisted session file, with cleanup acquiring after abort release.

Label justifications:

  • P1: The linked bug can leave an agent workflow blocked after a manual stop until session-lock timeout or watchdog recovery.
  • rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body supplies after-fix terminal proof using the real session-write-lock implementation and a temp persisted session file, with cleanup acquiring after abort release.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body supplies after-fix terminal proof using the real session-write-lock implementation and a temp persisted session file, with cleanup acquiring after abort release.
Evidence reviewed

PR surface:

Source -2, Tests +76. Total +74 across 2 files.

View PR surface stats
Area Files Added Removed Net
Source 1 3 5 -2
Tests 1 76 0 +76
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 2 79 5 +74

What I checked:

Likely related people:

  • steipete: Current blame for the abort/session-lock code and session-lock controller points to Peter Steinberger's recent embedded-runner refactor, and the sibling fix PR also changes this path. (role: recent area contributor and sibling-fix author; confidence: high; commits: 9f5c981f9fff, d88376ea6786; files: src/agents/embedded-agent-runner/run/attempt.ts, src/agents/embedded-agent-runner/run/attempt.session-lock.ts)
  • openperf: Merged PR fix(agents): release session write lock on timeout abort (#86816) #87278 added the timeout-only releaseHeldLockForAbort() call that this PR generalizes to manual aborts. (role: introduced timeout-abort release behavior; confidence: medium; commits: 65fb56513fb2; files: src/agents/embedded-agent-runner/run/attempt.ts, src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. P1 High-priority user-facing bug, regression, or broken workflow. labels May 31, 2026
@rohitjavvadi rohitjavvadi force-pushed the fix/agents-release-lock-on-user-abort branch from 29cb280 to 49fc82c Compare May 31, 2026 13:10
@rohitjavvadi

Copy link
Copy Markdown
Contributor Author

Rebased this PR onto current origin/main (17c8602a9c); refreshed head is 49fc82ca37. The diff remains limited to:

  • src/agents/embedded-agent-runner/run/attempt.ts
  • src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts

Verification after the rebase:

node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts src/agents/embedded-agent-runner/runs.test.ts src/auto-reply/reply/reply-run-registry.test.ts src/auto-reply/reply/abort.test.ts
# 5 files passed, 102 tests passed

git diff --check origin/main...HEAD -- src/agents/embedded-agent-runner/run/attempt.ts src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts
# passed

I also tried the targeted lint wrapper:

node scripts/run-oxlint.mjs src/agents/embedded-agent-runner/run/attempt.ts src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts

That currently stops before linting the touched files on an unrelated plugin-SDK boundary DTS error in src/infra/device-auth-store.ts; this PR does not touch that file (git diff --exit-code origin/main...HEAD -- src/infra/device-auth-store.ts passed).

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 31, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. labels May 31, 2026
@rohitjavvadi rohitjavvadi force-pushed the fix/agents-release-lock-on-user-abort branch from 49fc82c to eeff9c4 Compare May 31, 2026 13:26
@rohitjavvadi

Copy link
Copy Markdown
Contributor Author

Rebased again onto current origin/main (9518d1f27c); refreshed head is eeff9c42fe. The only previous red check on the prior head was check-additional-runtime-topology-architecture, failing in src/cron/run-log/sqlite-store.ts Kysely guardrails, which this PR does not touch. The updated main contains the upstream fix, and the architecture lane now passes locally.

Current diff remains limited to:

  • src/agents/embedded-agent-runner/run/attempt.ts
  • src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts

Verification after this rebase:

node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts src/agents/embedded-agent-runner/runs.test.ts src/auto-reply/reply/reply-run-registry.test.ts src/auto-reply/reply/abort.test.ts
# 5 files passed, 102 tests passed

pnpm check:architecture
# passed; Kysely guardrails OK

node scripts/run-oxlint.mjs src/agents/embedded-agent-runner/run/attempt.ts src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts
# passed

git diff --check origin/main...HEAD -- src/agents/embedded-agent-runner/run/attempt.ts src/agents/embedded-agent-runner/run/attempt.user-abort-lock.test.ts
# passed

This branch remains the narrow two-file session-lock fix with no device-auth/parser side change.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 31, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 31, 2026
@steipete

Copy link
Copy Markdown
Contributor

Thanks @rohitjavvadi. This is now covered on main by #88623 / 95890fe.

I landed the same fix shape there: manual/user aborts release the retained embedded-session lock through the abort lock-release path, while timeout abandoned-run tracking stays timeout-only. I also kept the regression coverage focused around the shared abort cleanup helper and verified full PR CI on 56fa542.

Closing this as superseded by the landed PR so we do not keep two copies of the same session-lock change open.

@steipete steipete closed this May 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. size: S status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Session file lock leak when user manually aborts agent (non-timeout abort never releases lock)

2 participants