fix(agents): report stale session locks without cleanup by steipete · Pull Request #88658 · openclaw/openclaw

steipete · 2026-05-31T14:21:52Z

Summary

Split stale session-lock detection into report vs removal paths so live OpenClaw-owned stale locks surface as typed contention instead of being deleted.
Add SessionWriteLockStaleError and route stale acquire failures through embedded-run takeover, failover suppression, prompt-cache propagation, announce delivery, and QA retry classifiers.
Cover active in-process stale locks, live OpenClaw-owned stale locks, embedded post-prompt stale reacquire, and QA retry behavior.

Verification

node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts src/agents/failover-error.test.ts src/agents/subagent-announce-delivery.test.ts src/agents/embedded-agent-runner/google-prompt-cache.test.ts extensions/qa-lab/src/suite-runtime-agent-session.test.ts
pnpm tsgo:test:src failed on latest origin/main in unrelated src/agents/acp-spawn.test.ts(984,15): Object literal may only specify known properties, and 'thinking' does not exist...
.agents/skills/autoreview/scripts/autoreview --mode local clean after fixing accepted QA retry finding

Real behavior proof

Behavior addressed: stale session transcript locks owned by a live OpenClaw process are reported as typed stale-lock contention without deleting the lock; dead/orphan reclaim remains on the existing safe removal path.
Real environment tested: local macOS checkout on branch fix/session-stale-lock-diagnostics after rebase onto origin/main.
Exact steps or command run after this patch: node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts src/agents/failover-error.test.ts src/agents/subagent-announce-delivery.test.ts src/agents/embedded-agent-runner/google-prompt-cache.test.ts extensions/qa-lab/src/suite-runtime-agent-session.test.ts.
Evidence after fix: 3 Vitest shards passed; 6 files passed; 269 tests passed.
Observed result after fix: live OpenClaw-owned stale lock test throws SessionWriteLockStaleError and leaves the lock file in place; QA helper retries stale lock contention.
What was not tested: live gateway multi-agent sessions_send / sessions_spawn stress run; broad tsgo:test:src is blocked by unrelated latest-main ACP spawn test type error noted above.

clawsweeper · 2026-05-31T14:24:06Z

Codex review: needs real behavior proof before merge. Reviewed May 31, 2026, 10:28 AM ET / 14:28 UTC.

Summary
The branch adds a typed stale session-lock error, splits stale lock reporting from removal, routes the new contention signal through agent/QA classifiers, and adds focused tests.

PR surface: Source +106, Tests +98. Total +204 across 10 files.

Reproducibility: yes. source-reproducible: current main’s acquire path treats stale live OpenClaw locks as reclaimable and passes them to fs-safe’s removal path. I did not execute tests because this review is read-only.

Review metrics: 1 noteworthy metric.

Config behavior affected: 1 existing session.writeLock.staleMs behavior changed. Operators need the public help/docs to explain when stale locks are reported instead of reclaimed before merge.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🦐 gold shrimp
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

[P1] Add redacted real behavior proof from a gateway sessions_send/sessions_spawn or packaged smoke run, not only Vitest.
Update staleMs schema help and session management docs for report-only live locks.
[P1] Get explicit maintainer acceptance of the fail-closed stale-lock tradeoff.

Proof guidance:

[P1] Needs real behavior proof before merge: The PR body’s proof is targeted Vitest output only; before merge it needs redacted real gateway, packaged, terminal, log, or live-output proof of the changed session-lock behavior. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

[P1] Existing operators using session.writeLock.staleMs may still expect stale locks to be reclaimed; the PR changes live OpenClaw-owned too-old/hold-exceeded locks to report contention instead, while help/docs remain unchanged.
[P1] The PR has no live gateway sessions_send or sessions_spawn proof, so message delivery and session recovery under the reported concurrent workload remain unproven.

Maintainer options:

Update docs and prove live gateway behavior (recommended)
Before merge, update staleMs help/docs for report-only live OpenClaw locks and add redacted real gateway or packaged sessions_send/sessions_spawn proof showing locks are not deleted or stranded.
Accept the fail-closed tradeoff
Maintainers can intentionally merge with explicit acceptance that live stale locks now surface as contention instead of acquisition-time cleanup, with release-note context and follow-up live proof.

Next step before merge

[P1] Protected maintainer-labeled session-state PR needs human approval, real behavior proof, and the docs/help contract repair before merge.

Security
Cleared: No concrete security or supply-chain concern was found; the risk is session-state and compatibility behavior, not new third-party code or secret handling.

Review findings

[P2] Update stale-lock operator wording — src/agents/session-write-lock.ts:606-608

Review details

Best possible solution:

Land the fail-closed typed stale-lock path only after maintainer approval, operator help/docs alignment, and live or packaged proof for session send/spawn flows.

Do we have a high-confidence way to reproduce the issue?

Yes, source-reproducible: current main’s acquire path treats stale live OpenClaw locks as reclaimable and passes them to fs-safe’s removal path. I did not execute tests because this review is read-only.

Is this the best way to solve the issue?

Unclear: the code-level split is narrow and safer for lock integrity, but the docs/proof/product-approval gap means it is not merge-ready yet.

Full review comments:

[P2] Update stale-lock operator wording — src/agents/session-write-lock.ts:606-608
This change makes report-only stale reasons like too-old and hold-exceeded stop being removable during acquire/cleanup, so session.writeLock.staleMs no longer always means a stale lock will be reclaimed. Please update the schema help and session management docs to distinguish reported live OpenClaw locks from dead/orphan lock removal.
Confidence: 0.86

Overall correctness: patch is incorrect
Overall confidence: 0.82

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 88c99ddf5f82.

Label changes

Label changes:

add P1: The PR targets a linked P1 session-lock failure affecting sessions_send, sessions_spawn, and stuck-but-alive agent workflows.
add merge-risk: 🚨 compatibility: The branch changes the behavior of the existing session.writeLock.staleMs surface for live OpenClaw-owned stale locks.
add merge-risk: 🚨 message-delivery: The linked failure affects inter-agent send/spawn delivery, and live delivery behavior is not proven by the PR evidence.
add merge-risk: 🚨 session-state: Merging changes whether session transcript locks are deleted or preserved and reported as contention.
add rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🦐 gold shrimp.
add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body’s proof is targeted Vitest output only; before merge it needs redacted real gateway, packaged, terminal, log, or live-output proof of the changed session-lock behavior. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Label justifications:

P1: The PR targets a linked P1 session-lock failure affecting sessions_send, sessions_spawn, and stuck-but-alive agent workflows.
merge-risk: 🚨 compatibility: The branch changes the behavior of the existing session.writeLock.staleMs surface for live OpenClaw-owned stale locks.
merge-risk: 🚨 session-state: Merging changes whether session transcript locks are deleted or preserved and reported as contention.
merge-risk: 🚨 message-delivery: The linked failure affects inter-agent send/spawn delivery, and live delivery behavior is not proven by the PR evidence.
rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🦐 gold shrimp.
status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body’s proof is targeted Vitest output only; before merge it needs redacted real gateway, packaged, terminal, log, or live-output proof of the changed session-lock behavior. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Evidence reviewed

PR surface:

Source +106, Tests +98. Total +204 across 10 files.

View PR surface stats

Area	Files	Added	Removed	Net
Source	7	140	34	+106
Tests	3	100	2	+98
Docs	0	0	0	0
Config	0	0	0	0
Generated	0	0	0	0
Other	0	0	0	0
Total	10	240	36	+204

What I checked:

Repository policy read: Root and scoped agent/extension policies were read; the review applied the session-state/config compatibility and proof requirements. (AGENTS.md:16, 88c99ddf5f82)
Current acquire behavior: Current main uses the same reclaim helper for acquire-time stale detection and stale-lock removal, so report-only live stale reasons can flow into removal during acquisition. (src/agents/session-write-lock.ts:566, 88c99ddf5f82)
Dependency contract: @openclaw/fs-safe 0.3.0 calls shouldReclaim first, then attempts remove-if-unchanged via shouldRemoveStaleLock, and throws file_lock_stale when removal is not approved.
PR stale split: The PR adds a report-only stale path, blocks removal for report-only reasons, and converts file_lock_stale into SessionWriteLockStaleError. (src/agents/session-write-lock.ts:569, 840cc2dc9e44)
Classifier coverage: The PR broadens embedded lock handling so stale acquire errors mark takeover and cleanup reacquire suppression like timeout errors. (src/agents/embedded-agent-runner/run/attempt.session-lock.ts:654, 840cc2dc9e44)
Config/help mismatch: The PR changes staleMs behavior for live OpenClaw-owned stale locks, but the schema help still says staleMs controls when an existing lock can be reclaimed. (src/config/schema.help.ts:1651, 840cc2dc9e44)

Likely related people:

steipete: Peter Steinberger authored the current main session-write-lock implementation in b222b5f and authored this PR branch touching the same lock/session surfaces. (role: introduced current behavior and current PR author; confidence: medium; commits: b222b5f6faac, 840cc2dc9e44; files: src/agents/session-write-lock.ts, src/agents/session-write-lock-error.ts, src/agents/embedded-agent-runner/run/attempt.session-lock.ts)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

steipete · 2026-05-31T14:28:44Z

Landing proof for 840cc2d:

Behavior addressed: stale session locks owned by a live OpenClaw process are reported as typed lock-acquisition failures instead of being auto-deleted; safely reclaimable dead/orphan/recycled locks still clean up.

Exact steps or command run after this patch:

node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts src/agents/failover-error.test.ts src/agents/subagent-announce-delivery.test.ts src/agents/embedded-agent-runner/google-prompt-cache.test.ts extensions/qa-lab/src/suite-runtime-agent-session.test.ts
.agents/skills/autoreview/scripts/autoreview --mode local
pnpm tsgo:test:src
CI on PR head 840cc2d

Evidence after fix:

Focused Vitest passed: 6 files, 269 tests.
Autoreview clean after fixing the accepted QA retry-classifier finding.
CI: relevant runtime/check shards are green; check-test-types is red on unrelated latest-main src/agents/acp-spawn.test.ts(984,15) because thinking is not a known property. This file is outside the PR diff and reproduced locally after rebasing onto latest origin/main.

Known proof gap:

No live multi-agent stress repro was available for Session lock auto-cleanup on staleness detection #87779, so this lands the safe diagnostic/non-destructive stale-lock behavior and classifier propagation, not blind runtime cleanup/retry.

Follow-up to #88658. Retries transient stale session-lock acquire failures when diagnostics show the old stale report disappeared, was replaced by a fresh valid lock, or was replaced by a fresh payload-less lock still inside the mtime/orphan grace window. Preserves typed `SessionWriteLockStaleError` diagnostics for still-present live OpenClaw-owned stale locks. Proof: 53 focused session-write-lock tests passed locally and in the agents-core CI shard; `pnpm tsgo:test:src`, touched-file oxlint, `git diff --check`, and autoreview passed locally. CI run 26716843811 has unrelated failures in UI deadcode/types and bash-tools tests; session-write-lock tests passed in that run. Refs #87217.

Follow-up to openclaw#88658. Retries transient stale session-lock acquire failures when diagnostics show the old stale report disappeared, was replaced by a fresh valid lock, or was replaced by a fresh payload-less lock still inside the mtime/orphan grace window. Preserves typed `SessionWriteLockStaleError` diagnostics for still-present live OpenClaw-owned stale locks. Proof: 53 focused session-write-lock tests passed locally and in the agents-core CI shard; `pnpm tsgo:test:src`, touched-file oxlint, `git diff --check`, and autoreview passed locally. CI run 26716843811 has unrelated failures in UI deadcode/types and bash-tools tests; session-write-lock tests passed in that run. Refs openclaw#87217.

fix(agents): report stale session locks without cleanup

840cc2d

openclaw-barnacle Bot added agents Agent runtime and tooling extensions: qa-lab size: M maintainer Maintainer-authored PR labels May 31, 2026

steipete merged commit 7ca7712 into main May 31, 2026
218 of 229 checks passed

steipete deleted the fix/session-stale-lock-diagnostics branch May 31, 2026 14:28

This was referenced May 31, 2026

fix(agents): retry stale session lock reports #87342

Closed

fix(agents): retry transient stale session locks #88670

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(agents): report stale session locks without cleanup#88658

fix(agents): report stale session locks without cleanup#88658
steipete merged 1 commit into
mainfrom
fix/session-stale-lock-diagnostics

steipete commented May 31, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented May 31, 2026 •

edited

Loading

Uh oh!

steipete commented May 31, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

steipete commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Verification

Real behavior proof

Uh oh!

clawsweeper Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

steipete commented May 31, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

steipete commented May 31, 2026 •

edited

Loading

clawsweeper Bot commented May 31, 2026 •

edited

Loading