Skip to content

fix(agents): report stale session locks without cleanup#88658

Merged
steipete merged 1 commit into
mainfrom
fix/session-stale-lock-diagnostics
May 31, 2026
Merged

fix(agents): report stale session locks without cleanup#88658
steipete merged 1 commit into
mainfrom
fix/session-stale-lock-diagnostics

Conversation

@steipete

@steipete steipete commented May 31, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Split stale session-lock detection into report vs removal paths so live OpenClaw-owned stale locks surface as typed contention instead of being deleted.
  • Add SessionWriteLockStaleError and route stale acquire failures through embedded-run takeover, failover suppression, prompt-cache propagation, announce delivery, and QA retry classifiers.
  • Cover active in-process stale locks, live OpenClaw-owned stale locks, embedded post-prompt stale reacquire, and QA retry behavior.

Refs #87779

Verification

  • node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts src/agents/failover-error.test.ts src/agents/subagent-announce-delivery.test.ts src/agents/embedded-agent-runner/google-prompt-cache.test.ts extensions/qa-lab/src/suite-runtime-agent-session.test.ts
  • pnpm tsgo:test:src failed on latest origin/main in unrelated src/agents/acp-spawn.test.ts(984,15): Object literal may only specify known properties, and 'thinking' does not exist...
  • .agents/skills/autoreview/scripts/autoreview --mode local clean after fixing accepted QA retry finding

Real behavior proof

Behavior addressed: stale session transcript locks owned by a live OpenClaw process are reported as typed stale-lock contention without deleting the lock; dead/orphan reclaim remains on the existing safe removal path.
Real environment tested: local macOS checkout on branch fix/session-stale-lock-diagnostics after rebase onto origin/main.
Exact steps or command run after this patch: node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts src/agents/failover-error.test.ts src/agents/subagent-announce-delivery.test.ts src/agents/embedded-agent-runner/google-prompt-cache.test.ts extensions/qa-lab/src/suite-runtime-agent-session.test.ts.
Evidence after fix: 3 Vitest shards passed; 6 files passed; 269 tests passed.
Observed result after fix: live OpenClaw-owned stale lock test throws SessionWriteLockStaleError and leaves the lock file in place; QA helper retries stale lock contention.
What was not tested: live gateway multi-agent sessions_send / sessions_spawn stress run; broad tsgo:test:src is blocked by unrelated latest-main ACP spawn test type error noted above.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling extensions: qa-lab size: M maintainer Maintainer-authored PR labels May 31, 2026
@clawsweeper

clawsweeper Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed May 31, 2026, 10:28 AM ET / 14:28 UTC.

Summary
The branch adds a typed stale session-lock error, splits stale lock reporting from removal, routes the new contention signal through agent/QA classifiers, and adds focused tests.

PR surface: Source +106, Tests +98. Total +204 across 10 files.

Reproducibility: yes. source-reproducible: current main’s acquire path treats stale live OpenClaw locks as reclaimable and passes them to fs-safe’s removal path. I did not execute tests because this review is read-only.

Review metrics: 1 noteworthy metric.

  • Config behavior affected: 1 existing session.writeLock.staleMs behavior changed. Operators need the public help/docs to explain when stale locks are reported instead of reclaimed before merge.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🦐 gold shrimp
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Add redacted real behavior proof from a gateway sessions_send/sessions_spawn or packaged smoke run, not only Vitest.
  • Update staleMs schema help and session management docs for report-only live locks.
  • [P1] Get explicit maintainer acceptance of the fail-closed stale-lock tradeoff.

Proof guidance:

  • [P1] Needs real behavior proof before merge: The PR body’s proof is targeted Vitest output only; before merge it needs redacted real gateway, packaged, terminal, log, or live-output proof of the changed session-lock behavior. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • [P1] Existing operators using session.writeLock.staleMs may still expect stale locks to be reclaimed; the PR changes live OpenClaw-owned too-old/hold-exceeded locks to report contention instead, while help/docs remain unchanged.
  • [P1] The PR has no live gateway sessions_send or sessions_spawn proof, so message delivery and session recovery under the reported concurrent workload remain unproven.

Maintainer options:

  1. Update docs and prove live gateway behavior (recommended)
    Before merge, update staleMs help/docs for report-only live OpenClaw locks and add redacted real gateway or packaged sessions_send/sessions_spawn proof showing locks are not deleted or stranded.
  2. Accept the fail-closed tradeoff
    Maintainers can intentionally merge with explicit acceptance that live stale locks now surface as contention instead of acquisition-time cleanup, with release-note context and follow-up live proof.

Next step before merge

  • [P1] Protected maintainer-labeled session-state PR needs human approval, real behavior proof, and the docs/help contract repair before merge.

Security
Cleared: No concrete security or supply-chain concern was found; the risk is session-state and compatibility behavior, not new third-party code or secret handling.

Review findings

  • [P2] Update stale-lock operator wording — src/agents/session-write-lock.ts:606-608
Review details

Best possible solution:

Land the fail-closed typed stale-lock path only after maintainer approval, operator help/docs alignment, and live or packaged proof for session send/spawn flows.

Do we have a high-confidence way to reproduce the issue?

Yes, source-reproducible: current main’s acquire path treats stale live OpenClaw locks as reclaimable and passes them to fs-safe’s removal path. I did not execute tests because this review is read-only.

Is this the best way to solve the issue?

Unclear: the code-level split is narrow and safer for lock integrity, but the docs/proof/product-approval gap means it is not merge-ready yet.

Full review comments:

  • [P2] Update stale-lock operator wording — src/agents/session-write-lock.ts:606-608
    This change makes report-only stale reasons like too-old and hold-exceeded stop being removable during acquire/cleanup, so session.writeLock.staleMs no longer always means a stale lock will be reclaimed. Please update the schema help and session management docs to distinguish reported live OpenClaw locks from dead/orphan lock removal.
    Confidence: 0.86

Overall correctness: patch is incorrect
Overall confidence: 0.82

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 88c99ddf5f82.

Label changes

Label changes:

  • add P1: The PR targets a linked P1 session-lock failure affecting sessions_send, sessions_spawn, and stuck-but-alive agent workflows.
  • add merge-risk: 🚨 compatibility: The branch changes the behavior of the existing session.writeLock.staleMs surface for live OpenClaw-owned stale locks.
  • add merge-risk: 🚨 message-delivery: The linked failure affects inter-agent send/spawn delivery, and live delivery behavior is not proven by the PR evidence.
  • add merge-risk: 🚨 session-state: Merging changes whether session transcript locks are deleted or preserved and reported as contention.
  • add rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🦐 gold shrimp.
  • add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body’s proof is targeted Vitest output only; before merge it needs redacted real gateway, packaged, terminal, log, or live-output proof of the changed session-lock behavior. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Label justifications:

  • P1: The PR targets a linked P1 session-lock failure affecting sessions_send, sessions_spawn, and stuck-but-alive agent workflows.
  • merge-risk: 🚨 compatibility: The branch changes the behavior of the existing session.writeLock.staleMs surface for live OpenClaw-owned stale locks.
  • merge-risk: 🚨 session-state: Merging changes whether session transcript locks are deleted or preserved and reported as contention.
  • merge-risk: 🚨 message-delivery: The linked failure affects inter-agent send/spawn delivery, and live delivery behavior is not proven by the PR evidence.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🦐 gold shrimp.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body’s proof is targeted Vitest output only; before merge it needs redacted real gateway, packaged, terminal, log, or live-output proof of the changed session-lock behavior. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
Evidence reviewed

PR surface:

Source +106, Tests +98. Total +204 across 10 files.

View PR surface stats
Area Files Added Removed Net
Source 7 140 34 +106
Tests 3 100 2 +98
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 10 240 36 +204

What I checked:

  • Repository policy read: Root and scoped agent/extension policies were read; the review applied the session-state/config compatibility and proof requirements. (AGENTS.md:16, 88c99ddf5f82)
  • Current acquire behavior: Current main uses the same reclaim helper for acquire-time stale detection and stale-lock removal, so report-only live stale reasons can flow into removal during acquisition. (src/agents/session-write-lock.ts:566, 88c99ddf5f82)
  • Dependency contract: @openclaw/fs-safe 0.3.0 calls shouldReclaim first, then attempts remove-if-unchanged via shouldRemoveStaleLock, and throws file_lock_stale when removal is not approved.
  • PR stale split: The PR adds a report-only stale path, blocks removal for report-only reasons, and converts file_lock_stale into SessionWriteLockStaleError. (src/agents/session-write-lock.ts:569, 840cc2dc9e44)
  • Classifier coverage: The PR broadens embedded lock handling so stale acquire errors mark takeover and cleanup reacquire suppression like timeout errors. (src/agents/embedded-agent-runner/run/attempt.session-lock.ts:654, 840cc2dc9e44)
  • Config/help mismatch: The PR changes staleMs behavior for live OpenClaw-owned stale locks, but the schema help still says staleMs controls when an existing lock can be reclaimed. (src/config/schema.help.ts:1651, 840cc2dc9e44)

Likely related people:

  • steipete: Peter Steinberger authored the current main session-write-lock implementation in b222b5f and authored this PR branch touching the same lock/session surfaces. (role: introduced current behavior and current PR author; confidence: medium; commits: b222b5f6faac, 840cc2dc9e44; files: src/agents/session-write-lock.ts, src/agents/session-write-lock-error.ts, src/agents/embedded-agent-runner/run/attempt.session-lock.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. labels May 31, 2026
@steipete

Copy link
Copy Markdown
Contributor Author

Landing proof for 840cc2d:

Behavior addressed: stale session locks owned by a live OpenClaw process are reported as typed lock-acquisition failures instead of being auto-deleted; safely reclaimable dead/orphan/recycled locks still clean up.

Exact steps or command run after this patch:

  • node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts src/agents/failover-error.test.ts src/agents/subagent-announce-delivery.test.ts src/agents/embedded-agent-runner/google-prompt-cache.test.ts extensions/qa-lab/src/suite-runtime-agent-session.test.ts
  • .agents/skills/autoreview/scripts/autoreview --mode local
  • pnpm tsgo:test:src
  • CI on PR head 840cc2d

Evidence after fix:

  • Focused Vitest passed: 6 files, 269 tests.
  • Autoreview clean after fixing the accepted QA retry-classifier finding.
  • CI: relevant runtime/check shards are green; check-test-types is red on unrelated latest-main src/agents/acp-spawn.test.ts(984,15) because thinking is not a known property. This file is outside the PR diff and reproduced locally after rebasing onto latest origin/main.

Known proof gap:

@steipete steipete merged commit 7ca7712 into main May 31, 2026
218 of 229 checks passed
@steipete steipete deleted the fix/session-stale-lock-diagnostics branch May 31, 2026 14:28
steipete added a commit that referenced this pull request May 31, 2026
Follow-up to #88658. Retries transient stale session-lock acquire failures when diagnostics show the old stale report disappeared, was replaced by a fresh valid lock, or was replaced by a fresh payload-less lock still inside the mtime/orphan grace window.

Preserves typed `SessionWriteLockStaleError` diagnostics for still-present live OpenClaw-owned stale locks.

Proof: 53 focused session-write-lock tests passed locally and in the agents-core CI shard; `pnpm tsgo:test:src`, touched-file oxlint, `git diff --check`, and autoreview passed locally. CI run 26716843811 has unrelated failures in UI deadcode/types and bash-tools tests; session-write-lock tests passed in that run.

Refs #87217.
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request Jun 1, 2026
Follow-up to openclaw#88658. Retries transient stale session-lock acquire failures when diagnostics show the old stale report disappeared, was replaced by a fresh valid lock, or was replaced by a fresh payload-less lock still inside the mtime/orphan grace window.

Preserves typed `SessionWriteLockStaleError` diagnostics for still-present live OpenClaw-owned stale locks.

Proof: 53 focused session-write-lock tests passed locally and in the agents-core CI shard; `pnpm tsgo:test:src`, touched-file oxlint, `git diff --check`, and autoreview passed locally. CI run 26716843811 has unrelated failures in UI deadcode/types and bash-tools tests; session-write-lock tests passed in that run.

Refs openclaw#87217.
SYU8384 pushed a commit to SYU8384/openclaw that referenced this pull request Jun 3, 2026
Follow-up to openclaw#88658. Retries transient stale session-lock acquire failures when diagnostics show the old stale report disappeared, was replaced by a fresh valid lock, or was replaced by a fresh payload-less lock still inside the mtime/orphan grace window.

Preserves typed `SessionWriteLockStaleError` diagnostics for still-present live OpenClaw-owned stale locks.

Proof: 53 focused session-write-lock tests passed locally and in the agents-core CI shard; `pnpm tsgo:test:src`, touched-file oxlint, `git diff --check`, and autoreview passed locally. CI run 26716843811 has unrelated failures in UI deadcode/types and bash-tools tests; session-write-lock tests passed in that run.

Refs openclaw#87217.
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
Follow-up to openclaw#88658. Retries transient stale session-lock acquire failures when diagnostics show the old stale report disappeared, was replaced by a fresh valid lock, or was replaced by a fresh payload-less lock still inside the mtime/orphan grace window.

Preserves typed `SessionWriteLockStaleError` diagnostics for still-present live OpenClaw-owned stale locks.

Proof: 53 focused session-write-lock tests passed locally and in the agents-core CI shard; `pnpm tsgo:test:src`, touched-file oxlint, `git diff --check`, and autoreview passed locally. CI run 26716843811 has unrelated failures in UI deadcode/types and bash-tools tests; session-write-lock tests passed in that run.

Refs openclaw#87217.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling extensions: qa-lab maintainer Maintainer-authored PR merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: M status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant