Skip to content

fix(agents): retry stale session lock reports#87342

Closed
giodl73-repo wants to merge 2 commits into
mainfrom
fix-87340-session-lock-stale-stderr
Closed

fix(agents): retry stale session lock reports#87342
giodl73-repo wants to merge 2 commits into
mainfrom
fix-87340-session-lock-stale-stderr

Conversation

@giodl73-repo

@giodl73-repo giodl73-repo commented May 27, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #87340.

acquireSessionWriteLock() could surface the raw file_lock_stale error from the sidecar lock manager when a stale-lock observation raced with recovery. In agent/tool execution that raw lock-manager message can look like actionable tool stderr even though the correct session-write behavior is to keep treating it as contention until the configured acquire timeout expires.

This change keeps stale lock reports inside the session write-lock policy: retry briefly within the existing acquire timeout, and if the lock remains unavailable, raise the existing SessionWriteLockTimeoutError shape instead of leaking file lock stale for .... The latest refresh preserves current-main lock cleanup behavior and the PR's stale-lock retry/timeout-budget behavior.

No config surface or plugin surface changes.

Real behavior proof

Behavior addressed: recovered or racey stale session write-lock reports no longer surface as raw file_lock_stale / file lock stale for ... errors during session journal acquisition.

Real environment tested: local OpenClaw Linux/WSL source worktree rebased onto current main (4dad7bd93b6caae036342fd4efbdb47c217b459f).

Exact steps or command run after this patch:

OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts --reporter=dot
./node_modules/.bin/oxfmt --check src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
./node_modules/.bin/oxlint --tsconfig config/tsconfig/oxlint.core.json src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
git diff --check origin/main...HEAD
git diff --check
/home/giodl/.local/toolchains/node-v22.22.3-linux-x64/bin/codex review --base origin/main

Evidence after fix:

Test Files 1 passed (1)
Tests 49 passed (49)
All matched files use the correct format.
Found 0 warnings and 0 errors.
git diff --check origin/main...HEAD: passed
git diff --check: passed
Codex review: no actionable correctness issues found.

Observed result after fix: the stale-lock retry path keeps one acquire timeout budget, retries stale reports through session write-lock policy, and converts exhausted stale/timeout paths to the existing session lock timeout error shape.

Rebased signed head: 2696c8b89f94fa132a00fad3562b130db27473c8.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S maintainer Maintainer-authored PR labels May 27, 2026
@clawsweeper

clawsweeper Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed May 30, 2026, 11:14 PM ET / 03:14 UTC.

Summary
The PR updates session write-lock acquisition to retry file_lock_stale within one acquire timeout and adds Linux regression coverage for stale recovery and timeout-budget preservation.

PR surface: Source +55, Tests +68. Total +123 across 2 files.

Reproducibility: yes. Current main source plus the fs-safe sidecar-lock contract show file_lock_stale can escape acquireSessionWriteLock, and the PR supplies after-fix runtime output through persistCliTurnTranscript.

Review metrics: none identified.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Coordinate final ordering with the related session-lock PRs.
  • Run focused session-write-lock validation on the exact merge result before landing.

Risk before merge

  • [P1] Session write-lock acquisition gates transcript and agent session JSONL writes; a bad merge could stall writes, mistime out, or mishandle session state under contention.
  • [P1] Open related PRs fix(gateway): bound startup sidecar fanout #85399 and fix: preserve subagent delivery after lock stalls #86540 touch the same lock surface, so final ordering needs deliberate validation.
  • [P1] The supplied proof is strong for the PR head, but this recently moving session-lock surface should be checked on the exact merge result before landing.

Maintainer options:

  1. Validate The Merge Result (recommended)
    Before merge, reconcile ordering with the related session-lock PRs and rerun focused session-write-lock validation on the exact merge result.
  2. Accept The Scoped Lock Risk
    Maintainers can intentionally accept the remaining risk if they judge the current proof and clean merge state sufficient for this hot session-lock path.

Next step before merge

  • No automated repair is needed; maintainers should handle the protected-label review, sibling PR ordering, and final merge-result validation.

Security
Cleared: The diff does not change dependencies, CI, secrets, auth, network, or package execution surfaces; no concrete security or supply-chain concern was found.

Review details

Best possible solution:

Land the focused retry behavior after maintainer review confirms sibling PR ordering and focused session-lock validation passes on the exact merge result.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main source plus the fs-safe sidecar-lock contract show file_lock_stale can escape acquireSessionWriteLock, and the PR supplies after-fix runtime output through persistCliTurnTranscript.

Is this the best way to solve the issue?

Yes. Keeping stale-lock retry and timeout normalization inside acquireSessionWriteLock is the narrowest maintainable fix, provided maintainers validate the final merge result against sibling session-lock work.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 4dad7bd93b6c.

Label changes

Label justifications:

  • P2: This is a normal-priority agent/session-state bug fix with limited but real user impact under lock contention.
  • merge-risk: 🚨 session-state: The PR changes session write-lock acquisition, which protects transcript and agent session JSONL state.
  • merge-risk: 🚨 availability: Incorrect retry or timeout behavior in this path can stall agent/tool persistence under lock contention.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The PR body includes after-fix runtime output through the real CLI transcript persistence path plus focused validation on the rebased head.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix runtime output through the real CLI transcript persistence path plus focused validation on the rebased head.
Evidence reviewed

PR surface:

Source +55, Tests +68. Total +123 across 2 files.

View PR surface stats
Area Files Added Removed Net
Source 1 62 7 +55
Tests 1 68 0 +68
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 2 130 7 +123

Acceptance criteria:

  • [P1] OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts --reporter=dot.
  • [P1] ./node_modules/.bin/oxfmt --check src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts.
  • [P1] ./node_modules/.bin/oxlint --tsconfig config/tsconfig/oxlint.core.json src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts.
  • [P1] git diff --check origin/main...HEAD.

What I checked:

  • Repository policy read: Root policy treats session state, fallback behavior, and protected maintainer-labeled PRs as compatibility/merge-review sensitive; the scoped agents guide did not add a conflicting requirement for this patch. (AGENTS.md:13, 4dad7bd93b6c)
  • Current main behavior: Current acquireSessionWriteLock delegates to SESSION_LOCKS.acquire and only converts file_lock_timeout into SessionWriteLockTimeoutError; raw file_lock_stale errors are not normalized on main. (src/agents/session-write-lock.ts:862, 4dad7bd93b6c)
  • Dependency contract: @openclaw/fs-safe 0.3.0 sidecar locks throw file_lock_stale with lockPath/normalizedTargetPath when a stale lock is detected but not removed, matching the error shape this PR handles. (cb91a5e42b30)
  • PR diff reviewed: The PR catches file_lock_stale, retries with a bounded 50ms delay, preserves one caller-visible acquire timeout budget using startedAt, and converts exhausted stale/timeout paths to the existing session-lock timeout shape. (src/agents/session-write-lock.ts:816, 2696c8b89f94)
  • Regression coverage reviewed: The PR adds coverage for a stale lock report that becomes recoverable and for preserving the original timeout budget across stale retries. (src/agents/session-write-lock.test.ts:1157, 2696c8b89f94)
  • Real behavior proof supplied: The PR body and refresh comments include copied runtime output showing persistCliTurnTranscript completed through the session write-lock path without leaking raw file_lock_stale, plus focused test/format/lint/diff checks on the rebased head. (2696c8b89f94)

Likely related people:

  • steipete: Recent GitHub file history shows multiple session-lock and cleanup-related commits on the central file, including live-lock cleanup and timer-bound refactors. (role: recent area contributor; confidence: high; commits: 0b86decf948d, bf3921dab762, 00d8d7ead059; files: src/agents/session-write-lock.ts, src/agents/session-write-lock.test.ts)
  • njuboy11: Authored the recent max-hold/session-lock acquisition change that shaped current stale-lock reclaim behavior in this file. (role: feature-history contributor; confidence: medium; commits: a1eb765f0a32; files: src/agents/session-write-lock.ts, src/agents/session-write-lock.test.ts)
  • Vincent Koc: Local blame on the current acquireSessionWriteLock implementation attributes the central lock acquisition block to this commit in the shallow checkout. (role: current-main blamed contributor; confidence: medium; commits: 5367ef7bd3; files: src/agents/session-write-lock.ts)
  • wAngByg: Recent file history shows a payload-less session-lock grace fix touching the same session write-lock behavior and regression tests. (role: recent adjacent contributor; confidence: medium; commits: eb170a0adb19; files: src/agents/session-write-lock.ts, src/agents/session-write-lock.test.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal backlog priority with limited blast radius. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. labels May 27, 2026
@clawsweeper

clawsweeper Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@giodl73-repo giodl73-repo force-pushed the fix-87340-session-lock-stale-stderr branch from 80d61ba to 87a2980 Compare May 27, 2026 21:16
@giodl73-repo

Copy link
Copy Markdown
Contributor Author

Refreshed #87342 onto current main; new signed head is 87a2980be364951a200d42a0e64bf3cbc65fb1da.

Focused WSL validation after the refresh:

  • node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts --reporter=dot -> 2 files, 84 tests passed
  • ./node_modules/.bin/oxfmt --check src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
  • ./node_modules/.bin/oxlint --tsconfig config/tsconfig/oxlint.core.json src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
  • git diff --check origin/main...HEAD
  • git diff --check

This should replace the old broad CI failures from the earlier head. Remaining ClawSweeper ask is real agent/tool runtime proof for the stale-lock stderr leak. No merge performed.

@BingqingLyu

This comment was marked as spam.

@clawsweeper clawsweeper Bot added rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 29, 2026
@giodl73-repo

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

Retrying after the prior ClawSweeper/Codex review failed before summarizing the patch. Branch is mergeable at head 87a2980; no code changes in this comment.

@clawsweeper

clawsweeper Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. labels May 29, 2026
@giodl73-repo giodl73-repo force-pushed the fix-87340-session-lock-stale-stderr branch from 87a2980 to 0488f01 Compare May 29, 2026 23:38
@giodl73-repo

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

Updated #87342 with a current-main rebase and added real runtime proof for the stale-lock stderr ask.

Branch/head under test: fix-87340-session-lock-stale-stderr at 0488f01bec84bd61bc9f1c8cac13671644ac3b05.

Rebase/coordination:

  • Rebases the previous head 87a2980be364951a200d42a0e64bf3cbc65fb1da onto current origin/main.
  • Resolved the session-lock conflict by preserving both main's report-only stale cleanup reasons and this PR's stale-lock retry behavior.
  • No config, plugin, CLI, env, migration, or public surface added.

Real runtime proof:

NODE_PATH=/root/src/openclaw-proof-87342-lock/node_modules \
OPENCLAW_PROOF_HEAD=0488f01bec84bd61bc9f1c8cac13671644ac3b05 \
node --import tsx /mnt/c/tmp/openclaw-87342-runtime-proof.ts

The proof harness exercises the real agent command transcript persistence path (persistCliTurnTranscript), not just acquireSessionWriteLock directly:

  • seeds a session .jsonl.lock with the current pid and a stale/racey starttime;
  • makes the lock inspection first report the stale/recycled-pid condition, then become recoverable;
  • runs CLI-turn transcript persistence through the normal session write-lock path;
  • captures process.stderr;
  • asserts the transcript append completed, the lock was released, and no raw file_lock_stale / file lock stale text reached stderr.

Result:

{
  "status": "pass",
  "branchHead": "0488f01bec84bd61bc9f1c8cac13671644ac3b05",
  "runtimePath": "persistCliTurnTranscript",
  "resolverCalls": 5,
  "stderrContainsFileLockStale": false,
  "transcriptAppended": true,
  "lockReleased": true,
  "sessionFileSuffix": "proof.jsonl",
  "sessionEntryHasFile": true
}

Focused validation on rebased head:

OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts --reporter=dot

Test Files  2 passed (2)
Tests       92 passed (92)

./node_modules/.bin/oxfmt --check src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
All matched files use the correct format.

./node_modules/.bin/oxlint --tsconfig config/tsconfig/oxlint.core.json src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
Found 0 warnings and 0 errors.

git diff --check origin/main...HEAD
git diff --check
passed

@clawsweeper

clawsweeper Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. labels May 29, 2026
@clawsweeper clawsweeper Bot added status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 29, 2026
@giodl73-repo

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

Updated #87342 for the ClawSweeper P2 finding about stale-lock retries resetting the acquire timeout budget.

Changes on PR head 871d479d789:

  • acquireSessionWriteLock now computes the remaining acquire timeout before each SESSION_LOCKS.acquire attempt, so a file_lock_stale retry preserves one caller-visible timeout budget instead of restarting the full timeout.
  • Added a focused regression proving a stale report followed by timeout re-enters the lock manager with the remaining budget ([90, 1]) rather than the original timeout.

Validation in WSL from /root/src/openclaw-proof-87342-lock:

  • OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts --reporter=dot -> 2 files, 94 tests passed
  • ./node_modules/.bin/oxfmt --check src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts -> all matched files use the correct format
  • ./node_modules/.bin/oxlint --tsconfig config/tsconfig/oxlint.core.json src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts -> 0 warnings, 0 errors
  • git diff --check origin/main...HEAD
  • git diff --check

No public config, plugin API, CLI flag, env var, migration, or plugin contract surface was added.

@clawsweeper

clawsweeper Bot commented May 30, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels May 30, 2026
@giodl73-repo giodl73-repo force-pushed the fix-87340-session-lock-stale-stderr branch from 871d479 to 2696c8b Compare May 31, 2026 03:03

Copy link
Copy Markdown
Contributor Author

Updated #87342 onto current main and resolved the merge conflict.

New signed head: 2696c8b89f94fa132a00fad3562b130db27473c8.

What changed in this refresh:

  • rebased onto current main (4dad7bd93b6caae036342fd4efbdb47c217b459f)
  • resolved the session write-lock conflict by preserving current-main lock cleanup imports/behavior and this PR's stale-lock retry + acquire-timeout-budget behavior
  • kept the diff scoped to src/agents/session-write-lock.ts and src/agents/session-write-lock.test.ts
  • no config surface or plugin surface changes

Focused validation:

  • OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts --reporter=dot -> 1 file, 49 tests passed
  • ./node_modules/.bin/oxfmt --check src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
  • ./node_modules/.bin/oxlint --tsconfig config/tsconfig/oxlint.core.json src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
  • git diff --check origin/main...HEAD
  • git diff --check
  • Codex review found no actionable correctness issues

No merge performed.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@steipete

Copy link
Copy Markdown
Contributor

Thanks for working through this and for keeping the branch refreshed.

We landed the stale-lock policy in #88658 instead, at commit 7ca7712.

The important maintainer decision is that a live OpenClaw-owned stale lock should remain visible as a typed stale lock acquisition failure, with owner diagnostics, rather than being converted into the existing timeout path. Dead/orphaned/recycled locks are still safely reclaimable, but live-owned stale locks are not silently removed or hidden behind timeout-style retry because that can obscure the writer contention we need to diagnose.

So I am closing this as superseded by #88658, not because the report was invalid. The raw stale-lock leak path is now handled by the typed SessionWriteLockStaleError surface, and any remaining user-visible stale-lock behavior should continue in the canonical follow-up issues rather than this conflicting retry policy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling maintainer Maintainer-authored PR merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P2 Normal backlog priority with limited blast radius. proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: S status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Session write lock stale warning leaks to tool stderr during parallel tool calls

4 participants