fix(agents): retry stale session lock reports by giodl73-repo · Pull Request #87342 · openclaw/openclaw

giodl73-repo · 2026-05-27T16:03:09Z

Summary

acquireSessionWriteLock() could surface the raw file_lock_stale error from the sidecar lock manager when a stale-lock observation raced with recovery. In agent/tool execution that raw lock-manager message can look like actionable tool stderr even though the correct session-write behavior is to keep treating it as contention until the configured acquire timeout expires.

This change keeps stale lock reports inside the session write-lock policy: retry briefly within the existing acquire timeout, and if the lock remains unavailable, raise the existing SessionWriteLockTimeoutError shape instead of leaking file lock stale for .... The latest refresh preserves current-main lock cleanup behavior and the PR's stale-lock retry/timeout-budget behavior.

No config surface or plugin surface changes.

Real behavior proof

Behavior addressed: recovered or racey stale session write-lock reports no longer surface as raw file_lock_stale / file lock stale for ... errors during session journal acquisition.

Real environment tested: local OpenClaw Linux/WSL source worktree rebased onto current main (4dad7bd93b6caae036342fd4efbdb47c217b459f).

Exact steps or command run after this patch:

OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts --reporter=dot
./node_modules/.bin/oxfmt --check src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
./node_modules/.bin/oxlint --tsconfig config/tsconfig/oxlint.core.json src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
git diff --check origin/main...HEAD
git diff --check
/home/giodl/.local/toolchains/node-v22.22.3-linux-x64/bin/codex review --base origin/main

Evidence after fix:

Test Files 1 passed (1)
Tests 49 passed (49)
All matched files use the correct format.
Found 0 warnings and 0 errors.
git diff --check origin/main...HEAD: passed
git diff --check: passed
Codex review: no actionable correctness issues found.

Observed result after fix: the stale-lock retry path keeps one acquire timeout budget, retries stale reports through session write-lock policy, and converts exhausted stale/timeout paths to the existing session lock timeout error shape.

Rebased signed head: 2696c8b89f94fa132a00fad3562b130db27473c8.

clawsweeper · 2026-05-27T16:05:57Z

Codex review: needs maintainer review before merge. Reviewed May 30, 2026, 11:14 PM ET / 03:14 UTC.

Summary
The PR updates session write-lock acquisition to retry file_lock_stale within one acquire timeout and adds Linux regression coverage for stale recovery and timeout-budget preservation.

PR surface: Source +55, Tests +68. Total +123 across 2 files.

Reproducibility: yes. Current main source plus the fs-safe sidecar-lock contract show file_lock_stale can escape acquireSessionWriteLock, and the PR supplies after-fix runtime output through persistCliTurnTranscript.

Review metrics: none identified.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

Coordinate final ordering with the related session-lock PRs.
Run focused session-write-lock validation on the exact merge result before landing.

Risk before merge

[P1] Session write-lock acquisition gates transcript and agent session JSONL writes; a bad merge could stall writes, mistime out, or mishandle session state under contention.
[P1] Open related PRs fix(gateway): bound startup sidecar fanout #85399 and fix: preserve subagent delivery after lock stalls #86540 touch the same lock surface, so final ordering needs deliberate validation.
[P1] The supplied proof is strong for the PR head, but this recently moving session-lock surface should be checked on the exact merge result before landing.

Maintainer options:

Validate The Merge Result (recommended)
Before merge, reconcile ordering with the related session-lock PRs and rerun focused session-write-lock validation on the exact merge result.
Accept The Scoped Lock Risk
Maintainers can intentionally accept the remaining risk if they judge the current proof and clean merge state sufficient for this hot session-lock path.

Next step before merge

No automated repair is needed; maintainers should handle the protected-label review, sibling PR ordering, and final merge-result validation.

Security
Cleared: The diff does not change dependencies, CI, secrets, auth, network, or package execution surfaces; no concrete security or supply-chain concern was found.

Review details

Best possible solution:

Land the focused retry behavior after maintainer review confirms sibling PR ordering and focused session-lock validation passes on the exact merge result.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main source plus the fs-safe sidecar-lock contract show file_lock_stale can escape acquireSessionWriteLock, and the PR supplies after-fix runtime output through persistCliTurnTranscript.

Is this the best way to solve the issue?

Yes. Keeping stale-lock retry and timeout normalization inside acquireSessionWriteLock is the narrowest maintainable fix, provided maintainers validate the final merge result against sibling session-lock work.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 4dad7bd93b6c.

Label changes

Label justifications:

P2: This is a normal-priority agent/session-state bug fix with limited but real user impact under lock contention.
merge-risk: 🚨 session-state: The PR changes session write-lock acquisition, which protects transcript and agent session JSONL state.
merge-risk: 🚨 availability: Incorrect retry or timeout behavior in this path can stall agent/tool persistence under lock contention.
rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The PR body includes after-fix runtime output through the real CLI transcript persistence path plus focused validation on the rebased head.
proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix runtime output through the real CLI transcript persistence path plus focused validation on the rebased head.

Evidence reviewed

PR surface:

Source +55, Tests +68. Total +123 across 2 files.

View PR surface stats

Area	Files	Added	Removed	Net
Source	1	62	7	+55
Tests	1	68	0	+68
Docs	0	0	0	0
Config	0	0	0	0
Generated	0	0	0	0
Other	0	0	0	0
Total	2	130	7	+123

Acceptance criteria:

[P1] OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts --reporter=dot.
[P1] ./node_modules/.bin/oxfmt --check src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts.
[P1] ./node_modules/.bin/oxlint --tsconfig config/tsconfig/oxlint.core.json src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts.
[P1] git diff --check origin/main...HEAD.

What I checked:

Repository policy read: Root policy treats session state, fallback behavior, and protected maintainer-labeled PRs as compatibility/merge-review sensitive; the scoped agents guide did not add a conflicting requirement for this patch. (AGENTS.md:13, 4dad7bd93b6c)
Current main behavior: Current acquireSessionWriteLock delegates to SESSION_LOCKS.acquire and only converts file_lock_timeout into SessionWriteLockTimeoutError; raw file_lock_stale errors are not normalized on main. (src/agents/session-write-lock.ts:862, 4dad7bd93b6c)
Dependency contract: @openclaw/fs-safe 0.3.0 sidecar locks throw file_lock_stale with lockPath/normalizedTargetPath when a stale lock is detected but not removed, matching the error shape this PR handles. (cb91a5e42b30)
PR diff reviewed: The PR catches file_lock_stale, retries with a bounded 50ms delay, preserves one caller-visible acquire timeout budget using startedAt, and converts exhausted stale/timeout paths to the existing session-lock timeout shape. (src/agents/session-write-lock.ts:816, 2696c8b89f94)
Regression coverage reviewed: The PR adds coverage for a stale lock report that becomes recoverable and for preserving the original timeout budget across stale retries. (src/agents/session-write-lock.test.ts:1157, 2696c8b89f94)
Real behavior proof supplied: The PR body and refresh comments include copied runtime output showing persistCliTurnTranscript completed through the session write-lock path without leaking raw file_lock_stale, plus focused test/format/lint/diff checks on the rebased head. (2696c8b89f94)

Likely related people:

steipete: Recent GitHub file history shows multiple session-lock and cleanup-related commits on the central file, including live-lock cleanup and timer-bound refactors. (role: recent area contributor; confidence: high; commits: 0b86decf948d, bf3921dab762, 00d8d7ead059; files: src/agents/session-write-lock.ts, src/agents/session-write-lock.test.ts)
njuboy11: Authored the recent max-hold/session-lock acquisition change that shaped current stale-lock reclaim behavior in this file. (role: feature-history contributor; confidence: medium; commits: a1eb765f0a32; files: src/agents/session-write-lock.ts, src/agents/session-write-lock.test.ts)
Vincent Koc: Local blame on the current acquireSessionWriteLock implementation attributes the central lock acquisition block to this commit in the shallow checkout. (role: current-main blamed contributor; confidence: medium; commits: 5367ef7bd3; files: src/agents/session-write-lock.ts)
wAngByg: Recent file history shows a payload-less session-lock grace fix touching the same session write-lock behavior and regression tests. (role: recent adjacent contributor; confidence: medium; commits: eb170a0adb19; files: src/agents/session-write-lock.ts, src/agents/session-write-lock.test.ts)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

clawsweeper · 2026-05-27T16:11:14Z

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?

The egg game starts only after the PR passes the real-behavior proof check.
Before that, no creature or rarity is rolled. The treat waits for real proof.
This is still just collectible flavor: proof affects review readiness, not creature quality.

giodl73-repo · 2026-05-27T21:16:44Z

Refreshed #87342 onto current main; new signed head is 87a2980be364951a200d42a0e64bf3cbc65fb1da.

Focused WSL validation after the refresh:

node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts --reporter=dot -> 2 files, 84 tests passed
./node_modules/.bin/oxfmt --check src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
./node_modules/.bin/oxlint --tsconfig config/tsconfig/oxlint.core.json src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
git diff --check origin/main...HEAD
git diff --check

This should replace the old broad CI failures from the earlier head. Remaining ClawSweeper ask is real agent/tool runtime proof for the stale-lock stderr leak. No merge performed.

giodl73-repo · 2026-05-29T14:34:22Z

@clawsweeper re-review

Retrying after the prior ClawSweeper/Codex review failed before summarizing the patch. Branch is mergeable at head 87a2980; no code changes in this comment.

clawsweeper · 2026-05-29T14:34:26Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26643425315
Updated: 2026-05-29T14:45:04.225Z

giodl73-repo · 2026-05-29T23:38:51Z

@clawsweeper re-review

Updated #87342 with a current-main rebase and added real runtime proof for the stale-lock stderr ask.

Branch/head under test: fix-87340-session-lock-stale-stderr at 0488f01bec84bd61bc9f1c8cac13671644ac3b05.

Rebase/coordination:

Rebases the previous head 87a2980be364951a200d42a0e64bf3cbc65fb1da onto current origin/main.
Resolved the session-lock conflict by preserving both main's report-only stale cleanup reasons and this PR's stale-lock retry behavior.
No config, plugin, CLI, env, migration, or public surface added.

Real runtime proof:

NODE_PATH=/root/src/openclaw-proof-87342-lock/node_modules \
OPENCLAW_PROOF_HEAD=0488f01bec84bd61bc9f1c8cac13671644ac3b05 \
node --import tsx /mnt/c/tmp/openclaw-87342-runtime-proof.ts

The proof harness exercises the real agent command transcript persistence path (persistCliTurnTranscript), not just acquireSessionWriteLock directly:

seeds a session .jsonl.lock with the current pid and a stale/racey starttime;
makes the lock inspection first report the stale/recycled-pid condition, then become recoverable;
runs CLI-turn transcript persistence through the normal session write-lock path;
captures process.stderr;
asserts the transcript append completed, the lock was released, and no raw file_lock_stale / file lock stale text reached stderr.

Result:

{
  "status": "pass",
  "branchHead": "0488f01bec84bd61bc9f1c8cac13671644ac3b05",
  "runtimePath": "persistCliTurnTranscript",
  "resolverCalls": 5,
  "stderrContainsFileLockStale": false,
  "transcriptAppended": true,
  "lockReleased": true,
  "sessionFileSuffix": "proof.jsonl",
  "sessionEntryHasFile": true
}

Focused validation on rebased head:

OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts --reporter=dot

Test Files  2 passed (2)
Tests       92 passed (92)

./node_modules/.bin/oxfmt --check src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
All matched files use the correct format.

./node_modules/.bin/oxlint --tsconfig config/tsconfig/oxlint.core.json src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
Found 0 warnings and 0 errors.

git diff --check origin/main...HEAD
git diff --check
passed

clawsweeper · 2026-05-29T23:38:54Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26667754843
Updated: 2026-05-29T23:44:58.537Z

giodl73-repo · 2026-05-30T00:58:07Z

@clawsweeper re-review

Updated #87342 for the ClawSweeper P2 finding about stale-lock retries resetting the acquire timeout budget.

Changes on PR head 871d479d789:

acquireSessionWriteLock now computes the remaining acquire timeout before each SESSION_LOCKS.acquire attempt, so a file_lock_stale retry preserves one caller-visible timeout budget instead of restarting the full timeout.
Added a focused regression proving a stale report followed by timeout re-enters the lock manager with the remaining budget ([90, 1]) rather than the original timeout.

Validation in WSL from /root/src/openclaw-proof-87342-lock:

OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts --reporter=dot -> 2 files, 94 tests passed
./node_modules/.bin/oxfmt --check src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts -> all matched files use the correct format
./node_modules/.bin/oxlint --tsconfig config/tsconfig/oxlint.core.json src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts -> 0 warnings, 0 errors
git diff --check origin/main...HEAD
git diff --check

No public config, plugin API, CLI flag, env var, migration, or plugin contract surface was added.

clawsweeper · 2026-05-30T00:58:09Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26669974389
Updated: 2026-05-30T01:03:12.526Z

giodl73-repo · 2026-05-31T03:04:54Z

Updated #87342 onto current main and resolved the merge conflict.

New signed head: 2696c8b89f94fa132a00fad3562b130db27473c8.

What changed in this refresh:

rebased onto current main (4dad7bd93b6caae036342fd4efbdb47c217b459f)
resolved the session write-lock conflict by preserving current-main lock cleanup imports/behavior and this PR's stale-lock retry + acquire-timeout-budget behavior
kept the diff scoped to src/agents/session-write-lock.ts and src/agents/session-write-lock.test.ts
no config surface or plugin surface changes

Focused validation:

OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts --reporter=dot -> 1 file, 49 tests passed
./node_modules/.bin/oxfmt --check src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
./node_modules/.bin/oxlint --tsconfig config/tsconfig/oxlint.core.json src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts
git diff --check origin/main...HEAD
git diff --check
Codex review found no actionable correctness issues

No merge performed.

@clawsweeper re-review

clawsweeper · 2026-05-31T03:04:57Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26701810201
Updated: 2026-05-31T03:14:32.853Z

steipete · 2026-05-31T14:34:36Z

Thanks for working through this and for keeping the branch refreshed.

We landed the stale-lock policy in #88658 instead, at commit 7ca7712.

The important maintainer decision is that a live OpenClaw-owned stale lock should remain visible as a typed stale lock acquisition failure, with owner diagnostics, rather than being converted into the existing timeout path. Dead/orphaned/recycled locks are still safely reclaimable, but live-owned stale locks are not silently removed or hidden behind timeout-style retry because that can obscure the writer contention we need to diagnose.

So I am closing this as superseded by #88658, not because the report was invalid. The raw stale-lock leak path is now handled by the typed SessionWriteLockStaleError surface, and any remaining user-visible stale-lock behavior should continue in the canonical follow-up issues rather than this conflicting retry policy.

openclaw-barnacle Bot added agents Agent runtime and tooling size: S maintainer Maintainer-authored PR labels May 27, 2026

giodl73-repo force-pushed the fix-87340-session-lock-stale-stderr branch from 80d61ba to 87a2980 Compare May 27, 2026 21:16

clawsweeper Bot mentioned this pull request May 28, 2026

file_lock_stale race condition under rapid subagent spawns #87537

Closed

This comment was marked as spam.

Sign in to view

RomneyDa added dependency-guard-backfill and removed dependency-guard-backfill labels May 29, 2026

giodl73-repo force-pushed the fix-87340-session-lock-stale-stderr branch from 87a2980 to 0488f01 Compare May 29, 2026 23:38

clawsweeper Bot added status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 29, 2026

giodl73-repo added 2 commits May 30, 2026 19:58

fix(agents): retry stale session lock reports

8d16c24

fix(agents): preserve stale lock acquire budget

2696c8b

giodl73-repo force-pushed the fix-87340-session-lock-stale-stderr branch from 871d479 to 2696c8b Compare May 31, 2026 03:03

steipete closed this May 31, 2026

itanyplus mentioned this pull request Jun 10, 2026

[Bug]: Session write-lock timeouts block subagent delivery lanes #86538

Open

Uh oh!

Conversation

giodl73-repo commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Real behavior proof

Uh oh!

clawsweeper Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clawsweeper Bot commented May 27, 2026

Uh oh!

giodl73-repo commented May 27, 2026

Uh oh!

This comment was marked as spam.

giodl73-repo commented May 29, 2026

Uh oh!

clawsweeper Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

giodl73-repo commented May 29, 2026

Uh oh!

clawsweeper Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

giodl73-repo commented May 30, 2026

Uh oh!

clawsweeper Bot commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

giodl73-repo commented May 31, 2026

Uh oh!

clawsweeper Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

steipete commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

giodl73-repo commented May 27, 2026 •

edited

Loading

clawsweeper Bot commented May 27, 2026 •

edited

Loading

clawsweeper Bot commented May 29, 2026 •

edited

Loading

clawsweeper Bot commented May 29, 2026 •

edited

Loading

clawsweeper Bot commented May 30, 2026 •

edited

Loading

clawsweeper Bot commented May 31, 2026 •

edited

Loading