fix(agents): bound models.json write-lock wait — steal from unsettled holders (tulgey#238) by matin · Pull Request #11 · matin/openclaw

matin · 2026-06-05T16:53:03Z

Root cause of the per-session/process stall loops (tulgey#238), confirmed via heap analysis of a core captured mid-stall: withModelsJsonWriteLock awaits the prior holder unboundedly; one unsettled holder deadlocks every later model-resolution in the process, with zero observability.

Fix: bounded wait (60s default, env-tunable) + lock steal with a warn log. Regression tests: never-settling holder no longer deadlocks; normal serialization preserved (28/28 incl. existing write-serialization suite). tsgo clean.

Pairs with #10 (stage-stall watchdog) — if any sibling lock has the same shape, the journal now names it.

🤖 Generated with Claude Code

…tled holders tulgey#238 root cause (confirmed by core-dump heap analysis of a live stall): withModelsJsonWriteLock chains each caller onto the prior holder's gate with no timeout. A holder that never settles (turn aborted while run() awaited a hung provider call) leaves the gate unresolved forever, so every subsequent model-resolution in the process parks on a dead promise — silent per-process deadlock: no error, no timeout, no log; runs stall at embedded_run:started. The wait on the prior holder is now bounded (OPENCLAW_MODELS_JSON_LOCK_WAIT_MS, default 60s); on expiry the caller logs and steals the lock. A rare concurrent models.json write (atomic temp-file rename) is recoverable; a deadlocked agent is not. Same family as the session-lock fixes (#6, upstream openclaw#87278). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-05T16:53:10Z

Warning

Review limit reached

@matin, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 33 minutes and 1 second. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 08b97c6b-6440-4293-afe3-d6a696b35094

📥 Commits

Reviewing files that changed from the base of the PR and between d63adad and 546e669.

📒 Files selected for processing (2)

src/agents/models-config.lock-steal.test.ts
src/agents/models-config.ts

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix-models-lock-deadlock

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…ch (#12) * fix(queue): reclaim phantom lane slots and surface wedged dispatch On the membrane deployment, WhatsApp DM agent turns intermittently stalled forever at embedded_run:started: the liveness snapshot showed active=0 waiting=0 queued=1 while one entry sat queued, the stage-stall watchdog logged nothing, and no drainLane/lane-wait/lane-task logs fired during the stall (tulgey#238). Root cause (hypothesis 1, reproduced): the embedded-run session lane is enqueued with NO finite taskTimeoutMs — only the inner global-lane enqueue carried withLaneTimeout. A session-lane task whose promise never settles holds its activeTaskIds slot forever. Subsequent enqueues pump, but the while-loop sees activeTaskIds.size >= maxConcurrent and returns silently, so queued work never dispatches and nothing logs. The stuck-session recovery only resets the session lane (released=0 when the slot is the phantom), so the loop re-enters. Same "unsettled promise leaves a slot held" family as #11. Fix (minimal + observable): - Track per-active-task start + progress timestamps; in pump(), before giving up on a saturated lane, reclaim any slot held past a 10-minute no-progress ceiling (bumps generation so the late completion is ignored, mirroring the #11 steal pattern) and warn `lane slot reclaimed`. - Emit a throttled `lane dispatch blocked` warn at the previously-silent decision point (lane, active, maxConcurrent, queued, oldestActiveAgeMs). - Apply withLaneTimeout to the embedded-run session lane too, so the outer slot self-expires like the global one already did. Tests (both hang/fail pre-fix): "reclaims a phantom slot whose untimed task never settles" and "warns at the dispatch-blocked decision point". Refs imperfect-co/tulgey#238 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(queue): reclaim only stalled taskIds, no generation bump Address cr review: bumping state.generation in reclaimStalledSlots invalidated every in-flight task on the lane, not just the phantom — on a multi-concurrency lane a healthy sibling's completion would be silently dropped (completeTask returns false, no pump/wake). Reclaim now removes only the stalled taskIds; completeTask no-ops via a membership check so a late completion of a reclaimed task frees nothing and does not re-pump. Adds a multi-concurrency regression test: reclaiming one stalled slot leaves a healthy sibling task intact. Refs imperfect-co/tulgey#238 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(queue): make sibling-reclaim test prove only the phantom is reclaimed Address cr review: give the healthy sibling a fresh taskTimeoutProgressAtMs callback so the ceiling check spares it (previously it would also look stalled, weakening the assertion), capture the trigger promise instead of fire-and-forget, and assert the post-reclaim lane state synchronously before the queued task's microtask frees its slot. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ettled flushes (#15) A flush whose downstream work never settles (hung pre-agent media call in processMessage) parked every later inbound for that chat key on a dead promise chain: messages logged as Inbound, never reached the command queue, and the chat went permanently silent until restart re-primed the same poison. Fourth member of the unbounded await-the-prior-holder family (session lock #6, models.json #11, lane slot #12). Bounded at OPENCLAW_INBOUND_CHAIN_WAIT_MS (default 5 min) with a warn naming the chat key. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

matin merged commit 9a1a211 into main Jun 5, 2026

matin deleted the fix-models-lock-deadlock branch June 5, 2026 16:53

matin mentioned this pull request Jun 5, 2026

fix(queue): reclaim phantom lane slots that wedge embedded-run dispatch #12

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agents): bound models.json write-lock wait — steal from unsettled holders (tulgey#238)#11

fix(agents): bound models.json write-lock wait — steal from unsettled holders (tulgey#238)#11
matin merged 1 commit into
mainfrom
fix-models-lock-deadlock

matin commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026

Review limit reached

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

matin commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026

Review limit reached

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant