Skip to content

fix(agents): bound models.json write-lock wait — steal from unsettled holders (tulgey#238)#11

Merged
matin merged 1 commit into
mainfrom
fix-models-lock-deadlock
Jun 5, 2026
Merged

fix(agents): bound models.json write-lock wait — steal from unsettled holders (tulgey#238)#11
matin merged 1 commit into
mainfrom
fix-models-lock-deadlock

Conversation

@matin

@matin matin commented Jun 5, 2026

Copy link
Copy Markdown
Owner

Root cause of the per-session/process stall loops (tulgey#238), confirmed via heap analysis of a core captured mid-stall: withModelsJsonWriteLock awaits the prior holder unboundedly; one unsettled holder deadlocks every later model-resolution in the process, with zero observability.

Fix: bounded wait (60s default, env-tunable) + lock steal with a warn log. Regression tests: never-settling holder no longer deadlocks; normal serialization preserved (28/28 incl. existing write-serialization suite). tsgo clean.

Pairs with #10 (stage-stall watchdog) — if any sibling lock has the same shape, the journal now names it.

🤖 Generated with Claude Code

…tled holders

tulgey#238 root cause (confirmed by core-dump heap analysis of a live stall):
withModelsJsonWriteLock chains each caller onto the prior holder's gate with
no timeout. A holder that never settles (turn aborted while run() awaited a
hung provider call) leaves the gate unresolved forever, so every subsequent
model-resolution in the process parks on a dead promise — silent per-process
deadlock: no error, no timeout, no log; runs stall at embedded_run:started.

The wait on the prior holder is now bounded (OPENCLAW_MODELS_JSON_LOCK_WAIT_MS,
default 60s); on expiry the caller logs and steals the lock. A rare concurrent
models.json write (atomic temp-file rename) is recoverable; a deadlocked agent
is not. Same family as the session-lock fixes (#6, upstream openclaw#87278).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@matin matin merged commit 9a1a211 into main Jun 5, 2026
@matin matin deleted the fix-models-lock-deadlock branch June 5, 2026 16:53
@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@matin, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 33 minutes and 1 second. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 08b97c6b-6440-4293-afe3-d6a696b35094

📥 Commits

Reviewing files that changed from the base of the PR and between d63adad and 546e669.

📒 Files selected for processing (2)
  • src/agents/models-config.lock-steal.test.ts
  • src/agents/models-config.ts
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-models-lock-deadlock

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

matin added a commit that referenced this pull request Jun 5, 2026
…ch (#12)

* fix(queue): reclaim phantom lane slots and surface wedged dispatch

On the membrane deployment, WhatsApp DM agent turns intermittently
stalled forever at embedded_run:started: the liveness snapshot showed
active=0 waiting=0 queued=1 while one entry sat queued, the stage-stall
watchdog logged nothing, and no drainLane/lane-wait/lane-task logs fired
during the stall (tulgey#238).

Root cause (hypothesis 1, reproduced): the embedded-run session lane is
enqueued with NO finite taskTimeoutMs — only the inner global-lane enqueue
carried withLaneTimeout. A session-lane task whose promise never settles
holds its activeTaskIds slot forever. Subsequent enqueues pump, but the
while-loop sees activeTaskIds.size >= maxConcurrent and returns silently,
so queued work never dispatches and nothing logs. The stuck-session
recovery only resets the session lane (released=0 when the slot is the
phantom), so the loop re-enters. Same "unsettled promise leaves a slot
held" family as #11.

Fix (minimal + observable):
- Track per-active-task start + progress timestamps; in pump(), before
  giving up on a saturated lane, reclaim any slot held past a 10-minute
  no-progress ceiling (bumps generation so the late completion is ignored,
  mirroring the #11 steal pattern) and warn `lane slot reclaimed`.
- Emit a throttled `lane dispatch blocked` warn at the previously-silent
  decision point (lane, active, maxConcurrent, queued, oldestActiveAgeMs).
- Apply withLaneTimeout to the embedded-run session lane too, so the outer
  slot self-expires like the global one already did.

Tests (both hang/fail pre-fix): "reclaims a phantom slot whose untimed
task never settles" and "warns at the dispatch-blocked decision point".

Refs imperfect-co/tulgey#238

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(queue): reclaim only stalled taskIds, no generation bump

Address cr review: bumping state.generation in reclaimStalledSlots
invalidated every in-flight task on the lane, not just the phantom — on a
multi-concurrency lane a healthy sibling's completion would be silently
dropped (completeTask returns false, no pump/wake). Reclaim now removes
only the stalled taskIds; completeTask no-ops via a membership check so a
late completion of a reclaimed task frees nothing and does not re-pump.

Adds a multi-concurrency regression test: reclaiming one stalled slot
leaves a healthy sibling task intact.

Refs imperfect-co/tulgey#238

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(queue): make sibling-reclaim test prove only the phantom is reclaimed

Address cr review: give the healthy sibling a fresh taskTimeoutProgressAtMs
callback so the ceiling check spares it (previously it would also look
stalled, weakening the assertion), capture the trigger promise instead of
fire-and-forget, and assert the post-reclaim lane state synchronously
before the queued task's microtask frees its slot.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
matin added a commit that referenced this pull request Jun 5, 2026
…ettled flushes (#15)

A flush whose downstream work never settles (hung pre-agent media call in
processMessage) parked every later inbound for that chat key on a dead promise
chain: messages logged as Inbound, never reached the command queue, and the
chat went permanently silent until restart re-primed the same poison. Fourth
member of the unbounded await-the-prior-holder family (session lock #6,
models.json #11, lane slot #12). Bounded at OPENCLAW_INBOUND_CHAIN_WAIT_MS
(default 5 min) with a warn naming the chat key.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant