fix(agents): bound models.json write-lock wait — steal from unsettled holders (tulgey#238)#11
Conversation
…tled holders tulgey#238 root cause (confirmed by core-dump heap analysis of a live stall): withModelsJsonWriteLock chains each caller onto the prior holder's gate with no timeout. A holder that never settles (turn aborted while run() awaited a hung provider call) leaves the gate unresolved forever, so every subsequent model-resolution in the process parks on a dead promise — silent per-process deadlock: no error, no timeout, no log; runs stall at embedded_run:started. The wait on the prior holder is now bounded (OPENCLAW_MODELS_JSON_LOCK_WAIT_MS, default 60s); on expiry the caller logs and steals the lock. A rare concurrent models.json write (atomic temp-file rename) is recoverable; a deadlocked agent is not. Same family as the session-lock fixes (#6, upstream openclaw#87278). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 33 minutes and 1 second. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…ch (#12) * fix(queue): reclaim phantom lane slots and surface wedged dispatch On the membrane deployment, WhatsApp DM agent turns intermittently stalled forever at embedded_run:started: the liveness snapshot showed active=0 waiting=0 queued=1 while one entry sat queued, the stage-stall watchdog logged nothing, and no drainLane/lane-wait/lane-task logs fired during the stall (tulgey#238). Root cause (hypothesis 1, reproduced): the embedded-run session lane is enqueued with NO finite taskTimeoutMs — only the inner global-lane enqueue carried withLaneTimeout. A session-lane task whose promise never settles holds its activeTaskIds slot forever. Subsequent enqueues pump, but the while-loop sees activeTaskIds.size >= maxConcurrent and returns silently, so queued work never dispatches and nothing logs. The stuck-session recovery only resets the session lane (released=0 when the slot is the phantom), so the loop re-enters. Same "unsettled promise leaves a slot held" family as #11. Fix (minimal + observable): - Track per-active-task start + progress timestamps; in pump(), before giving up on a saturated lane, reclaim any slot held past a 10-minute no-progress ceiling (bumps generation so the late completion is ignored, mirroring the #11 steal pattern) and warn `lane slot reclaimed`. - Emit a throttled `lane dispatch blocked` warn at the previously-silent decision point (lane, active, maxConcurrent, queued, oldestActiveAgeMs). - Apply withLaneTimeout to the embedded-run session lane too, so the outer slot self-expires like the global one already did. Tests (both hang/fail pre-fix): "reclaims a phantom slot whose untimed task never settles" and "warns at the dispatch-blocked decision point". Refs imperfect-co/tulgey#238 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(queue): reclaim only stalled taskIds, no generation bump Address cr review: bumping state.generation in reclaimStalledSlots invalidated every in-flight task on the lane, not just the phantom — on a multi-concurrency lane a healthy sibling's completion would be silently dropped (completeTask returns false, no pump/wake). Reclaim now removes only the stalled taskIds; completeTask no-ops via a membership check so a late completion of a reclaimed task frees nothing and does not re-pump. Adds a multi-concurrency regression test: reclaiming one stalled slot leaves a healthy sibling task intact. Refs imperfect-co/tulgey#238 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(queue): make sibling-reclaim test prove only the phantom is reclaimed Address cr review: give the healthy sibling a fresh taskTimeoutProgressAtMs callback so the ceiling check spares it (previously it would also look stalled, weakening the assertion), capture the trigger promise instead of fire-and-forget, and assert the post-reclaim lane state synchronously before the queued task's microtask frees its slot. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ettled flushes (#15) A flush whose downstream work never settles (hung pre-agent media call in processMessage) parked every later inbound for that chat key on a dead promise chain: messages logged as Inbound, never reached the command queue, and the chat went permanently silent until restart re-primed the same poison. Fourth member of the unbounded await-the-prior-holder family (session lock #6, models.json #11, lane slot #12). Bounded at OPENCLAW_INBOUND_CHAIN_WAIT_MS (default 5 min) with a warn naming the chat key. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Root cause of the per-session/process stall loops (tulgey#238), confirmed via heap analysis of a core captured mid-stall:
withModelsJsonWriteLockawaits the prior holder unboundedly; one unsettled holder deadlocks every later model-resolution in the process, with zero observability.Fix: bounded wait (60s default, env-tunable) + lock steal with a warn log. Regression tests: never-settling holder no longer deadlocks; normal serialization preserved (28/28 incl. existing write-serialization suite). tsgo clean.
Pairs with #10 (stage-stall watchdog) — if any sibling lock has the same shape, the journal now names it.
🤖 Generated with Claude Code