Lane queue has no task-level timeout — hung promises permanently block session lanes #48488
Open
BingqingLyu/openclaw #899
Labels
- `P2`: Normal backlog priority with limited blast radius.
- `clawsweeper:linked-pr-open`: ClawSweeper found an open linked pull request for this issue.
- `clawsweeper:no-new-fix-pr`: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
- `clawsweeper:not-repro-on-main`: ClawSweeper found high-confidence evidence that this issue no longer reproduces on main.
- `impact:message-loss`: Channel message delivery can be lost, duplicated, or misrouted.
- `impact:session-state`: Session, memory, transcript, context, or agent state can drift or corrupt.
- `issue-rating: 🦪 silver shellfish`: Thin issue quality; more reproduction proof or environment detail is needed.
- `stale`: Marked as stale due to inactivity.
Summary
Session lanes in the gateway's command queue (`src/process/command-queue.ts`) have no task-level timeout. If an enqueued task's promise never settles, the lane is permanently jammed with no automatic recovery. This affects all messaging channels and cron.
Symptom
Webchat session stops responding permanently. Gateway is healthy (memory, CPU, `/health` all normal), but the session lane is dead.
New messages queue up behind the stuck task and wait forever. The session never recovers without a gateway restart.
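The jam can be illustrated in isolation with a minimal sketch. The `Lane` class below is a hypothetical stand-in written for this example, not the code in `src/process/command-queue.ts`. It only assumes the behavior described in this issue: one active task per lane, and the next task is pumped only after the current one settles. A single task that never settles keeps the second task queued:

```typescript
// Hypothetical stand-in for a session lane; NOT the real command-queue.ts.
type Task<T> = () => Promise<T>;

interface QueueEntry {
  id: number;
  task: Task<unknown>;
  resolve: (value: unknown) => void;
  reject: (reason: unknown) => void;
}

class Lane {
  readonly maxConcurrent = 1; // session lanes run one task at a time
  readonly activeTaskIds = new Set<number>();
  readonly queue: QueueEntry[] = [];
  private nextId = 1;

  enqueue<T>(task: Task<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.queue.push({
        id: this.nextId++,
        task,
        resolve: resolve as (value: unknown) => void,
        reject,
      });
      this.pump();
    });
  }

  private pump(): void {
    while (this.activeTaskIds.size < this.maxConcurrent && this.queue.length > 0) {
      const entry = this.queue.shift()!;
      this.activeTaskIds.add(entry.id);
      void this.run(entry);
    }
  }

  private async run(entry: QueueEntry): Promise<void> {
    try {
      // No timeout here: if this promise never settles, nothing below runs.
      entry.resolve(await entry.task());
    } catch (err) {
      entry.reject(err);
    } finally {
      this.activeTaskIds.delete(entry.id);
      this.pump();
    }
  }
}

async function demo(): Promise<string> {
  const lane = new Lane();
  void lane.enqueue(() => new Promise<never>(() => {})); // never settles
  const second = lane.enqueue(async () => "second task ran");
  // Probe: report the lane state after 50 ms instead of waiting forever.
  const probe = new Promise<string>((resolve) =>
    setTimeout(() => resolve("second task still queued"), 50),
  );
  return Promise.race([second, probe]);
}

demo().then((result) => console.log(result)); // logs "second task still queued"
```

The gateway process itself stays responsive in this situation, which matches the healthy `/health` output: only the lane's bookkeeping is stuck.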
Root Cause
In `pump()` (`src/process/command-queue.ts`, lines 118-143), each dequeued task is awaited with no timeout protection.
If `entry.task()` never resolves or rejects:
- `completeTask()` never runs
- `activeTaskIds` retains the stale task ID
- `pump()` is never called again
- With `maxConcurrent=1` (hardcoded in `getLaneState`, lines 67-74), the lane is permanently blocked
The only recovery is `resetAllLanes()` (lines 251-266), which requires a SIGUSR1 gateway restart. There is no automatic detection, health check, or recovery mechanism for stuck lanes.
How It Happens
Any scenario where an enqueued task's promise hangs:
- `AbortSignal` from `scheduleAbortTimer` fires but the underlying HTTP fetch doesn't honor it
The agent runner's internal timeout (`scheduleAbortTimer` in `run/attempt.ts`) only works if the task code checks the abort signal. If the underlying fetch call is hung at the OS/socket level, the abort signal may not terminate it, and the lane queue's `await entry.task()` remains suspended indefinitely.
Affected Channels
All channels route through the same lane system via `enqueueCommandInLane` with session lanes (`maxConcurrent=1`):
- Cron (`CommandLane.Cron`)
Environment
- `{"ok":true,"status":"live"}`
- `stale-socket` at 20:25:19, shortly before the lane jammed
Related Issues
- `cron run` always times out at 30s
These all share the same underlying pattern: work enters the lane queue and never completes, with no automatic recovery.
Suggested Fix Directions
For maintainer consideration — several approaches could address this, each with trade-offs:
a) `Promise.race` wrapper in `pump()`. Race each task against a configurable timeout promise. If the timeout wins, reject the entry, clear `activeTaskIds`, and call `pump()`. Simple and targeted, but creates "zombie task" concerns (the original hung promise keeps running in the background).
b) Periodic lane health monitor. A background interval that checks for lanes where `activeTaskIds.size > 0` and no progress has been made for N seconds. Could auto-clear stale tasks or trigger a reset like `resetAllLanes()` scoped to just the affected lane. More defensive but adds runtime complexity.
c) Better abort signal propagation. Ensure `scheduleAbortTimer` actually terminates the underlying HTTP fetch (via `AbortController` on the fetch call itself, not just the agent-level signal). Fixes the root cause but requires changes deeper in the API call stack.
d) Combination. Defense in depth: fix abort propagation (c) to prevent most hangs, add a queue-level timeout (a) as a safety net, and a health monitor (b) as a last resort.
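As a rough illustration of how (a) and (c) could work together, here is a minimal sketch. `runWithDeadline` and the `TaskTimeoutError` name are hypothetical and written for this example; nothing here is taken from the actual `pump()` or `scheduleAbortTimer` code. The helper races the task against a timer so the caller is always released, and also aborts a signal so a cooperative task can stop its own work:

```typescript
// Hypothetical helper: timer race (option a) plus abort propagation (option c).
async function runWithDeadline<T>(
  task: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`lane task exceeded ${timeoutMs}ms`);
      err.name = "TaskTimeoutError";
      reject(err); // releases the caller even if the task ignores the signal
      controller.abort(); // lets a cooperative task cancel its own work
    }, timeoutMs);
  });
  try {
    return await Promise.race([task(controller.signal), deadline]);
  } finally {
    clearTimeout(timer); // no stray timer once the task or deadline has won
  }
}

async function demo(): Promise<string[]> {
  const events: string[] = [];

  // Uncooperative task: ignores the signal and never settles (a "zombie").
  await runWithDeadline(() => new Promise<never>(() => {}), 20).catch((err: Error) =>
    events.push(`uncooperative: ${err.name}`),
  );

  // Cooperative task: observes the abort and stops.
  await runWithDeadline(
    (signal) =>
      new Promise<never>((_, reject) => {
        signal.addEventListener("abort", () => {
          events.push("cooperative: saw abort");
          reject(new Error("aborted"));
        });
      }),
    20,
  ).catch((err: Error) => events.push(`cooperative: ${err.name}`));

  return events;
}
```

In both cases the awaiting code gets a rejection after the deadline, so a queue built on this helper could run its completion and pump steps. Only the cooperative task actually stops working. A task built on `fetch` can be made cooperative by passing the signal as `fetch(url, { signal })`; whether the existing agent runner already does so is not established by this issue.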
Open Questions
- Is the lack of a timeout in `pump()` deliberate? (e.g., to avoid killing legitimately long-running tasks like compaction)
- Would a `taskTimeoutMs` option on `enqueueCommandInLane` be acceptable?
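To make the second question concrete, here is one possible shape for an opt-in timeout, as a sketch only. `SessionLane`, `EnqueueOptions`, and `settle` are hypothetical names invented for this example, and the real `enqueueCommandInLane` signature may look quite different. The point is that tasks enqueued without `taskTimeoutMs` keep today's wait-indefinitely behavior, so long-running work such as compaction would not be affected unless a caller opts in:

```typescript
// Hypothetical option bag; omitting taskTimeoutMs means "wait indefinitely".
interface EnqueueOptions {
  taskTimeoutMs?: number;
}

// Hypothetical single-concurrency lane, not the real command queue.
class SessionLane {
  private active = false;
  private readonly pending: Array<() => Promise<void>> = [];

  enqueue<T>(task: () => Promise<T>, opts: EnqueueOptions = {}): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.pending.push(() => settle(task, opts.taskTimeoutMs).then(resolve, reject));
      this.pump();
    });
  }

  private pump(): void {
    if (this.active) return;
    const next = this.pending.shift();
    if (!next) return;
    this.active = true;
    void next().finally(() => {
      this.active = false; // always release the lane, then continue
      this.pump();
    });
  }
}

// Runs the task as-is, or races it against a timer when a timeout is given.
async function settle<T>(task: () => Promise<T>, timeoutMs?: number): Promise<T> {
  if (timeoutMs === undefined) return task();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`task timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([task(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

async function demo(): Promise<string[]> {
  const lane = new SessionLane();
  const events: string[] = [];
  const hung = lane
    .enqueue(() => new Promise<never>(() => {}), { taskTimeoutMs: 20 })
    .catch(() => {
      events.push("hung task rejected");
    });
  const next = lane.enqueue(async () => {
    events.push("next task ran");
  });
  await Promise.all([hung, next]);
  return events;
}
```

With the option set, the hung task is rejected after 20 ms and the next task in the lane runs. As with any race-based timeout, the hung promise itself is not cancelled.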