Lane queue has no task-level timeout — hung promises permanently block session lanes #48488

@kyletabor

Summary

Session lanes in the gateway's command queue (src/process/command-queue.ts) have no task-level timeout. If an enqueued task's promise never settles, the lane is permanently jammed with no automatic recovery. This affects all messaging channels and cron.

Symptom

Webchat session stops responding permanently. Gateway is healthy (memory, CPU, /health all normal), but the session lane is dead:

[diagnostic] lane wait exceeded: lane=session:agent:main:treehouse-sess-mmthhrfo-6vvcjy waitedMs=215576 queueAhead=2
[diagnostic] lane wait exceeded: lane=session:agent:main:treehouse-sess-mmthhrfo-6vvcjy waitedMs=178957 queueAhead=1
[diagnostic] lane wait exceeded: lane=session:agent:main:treehouse-sess-mmthhrfo-6vvcjy waitedMs=91671  queueAhead=1
http: proxy error: context canceled
http: proxy error: context canceled

New messages queue up behind the stuck task and wait forever. The session never recovers without a gateway restart.

Root Cause

In pump() (src/process/command-queue.ts, lines 118-143), each dequeued task is awaited with no timeout protection:

void (async () => {
  try {
    const result = await entry.task();  // <-- no timeout, hangs forever if promise never settles
    const completedCurrentGeneration = completeTask(state, taskId, taskGeneration);
    if (completedCurrentGeneration) {
      pump();  // <-- never reached
    }
    entry.resolve(result);
  } catch (err) {
    // ... also never reached
  }
})();

If entry.task() never resolves or rejects:

  1. completeTask() never runs
  2. activeTaskIds retains the stale task ID
  3. pump() is never called again
  4. Since session lanes are maxConcurrent=1 (hardcoded in getLaneState, lines 67-74), the lane is permanently blocked
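The sequence above can be reproduced with a minimal model of a single-concurrency lane. This is a simplified sketch, not the code in command-queue.ts: MiniLane, enqueue, and the active flag are hypothetical names, with active standing in for activeTaskIds on a lane that has one slot.

```typescript
// Simplified stand-in for a session lane with maxConcurrent=1.
// MiniLane is a hypothetical name; it is not the real command-queue.ts code.
type Entry = {
  task: () => Promise<unknown>;
  resolve: (value: unknown) => void;
  reject: (reason: unknown) => void;
};

class MiniLane {
  private queue: Entry[] = [];
  private active = false; // plays the role of activeTaskIds for one slot

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.queue.push({
        task,
        resolve: resolve as (value: unknown) => void,
        reject,
      });
      this.pump();
    });
  }

  get queued(): number {
    return this.queue.length;
  }

  private pump(): void {
    if (this.active) return; // the only slot is taken
    const entry = this.queue.shift();
    if (!entry) return;
    this.active = true;
    void (async () => {
      try {
        const result = await entry.task(); // suspends forever if the task hangs
        entry.resolve(result);
      } catch (err) {
        entry.reject(err);
      } finally {
        this.active = false; // only reached once the task settles
        this.pump();
      }
    })();
  }
}

// A task whose promise never settles, followed by a healthy one.
const lane = new MiniLane();
let secondRan = false;
void lane.enqueue(() => new Promise<never>(() => {}));
void lane.enqueue(async () => {
  secondRan = true;
});
```

In this model the second task never starts: the slot is never released, so nothing calls pump() again and the entry stays in the queue.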

The only recovery is resetAllLanes() (lines 251-266), which requires a SIGUSR1 gateway restart. There is no automatic detection, health check, or recovery mechanism for stuck lanes.
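For context on what automatic detection would need: a start timestamp per active task is enough to tell a stuck lane from a busy one. The check below is a rough sketch under that assumption. LaneSnapshot, activeTasks, and findStuckTasks are hypothetical names, and the real lane state may be shaped differently.

```typescript
// Hypothetical shape: the real lane state is not shown in this report.
interface LaneSnapshot {
  lane: string;
  // taskId -> wall-clock time (ms) at which the task was started
  activeTasks: Map<number, number>;
}

interface StuckTask {
  lane: string;
  taskId: number;
  runningMs: number;
}

// Returns every active task that has been running longer than maxTaskMs.
function findStuckTasks(
  lanes: LaneSnapshot[],
  nowMs: number,
  maxTaskMs: number,
): StuckTask[] {
  const stuck: StuckTask[] = [];
  lanes.forEach(({ lane, activeTasks }) => {
    activeTasks.forEach((startedAtMs, taskId) => {
      const runningMs = nowMs - startedAtMs;
      if (runningMs > maxTaskMs) {
        stuck.push({ lane, taskId, runningMs });
      }
    });
  });
  return stuck;
}
```

A check like this only reports candidates. Whether to log, reject the entry, or reset the lane is a separate policy decision.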

How It Happens

Any scenario where an enqueued task's promise hangs:

  • Upstream API call (Anthropic, OpenAI, etc.) hangs without responding or erroring
  • WebSocket connection drops mid-request without clean error propagation
  • AbortSignal from scheduleAbortTimer fires but the underlying HTTP fetch doesn't honor it
  • Unhandled exception path in async task code that prevents the promise from settling

The agent runner's internal timeout (scheduleAbortTimer in run/attempt.ts) only works if the task code checks the abort signal. If the underlying fetch call is hung at the OS/socket level, the abort signal may not terminate it, and the lane queue's await entry.task() remains suspended indefinitely.
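The distinction matters because an AbortSignal is only a notification: a promise settles on abort only if some code is listening for it. A small sketch of the difference, using hypothetical task functions (ignoresSignal, honorsSignal) and a hypothetical track helper rather than the real agent runner:

```typescript
// Hypothetical tasks; neither is taken from run/attempt.ts.

// Never looks at the signal, so aborting has no effect on its promise.
function ignoresSignal(_signal: AbortSignal): Promise<string> {
  return new Promise<string>(() => {
    /* simulates a request that never returns */
  });
}

// Rejects as soon as the signal aborts.
function honorsSignal(signal: AbortSignal): Promise<string> {
  return new Promise<string>((_resolve, reject) => {
    if (signal.aborted) {
      reject(signal.reason);
      return;
    }
    signal.addEventListener("abort", () => reject(signal.reason), {
      once: true,
    });
  });
}

type Outcome = "pending" | "settled";

// Reports whether a promise has settled, without awaiting it.
function track(p: Promise<unknown>): () => Outcome {
  let outcome: Outcome = "pending";
  p.then(
    () => {
      outcome = "settled";
    },
    () => {
      outcome = "settled";
    },
  );
  return () => outcome;
}

const controller = new AbortController();
const ignored = track(ignoresSignal(controller.signal));
const honored = track(honorsSignal(controller.signal));
controller.abort(); // what an abort timer would do when it fires
```

After the abort, only the task that listens for the signal settles. The other stays pending, and anything awaiting it, such as a lane's pump loop, stays suspended with it.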

Affected Channels

All channels route through the same lane system via enqueueCommandInLane with session lanes (maxConcurrent=1):

  • WhatsApp (web provider)
  • Telegram
  • Discord
  • Webchat/Treehouse
  • Cron jobs (via CommandLane.Cron)

Environment

  • OpenClaw v2026.3.12 (Docker, linux/arm64)
  • Node 22
  • Gateway healthy during incident: {"ok":true,"status":"live"}
  • No OOM, no CPU spike, no rate limiting
  • Concurrent event: WhatsApp health-monitor restarted due to stale-socket at 20:25:19, shortly before the lane jammed

Related Issues

These all share the same underlying pattern: work enters the lane queue and never completes, with no automatic recovery.

Suggested Fix Directions

For maintainer consideration — several approaches could address this, each with trade-offs:

a) Promise.race wrapper in pump() — Race each task against a configurable timeout promise. If the timeout wins, reject the entry, clear activeTaskIds, and call pump(). Simple and targeted, but creates "zombie task" concerns (the original hung promise keeps running in the background).

b) Periodic lane health monitor — A background interval that checks for lanes where activeTaskIds.size > 0 and no progress has been made for N seconds. Could auto-clear stale tasks or trigger resetAllLanes() for just the affected lane. More defensive but adds runtime complexity.

c) Better abort signal propagation — Ensure scheduleAbortTimer actually terminates the underlying HTTP fetch (via AbortController on the fetch call itself, not just the agent-level signal). Fixes the root cause but requires changes deeper in the API call stack.

d) Combination — Defense in depth: fix abort propagation (c) to prevent most hangs, add a queue-level timeout (a) as a safety net, and a health monitor (b) as a last resort.
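As a rough illustration of option (a), a task can be raced against a timer so that the awaiting code always regains control. This is a sketch under assumptions: withTimeout, TaskTimeoutError, and timeoutMs are hypothetical names, and a real change to pump() would still need to clear activeTaskIds and decide what to do about the task that keeps running.

```typescript
// Hypothetical error type used to tell a timeout apart from a task failure.
class TaskTimeoutError extends Error {
  readonly timeoutMs: number;
  constructor(timeoutMs: number) {
    super(`task did not settle within ${timeoutMs}ms`);
    this.name = "TaskTimeoutError";
    this.timeoutMs = timeoutMs;
  }
}

function withTimeout<T>(task: () => Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new TaskTimeoutError(timeoutMs)), timeoutMs);
  });
  // Starting the task via then() turns a synchronous throw into a rejection.
  const running = Promise.resolve().then(task);
  // Promise.race settles with whichever promise settles first. The loser is
  // not cancelled: a hung task keeps whatever resources it holds.
  return Promise.race([running, timeout]).finally(() => clearTimeout(timer));
}
```

With a wrapper like this around entry.task(), a hung task would surface as a rejection in the existing catch path, which gives the lane a chance to advance. It does not stop the underlying work, which is the "zombie task" concern noted in (a).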

Open Questions

  • Is the lack of task-level timeout in pump() deliberate? (e.g., to avoid killing legitimately long-running tasks like compaction)
  • What's the expected maximum task duration for session lane work?
  • Would a configurable taskTimeoutMs option on enqueueCommandInLane be acceptable?
  • Should the fix prioritize the queue level, the API call level, or both?

Metadata

Labels

  • P2: Normal backlog priority with limited blast radius.
  • clawsweeper:linked-pr-open: ClawSweeper found an open linked pull request for this issue.
  • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • clawsweeper:not-repro-on-main: ClawSweeper found high-confidence evidence that this issue no longer reproduces on main.
  • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🦪 silver shellfish: Thin issue quality; more reproduction proof or environment detail is needed.
  • stale: Marked as stale due to inactivity
