Session lane can deadlock indefinitely after LLM timeout, blocking all subsequent messages #7630

@13004545

Bug Description

After an LLM request times out (180s), the session lane can get stuck indefinitely. Subsequent messages are enqueued but never dequeued or processed. Recovery requires a gateway restart.

Steps to Reproduce

  1. Send a message that triggers a long-running LLM request
  2. Wait for the request to time out (180s) with FailoverError: LLM request timed out.
  3. Immediately after the timeout, if certain hooks are running (e.g., /new command with slug generation), the next task in the session lane starts but never completes
  4. All subsequent messages pile up in the queue indefinitely

Log Evidence

12:31:14 lane task error: lane=session:qq:dm:... error="FailoverError: LLM request timed out."
12:31:14 lane enqueue: lane=session:qq:dm:... queueSize=1
12:31:14 lane dequeue: lane=session:qq:dm:... queueSize=0
# ^^^ This dequeued task NEVER completes - no "lane task done" or "lane task error"

12:37:14 lane enqueue: lane=session:qq:dm:... queueSize=2
12:38:21 lane enqueue: lane=session:qq:dm:... queueSize=3
12:38:32 lane enqueue: lane=session:qq:dm:... queueSize=4
12:42:32 lane enqueue: lane=session:qq:dm:... queueSize=5
# Queue keeps growing, no dequeue ever happens again

Meanwhile, other lanes (cron, etc.) continue working normally, showing this is a per-session deadlock, not a global hang.

Root Cause Analysis

The nested queue pattern in run.js:

return enqueueSession(() => enqueueGlobal(async () => { ... }));

If the inner task (after session lane dequeue) encounters an unhandled exception or a Promise that never resolves, the session lane's active count is never decremented, blocking all subsequent messages.
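To illustrate the failure mode, here is a minimal sketch of a single-slot lane. It is a hypothetical model, not the actual command-queue.js: createLane, enqueue, and the shape of state are assumptions made for the example. In this model a task that throws or rejects still releases the lane, while a task whose Promise never settles holds it forever. Whether the real code treats exceptions the same way is not confirmed here.

```javascript
// Hypothetical minimal lane: at most one active task, FIFO queue.
// This models the behaviour described above, not the real command-queue.js.
function createLane() {
  const state = { active: 0, queue: [] };

  function pump() {
    if (state.active > 0 || state.queue.length === 0) return;
    const { task, resolve, reject } = state.queue.shift();
    state.active += 1;
    // Promise.resolve().then(task) turns a synchronous throw into a rejection.
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        // Runs only once the task settles. A Promise that never settles
        // never reaches this point, so the lane stays occupied.
        state.active -= 1;
        pump();
      });
  }

  function enqueue(task) {
    return new Promise((resolve, reject) => {
      state.queue.push({ task, resolve, reject });
      pump();
    });
  }

  return { enqueue, state };
}

const lane = createLane();
lane.enqueue(() => new Promise(() => {})); // hung task: never settles
lane.enqueue(() => "second message");      // queued behind the hung task
lane.enqueue(() => "third message");
// lane.state.active is 1 and lane.state.queue.length is 2, and both stay
// that way: the same "enqueue without dequeue" pattern as in the logs.
```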

Environment

  • Version: 2026.1.24-3
  • Channel: QQ (custom plugin)
  • OS: macOS

Workaround

Added a 5-minute timeout to command-queue.js that forcibly releases the lane if a task doesn't complete:

const LANE_TASK_TIMEOUT_MS = 5 * 60 * 1000;
// ... timeout logic that calls state.active -= 1 and pump() after timeout
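The elided timeout logic could look roughly like the sketch below. This is a guess at the shape of the workaround under stated assumptions, not the patch that was actually applied: withTimeout and createLane are hypothetical names, and only LANE_TASK_TIMEOUT_MS, state.active, and pump() come from the snippet above.

```javascript
const LANE_TASK_TIMEOUT_MS = 5 * 60 * 1000;

// Hypothetical helper: settles with the task's outcome, or rejects after
// timeoutMs if the task has not settled by then.
function withTimeout(task, timeoutMs = LANE_TASK_TIMEOUT_MS) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`lane task timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => clearTimeout(timer));
  });
}

// Hypothetical single-slot lane that wraps every task in withTimeout.
function createLane(timeoutMs = LANE_TASK_TIMEOUT_MS) {
  const state = { active: 0, queue: [] };

  function pump() {
    if (state.active > 0 || state.queue.length === 0) return;
    const { task, resolve, reject } = state.queue.shift();
    state.active += 1;
    withTimeout(task, timeoutMs)
      .then(resolve, reject)
      .finally(() => {
        state.active -= 1; // released on completion, error, or timeout
        pump();
      });
  }

  function enqueue(task) {
    return new Promise((resolve, reject) => {
      state.queue.push({ task, resolve, reject });
      pump();
    });
  }

  return { enqueue, state };
}
```

Note that releasing the lane this way does not cancel the hung work. The original task may still be running in the background, so a complete fix would probably also need some form of cancellation (for example an AbortSignal passed to the task).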

Suggested Fix

  1. Add a built-in timeout for lane tasks (configurable)
  2. Investigate why certain message processing chains (especially /new command hooks) can leave tasks in a hung state
  3. Consider adding a lane health check that detects and recovers from stuck lanes
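As a rough illustration of suggestion 3, a health check could scan lane state for tasks that have been active too long. The sketch below is hypothetical: findStuckLanes, the Map-based lane registry, and the activeSince field are assumptions made for the example and do not exist in the current code as far as this report knows.

```javascript
// Hypothetical lane registry: lane name -> { active, queue, activeSince }.
// activeSince is an assumed field holding the Date.now() value recorded
// when the current task was dequeued (null when the lane is idle).
function findStuckLanes(lanes, now, maxActiveMs) {
  const stuck = [];
  for (const [name, state] of lanes) {
    const busyForMs = state.activeSince === null ? 0 : now - state.activeSince;
    if (state.active > 0 && busyForMs > maxActiveMs) {
      stuck.push({ name, busyForMs, queueSize: state.queue.length });
    }
  }
  return stuck;
}

// Example: a session lane busy for 6 minutes against a 5-minute limit,
// next to an idle cron lane. Lane names here are placeholders.
const lanes = new Map([
  ["session:example", { active: 1, queue: ["m1", "m2"], activeSince: 0 }],
  ["cron", { active: 0, queue: [], activeSince: null }],
]);
const stuck = findStuckLanes(lanes, 6 * 60 * 1000, 5 * 60 * 1000);
// stuck contains one entry: session:example, busyForMs 360000, queueSize 2
```

A periodic timer could call such a function and either log the result or force-release the reported lanes, which would make the per-session nature of the hang visible without waiting for a user report.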

Labels: bug, reliability


Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), stale (Marked as stale due to inactivity)
