fix: add task-level timeout to lane queue to prevent permanent session blocking #899

Open
BingqingLyu wants to merge 2 commits into main from fork-pr-48690-fix-lane-task-timeout
BingqingLyu (Owner) commented Apr 27, 2026

Summary

  • Add Promise.race timeout wrapper around await entry.task() in pump() to prevent hung promises from permanently jamming session lanes
  • New CommandLaneTaskTimeoutError error class for timed-out tasks
  • New taskTimeoutMs option on enqueueCommandInLane (default: 5 minutes in the first commit, raised to 15 minutes in the second; pass 0 or Infinity to disable)
  • Diagnostic warning logged when a task times out (lane task timed out: lane=... timeoutMs=...)
  • timer.unref() prevents timeout timers from keeping the process alive
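
The timeout wrapper described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `runWithTimeout` and the constructor shape of `CommandLaneTaskTimeoutError` are assumptions; only the error name, the `0`/`Infinity` opt-out, the log message prefix, and the use of `Promise.race` with `timer.unref()` come from the description.

```typescript
// Sketch only: the constructor shape and the helper name are assumptions.
class CommandLaneTaskTimeoutError extends Error {
  readonly lane: string;
  readonly timeoutMs: number;
  constructor(lane: string, timeoutMs: number) {
    super(`lane task timed out: lane=${lane} timeoutMs=${timeoutMs}`);
    this.name = "CommandLaneTaskTimeoutError";
    this.lane = lane;
    this.timeoutMs = timeoutMs;
  }
}

// Hypothetical helper: races `task()` against a timer.
async function runWithTimeout<T>(
  lane: string,
  task: () => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  // 0 and Infinity (and any non-positive or non-finite value) disable the timeout.
  if (!Number.isFinite(timeoutMs) || timeoutMs <= 0) {
    return task();
  }
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new CommandLaneTaskTimeoutError(lane, timeoutMs)),
      timeoutMs,
    );
    // In Node.js the handle has unref(); calling it stops a pending timeout
    // from keeping the process alive. The cast keeps this compiling where
    // setTimeout returns a number.
    (timer as unknown as { unref?: () => void }).unref?.();
  });
  try {
    return await Promise.race([task(), timeout]);
  } finally {
    // Always clear the timer so a finished task leaves nothing behind.
    clearTimeout(timer);
  }
}
```

Note that losing the race does not cancel the underlying task; it only lets the caller stop waiting for it.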

Problem

When an enqueued task's promise never settles (hung upstream API call, dropped WebSocket, unhandled exception), completeTask() never runs, pump() is never called again, and the session lane is permanently blocked. Session lanes use maxConcurrent=1, so one stuck task blocks all future messages for that session. The only recovery is a full gateway restart via SIGUSR1 (resetAllLanes()).
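
The failure mode can be reproduced with a toy lane. The `ToyLane` class below is purely illustrative and is not the real command queue; it only mirrors the property described above, namely that the next task is pumped from the completion handler of the current one.

```typescript
// Toy lane, not the real src/process/command-queue.ts: all names are illustrative.
type Task = () => Promise<void>;

class ToyLane {
  private readonly queue: Task[] = [];
  private active = 0;
  private readonly maxConcurrent: number;

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent;
  }

  get pending(): number {
    return this.queue.length;
  }

  enqueue(task: Task): void {
    this.queue.push(task);
    this.pump();
  }

  private pump(): void {
    while (this.active < this.maxConcurrent && this.queue.length > 0) {
      const task = this.queue.shift()!;
      this.active += 1;
      // The completion handler is the only place that frees the slot and
      // pumps again, so it never runs if the promise never settles.
      void task()
        .catch(() => undefined)
        .finally(() => {
          this.active -= 1;
          this.pump();
        });
    }
  }
}

const started: string[] = [];
const lane = new ToyLane(1); // session lanes use maxConcurrent=1

lane.enqueue(() => {
  started.push("hung");
  return new Promise<void>(() => {}); // never settles
});
lane.enqueue(async () => {
  started.push("next");
});
// "next" is never started: it stays in the queue behind the hung task.
```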

This affects all messaging channels (WhatsApp, Telegram, Discord, webchat) and cron jobs. See openclaw#48488 for full root cause analysis with live diagnostic evidence.

Changes

src/process/command-queue.ts:

  • New CommandLaneTaskTimeoutError error class
  • DEFAULT_TASK_TIMEOUT_MS constant (5 minutes initially; raised to 15 minutes by the second commit)
  • Extended QueueEntry with optional taskTimeoutMs field
  • pump(): race each task against a timeout promise; on timeout, reject the entry, clear activeTaskIds, log warning, and call pump() to unblock
  • Extended enqueueCommandInLane opts with taskTimeoutMs
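
A simplified model of the `pump()` change listed above might look like the sketch below. It is not the code in `src/process/command-queue.ts`: only `pump`, `activeTaskIds`, `taskTimeoutMs`, `DEFAULT_TASK_TIMEOUT_MS`, and the error name are taken from this PR, while `LaneState`, `runEntry`, `enqueue`, the `warn` field, and the sentinel-based race are assumptions made for illustration.

```typescript
// Simplified model; every name and shape not mentioned in the PR is an assumption.
const DEFAULT_TASK_TIMEOUT_MS = 5 * 60 * 1000; // value stated in the PR summary

class CommandLaneTaskTimeoutError extends Error {
  constructor(lane: string, timeoutMs: number) {
    super(`lane task timed out: lane=${lane} timeoutMs=${timeoutMs}`);
    this.name = "CommandLaneTaskTimeoutError";
  }
}

interface QueueEntry {
  id: number;
  task: () => Promise<unknown>;
  resolve: (value: unknown) => void;
  reject: (reason: unknown) => void;
  taskTimeoutMs?: number;
}

interface LaneState {
  name: string;
  queue: QueueEntry[];
  activeTaskIds: Set<number>;
  maxConcurrent: number;
  warn: (message: string) => void; // stand-in for the diagnostic logger
}

const TIMED_OUT = Symbol("timed-out");
let nextTaskId = 0;

function pump(lane: LaneState): void {
  while (lane.activeTaskIds.size < lane.maxConcurrent && lane.queue.length > 0) {
    const entry = lane.queue.shift()!;
    lane.activeTaskIds.add(entry.id);
    void runEntry(lane, entry);
  }
}

async function runEntry(lane: LaneState, entry: QueueEntry): Promise<void> {
  const timeoutMs = entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS;
  let timer: ReturnType<typeof setTimeout> | undefined;
  let timedOut = false;
  try {
    const contenders: Promise<unknown>[] = [entry.task()];
    if (Number.isFinite(timeoutMs) && timeoutMs > 0) {
      contenders.push(
        new Promise((resolve) => {
          timer = setTimeout(() => resolve(TIMED_OUT), timeoutMs);
          (timer as unknown as { unref?: () => void }).unref?.();
        }),
      );
    }
    const outcome = await Promise.race(contenders);
    timedOut = outcome === TIMED_OUT;
    if (timedOut) entry.reject(new CommandLaneTaskTimeoutError(lane.name, timeoutMs));
    else entry.resolve(outcome);
  } catch (err) {
    entry.reject(err); // the task itself failed before the timeout
  } finally {
    clearTimeout(timer);
    lane.activeTaskIds.delete(entry.id); // free the slot even if the task is still hung
    if (timedOut) lane.warn(`lane task timed out: lane=${lane.name} timeoutMs=${timeoutMs}`);
    pump(lane); // unblock the next queued entry
  }
}

// Hypothetical stand-in for enqueueCommandInLane with its taskTimeoutMs option.
function enqueue<T>(
  lane: LaneState,
  task: () => Promise<T>,
  opts: { taskTimeoutMs?: number } = {},
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    lane.queue.push({
      id: nextTaskId++,
      task,
      resolve: (value) => resolve(value as T),
      reject,
      taskTimeoutMs: opts.taskTimeoutMs,
    });
    pump(lane);
  });
}
```

With this shape, a caller would opt into a shorter limit with something like `enqueue(lane, task, { taskTimeoutMs: 30_000 })`, and a hung task rejects its own caller while the next queued entry starts.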

src/process/command-queue.test.ts:

  • 6 new tests covering: stuck task timeout + lane unblock, fast task completion, custom per-task timeouts, timeout disable (taskTimeoutMs: 0), diagnostic logging, and safe interaction with resetAllLanes generation bumps

Test plan

  • pnpm build passes
  • pnpm test -- src/process/command-queue.test.ts — all 23 tests pass (17 existing + 6 new)
  • pnpm check — lint/format clean
  • Deploy to staging gateway and verify webchat sessions auto-recover from stuck lanes

Closes openclaw#48488
Related: openclaw#42883, openclaw#42960, openclaw#42997, openclaw#29601

🤖 Generated with Claude Code

kyletabor and others added 2 commits March 16, 2026 20:09
…n blocking

When an enqueued task's promise never settles (e.g. hung upstream API
call), the lane is permanently jammed because `pump()` is never called
again. Session lanes use maxConcurrent=1, so one stuck task blocks all
future messages for that session with no automatic recovery — only a
full gateway restart (SIGUSR1) clears the stale state.

Wrap each dequeued task in `Promise.race` against a configurable timeout
(default 5 minutes). When the timeout wins, reject the task with
`CommandLaneTaskTimeoutError`, clean up `activeTaskIds`, log a
diagnostic warning, and call `pump()` to unblock the lane.

Callers can set per-task timeouts via `taskTimeoutMs` on
`enqueueCommandInLane` opts. Pass `0` or `Infinity` to opt out.

Closes openclaw#48488
Related: openclaw#42883, openclaw#42960, openclaw#42997, openclaw#29601

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…efault timeout

- Gate timeout diag.warn on completedCurrentGeneration; stale post-reset
  timeouts downgrade to diag.debug to avoid misleading on-call noise
- Capture activeTaskIds.size before completeTask() removal so the
  timeout warning reports the pre-removal active count
- Increase DEFAULT_TASK_TIMEOUT_MS from 5 to 15 minutes — the lane
  timeout is a last-resort safety net above the agent-level timeout
  (default 600s / 10 min), so it must be higher to avoid killing
  legitimate long-running tasks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
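
The logging changes in the second commit can be pictured with the sketch below. It assumes a per-lane `generation` counter that a reset bumps and a `completeTask()` that reports whether the task belonged to the current generation; those shapes, the `resetLane` and `reportTimeout` helpers, and the `active=` log field are all assumptions. Only `completedCurrentGeneration`, the `diag.warn`/`diag.debug` split, and the pre-removal capture of `activeTaskIds.size` come from the commit message.

```typescript
// Illustrative shapes; the real lane state and diag logger may differ.
interface LaneState {
  name: string;
  generation: number;
  activeTaskIds: Set<number>;
}

interface Diag {
  warn(message: string): void;
  debug(message: string): void;
}

// Hypothetical reset: bump the generation and drop all active ids.
function resetLane(lane: LaneState): void {
  lane.generation += 1;
  lane.activeTaskIds.clear();
}

// Hypothetical completeTask: only tasks from the current generation touch
// the lane's bookkeeping. Returns whether this task was current.
function completeTask(lane: LaneState, taskId: number, taskGeneration: number): boolean {
  if (taskGeneration !== lane.generation) return false;
  lane.activeTaskIds.delete(taskId);
  return true;
}

function reportTimeout(
  lane: LaneState,
  taskId: number,
  taskGeneration: number,
  timeoutMs: number,
  diag: Diag,
): void {
  // Capture the count first so the message reflects the lane as it was
  // when the timeout fired, including the task that timed out.
  const activeBefore = lane.activeTaskIds.size;
  const completedCurrentGeneration = completeTask(lane, taskId, taskGeneration);
  const message =
    `lane task timed out: lane=${lane.name} timeoutMs=${timeoutMs} active=${activeBefore}`;
  if (completedCurrentGeneration) {
    diag.warn(message);
  } else {
    // The lane was reset after this task started; the timeout is stale.
    diag.debug(message);
  }
}
```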
Development

Successfully merging this pull request may close these issues.

Lane queue has no task-level timeout — hung promises permanently block session lanes