Lane queue has no task-level timeout — hung promises permanently block session lanes

## Summary

Session lanes in the gateway's command queue (`src/process/command-queue.ts`) have no task-level timeout. If an enqueued task's promise never settles, the lane is permanently jammed with no automatic recovery. This affects all messaging channels and cron.

## Symptom

The webchat session stops responding permanently. The gateway itself stays healthy (memory, CPU, and `/health` all normal), but the session lane is dead:

```
[diagnostic] lane wait exceeded: lane=session:agent:main:treehouse-sess-mmthhrfo-6vvcjy waitedMs=215576 queueAhead=2
[diagnostic] lane wait exceeded: lane=session:agent:main:treehouse-sess-mmthhrfo-6vvcjy waitedMs=178957 queueAhead=1
[diagnostic] lane wait exceeded: lane=session:agent:main:treehouse-sess-mmthhrfo-6vvcjy waitedMs=91671  queueAhead=1
http: proxy error: context canceled
http: proxy error: context canceled
```

New messages queue up behind the stuck task and wait forever. The session never recovers without a gateway restart.

## Root Cause

In `pump()` (`src/process/command-queue.ts`, lines 118-143), each dequeued task is awaited with no timeout protection:

```typescript
void (async () => {
  try {
    const result = await entry.task();  // <-- no timeout, hangs forever if promise never settles
    const completedCurrentGeneration = completeTask(state, taskId, taskGeneration);
    if (completedCurrentGeneration) {
      pump();  // <-- never reached
    }
    entry.resolve(result);
  } catch (err) {
    // ... also never reached
  }
})();
```

If `entry.task()` never resolves or rejects:
1. `completeTask()` never runs
2. `activeTaskIds` retains the stale task ID
3. `pump()` is never called again
4. Since session lanes are `maxConcurrent=1` (hardcoded in `getLaneState`, lines 67-74), the lane is permanently blocked

The only recovery is `resetAllLanes()` (lines 251-266), which requires a SIGUSR1 gateway restart. There is no automatic detection, health check, or recovery mechanism for stuck lanes.
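
The stall can be reproduced in isolation. The sketch below is a deliberately simplified, hypothetical lane (`MiniLane` and its members are illustrative names, not the real `command-queue.ts` code). It only mirrors the two properties described above: `maxConcurrent=1` and an unguarded `await` on the task.

```typescript
// Hypothetical, simplified model of a single-concurrency lane.
// NOT the gateway's implementation; it only mirrors the unguarded await.
type Task<T> = () => Promise<T>;

interface Entry {
  task: Task<unknown>;
  resolve: (value: unknown) => void;
  reject: (reason: unknown) => void;
}

class MiniLane {
  private queue: Entry[] = [];
  private active = 0; // stands in for activeTaskIds.size

  get queued(): number {
    return this.queue.length;
  }

  get running(): number {
    return this.active;
  }

  enqueue<T>(task: Task<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.queue.push({ task, resolve: (value) => resolve(value as T), reject });
      this.pump();
    });
  }

  private pump(): void {
    if (this.active >= 1) return; // maxConcurrent = 1
    const entry = this.queue.shift();
    if (!entry) return;
    this.active += 1;
    void (async () => {
      try {
        const result = await entry.task(); // no timeout: may never resume
        entry.resolve(result);
      } catch (err) {
        entry.reject(err);
      } finally {
        // Skipped forever if the await above never resumes.
        this.active -= 1;
        this.pump();
      }
    })();
  }
}

const lane = new MiniLane();
void lane.enqueue(() => new Promise<never>(() => {})); // first task never settles
let secondRan = false;
void lane.enqueue(async () => {
  secondRan = true; // never executes: the lane is jammed behind the first task
});

setTimeout(() => {
  // logs: { running: 1, queued: 1, secondRan: false }
  console.log({ running: lane.running, queued: lane.queued, secondRan });
}, 100);
```

The second task stays queued for as long as the process lives, which matches the `queueAhead` values that never drain in the diagnostic log.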

## How It Happens

Any scenario where an enqueued task's promise hangs:
- Upstream API call (Anthropic, OpenAI, etc.) hangs without responding or erroring
- WebSocket connection drops mid-request without clean error propagation
- `AbortSignal` from `scheduleAbortTimer` fires but the underlying HTTP fetch doesn't honor it
- Unhandled exception path in async task code that prevents the promise from settling

The agent runner's internal timeout (`scheduleAbortTimer` in `run/attempt.ts`) only works if the task code checks the abort signal. If the underlying fetch call is hung at the OS/socket level, the abort signal may not terminate it, and the lane queue's `await entry.task()` remains suspended indefinitely.
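
To illustrate why firing the signal is not enough on its own, here is a small sketch. All function names are hypothetical, and `ignoresSignal` merely simulates a call that is hung below the level where the signal is observed; it is not code from the agent runner.

```typescript
// Illustrative only: none of these functions exist in the codebase.

// Accepts a signal but never observes it, like a fetch hung at the socket level.
function ignoresSignal(_signal: AbortSignal): Promise<string> {
  return new Promise<string>(() => {
    /* nothing ever resolves or rejects */
  });
}

// Subscribes to the signal and rejects as soon as it fires.
function honorsSignal(signal: AbortSignal): Promise<string> {
  return new Promise<string>((_resolve, reject) => {
    if (signal.aborted) {
      reject(signal.reason);
      return;
    }
    signal.addEventListener("abort", () => reject(signal.reason), { once: true });
  });
}

// Reports whether a promise settles (either way) within `ms` milliseconds.
function settlesWithin(p: Promise<unknown>, ms: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), ms);
    const done = () => {
      clearTimeout(timer);
      resolve(true);
    };
    p.then(done, done);
  });
}

async function demo(): Promise<{ ignored: boolean; honored: boolean }> {
  const controller = new AbortController();
  const ignored = settlesWithin(ignoresSignal(controller.signal), 50);
  const honored = settlesWithin(honorsSignal(controller.signal), 50);
  controller.abort(new Error("attempt timed out")); // what an abort timer would do
  return { ignored: await ignored, honored: await honored };
}
```

`demo()` resolves to `{ ignored: false, honored: true }`: the abort fires in both cases, but only the task that listens for it ever settles. A lane awaiting the first kind of task stays suspended.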

## Affected Channels

All channels route through the same lane system via `enqueueCommandInLane` with session lanes (`maxConcurrent=1`):
- WhatsApp (web provider)
- Telegram
- Discord
- Webchat/Treehouse
- Cron jobs (via `CommandLane.Cron`)

## Environment

- OpenClaw v2026.3.12 (Docker, linux/arm64)
- Node 22
- Gateway healthy during incident: `{"ok":true,"status":"live"}`
- No OOM, no CPU spike, no rate limiting
- Concurrent event: WhatsApp health-monitor restarted due to `stale-socket` at 20:25:19, shortly before the lane jammed

## Related Issues

- #42883 — Cron flows break after v2026.3.8 upgrade (5+ users, multiple OSes/providers)
- #42960 — Cron enqueued but never executes (still open)
- #42997 — Manual cron run enqueues but idle
- #29601 — `cron run` always times out at 30s
- #15623 — Session write lock race condition under concurrency (closed as stale, never fixed)

These all appear to share the same underlying pattern: work enters the lane queue and never completes, with no automatic recovery.

## Suggested Fix Directions

For maintainer consideration — several approaches could address this, each with trade-offs:

**a) `Promise.race` wrapper in `pump()`** — Race each task against a configurable timeout promise. If the timeout wins, reject the entry, clear `activeTaskIds`, and call `pump()`. Simple and targeted, but creates "zombie task" concerns (the original hung promise keeps running in the background).
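
A minimal sketch of option (a), assuming a hypothetical `withTimeout` helper and `TaskTimeoutError` class (neither exists in the codebase today). How it would be wired into `pump()` and `completeTask()` is left open.

```typescript
// Hypothetical helper for option (a); names are illustrative.
class TaskTimeoutError extends Error {
  constructor(readonly timeoutMs: number) {
    super(`task exceeded ${timeoutMs}ms`);
    this.name = "TaskTimeoutError";
  }
}

async function withTimeout<T>(task: () => Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new TaskTimeoutError(timeoutMs)), timeoutMs);
  });
  try {
    // Whichever settles first wins. A task that loses the race is NOT cancelled;
    // it keeps running in the background (the "zombie task" concern).
    return await Promise.race([task(), timeout]);
  } finally {
    clearTimeout(timer); // avoid a stray timer when the task wins
  }
}
```

Inside `pump()`, the unguarded call could then become something like `await withTimeout(entry.task, taskTimeoutMs)`, where `taskTimeoutMs` is an assumed per-lane or per-entry setting. A timeout lands in the existing `catch` path, so the lane can release the task ID and pump again.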

**b) Periodic lane health monitor** — A background interval that checks for lanes where `activeTaskIds.size > 0` and no progress has been made for N seconds. Could auto-clear stale tasks or trigger `resetAllLanes()` for just the affected lane. More defensive but adds runtime complexity.
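
A rough sketch of option (b). The `LaneSnapshot` shape is an assumption: in particular, `lastProgressAt` is a hypothetical field that the real lane state would need to start tracking, and the recovery action is left to a caller-supplied callback.

```typescript
// Hypothetical shapes; the real lane state in command-queue.ts may differ.
interface LaneSnapshot {
  lane: string;
  activeTaskIds: Set<number>;
  lastProgressAt: number; // epoch ms of the last task start or completion (assumed field)
}

// Pure check: lanes with active work and no progress for at least `stallMs`.
function findStalledLanes(lanes: LaneSnapshot[], now: number, stallMs: number): string[] {
  const stalled: string[] = [];
  for (const state of lanes) {
    if (state.activeTaskIds.size > 0 && now - state.lastProgressAt >= stallMs) {
      stalled.push(state.lane);
    }
  }
  return stalled;
}

// Periodically runs the check and hands each stalled lane to `onStalled`.
// Returns a function that stops the monitor.
function startLaneMonitor(
  getLanes: () => LaneSnapshot[],
  onStalled: (lane: string) => void,
  stallMs: number,
  intervalMs: number,
): () => void {
  const timer = setInterval(() => {
    for (const lane of findStalledLanes(getLanes(), Date.now(), stallMs)) {
      onStalled(lane);
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Keeping the detection pure makes the threshold easy to test. The `onStalled` callback is where the policy would live, for example logging only, or clearing the stale task ID for that single lane.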

**c) Better abort signal propagation** — Ensure `scheduleAbortTimer` actually terminates the underlying HTTP fetch (via `AbortController` on the fetch call itself, not just the agent-level signal). Fixes the root cause but requires changes deeper in the API call stack.
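
A sketch of what option (c) could look like at the call site, assuming the global `fetch` that ships with Node 18 and later. `fetchWithDeadline` is a hypothetical helper; the provider clients may issue requests through an SDK rather than calling `fetch` directly, in which case the signal would need to be passed through that SDK's own options.

```typescript
// Hypothetical helper for option (c); not an existing function in the codebase.
// Links an optional caller signal and a hard deadline to the request itself.
async function fetchWithDeadline(
  url: string,
  init: RequestInit,
  deadlineMs: number,
  outer?: AbortSignal,
): Promise<Response> {
  const controller = new AbortController();
  const onOuterAbort = () => controller.abort(outer?.reason);
  if (outer?.aborted) {
    onOuterAbort();
  } else {
    outer?.addEventListener("abort", onOuterAbort, { once: true });
  }

  const timer = setTimeout(
    () => controller.abort(new Error(`no response within ${deadlineMs}ms`)),
    deadlineMs,
  );
  try {
    // The signal is attached to the request, so aborting rejects this await.
    return await fetch(url, { ...init, signal: controller.signal });
  } finally {
    clearTimeout(timer);
    outer?.removeEventListener("abort", onOuterAbort);
  }
}
```

One limitation of this sketch: the deadline only covers the time until response headers arrive. A stream that stalls while the body is being read would need the timer and listener kept alive until the body has been consumed.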

**d) Combination** — Defense in depth: fix abort propagation (c) to prevent most hangs, add a queue-level timeout (a) as a safety net, and a health monitor (b) as a last resort.

## Open Questions

- Is the lack of task-level timeout in `pump()` deliberate? (e.g., to avoid killing legitimately long-running tasks like compaction)
- What's the expected maximum task duration for session lane work?
- Would a configurable `taskTimeoutMs` option on `enqueueCommandInLane` be acceptable?
- Should the fix prioritize the queue level, the API call level, or both?
