Session lane permanently blocked after embedded run timeout during tool calls #9405

@TheSoulGiver

Description

Bug

After an embedded run timeout fires during a tool call chain, the session lane remains permanently blocked. The session becomes completely unresponsive — no new messages are processed. The only recovery is creating a new session.

Root Cause Analysis

Traced through the source code in extensionAPI.js:

1. Lane concurrency is 1 per session

const created = {
    lane,
    queue: [],
    active: 0,
    maxConcurrent: 1,  // only one task at a time
    draining: false
};
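To make the failure mode concrete, here is a minimal model of how a single-slot lane behaves (hypothetical names, not the real extensionAPI.js): a task runs only when active < maxConcurrent, and the slot is returned only when the task's promise settles.

```javascript
// Hypothetical minimal lane model (names assumed, not the actual source).
function createLane(maxConcurrent = 1) {
    const state = { queue: [], active: 0, maxConcurrent };

    function pump() {
        while (state.active < state.maxConcurrent && state.queue.length > 0) {
            const entry = state.queue.shift();
            state.active += 1;  // take the single slot
            entry.task()
                .then(entry.resolve, entry.reject)
                .finally(() => {       // the slot is released ONLY here
                    state.active -= 1;
                    pump();
                });
        }
    }

    return {
        state,
        run(task) {
            return new Promise((resolve, reject) => {
                state.queue.push({ task, resolve, reject });
                pump();
            });
        }
    };
}
```

If a task's promise never settles (e.g. an abort that hangs), the finally handler never runs, active stays at 1 forever, and pump() never starts another task — exactly the blocked-lane symptom described here.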

2. The 600s timeout covers the entire run, not individual tool calls

The timeout wraps the full agent run (all API calls + all tool executions combined). With large context (~167k tokens) and a high-latency API proxy, a single API round-trip can take 3-4 minutes. Two or three tool calls easily exceed 600 seconds total.
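A sketch of why the budget is shared (assumed structure, not the real code): one timer races the entire run, so sequential round-trips of ~200s each blow through 600s even though no single call is slow.

```javascript
// Hypothetical run-level timeout: one timer covers every API call and tool
// execution in the run, so their latencies accumulate against a single budget.
async function runWithTimeout(steps, timeoutMs) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
        for (const step of steps) {
            if (controller.signal.aborted) throw new Error('embedded run timeout');
            await step(controller.signal);  // each await eats the shared budget
        }
    } finally {
        clearTimeout(timer);
    }
}
```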

3. Abort doesn't fully clean up session state

When timeout fires:

const abortRun = (isTimeout) => {
    aborted = true;
    runAbortController.abort();
    activeSession.abort();  // ← may not complete cleanup
};

If the tool call (e.g., exec waiting on a subprocess, or memory_search doing embeddings) doesn't properly respond to the abort signal, activeSession.abort() doesn't fully release the session. The session state remains "processing", and state.active in the lane is never decremented back to 0.

4. Lane queue has no timeout — tasks wait forever

// drainLane() only processes when active < maxConcurrent
// If active is stuck at 1, nothing ever drains
while (state.active < state.maxConcurrent && state.queue.length > 0) { ... }

Subsequent messages enqueue but never execute. This produces the lane wait exceeded diagnostic warnings seen in logs (up to 205 seconds observed).

Evidence from Logs

[agent/embedded] embedded run timeout: runId=... timeoutMs=600000
[diagnostic] lane wait exceeded: lane=session:agent:main:main waitedMs=205145 queueAhead=0
[diagnostic] lane wait exceeded: lane=session:agent:main:main waitedMs=97423 queueAhead=0

Pattern: timeout → lane wait grows unbounded → session never recovers.

Related: context at 167k/200k tokens contributes to slow API calls, but the core bug is the lane not being released after abort.

Suggested Fix

  1. Ensure state.active is decremented in all abort paths — add a finally block or abort handler in the lane task wrapper to guarantee cleanup:

    try {
        const result = await entry.task();
        entry.resolve(result);
    } catch (e) {
        entry.reject(e);
    } finally {
        state.active -= 1;  // always release the lane slot, even on abort
        pump();
    }
  2. Add a lane queue timeout — tasks waiting longer than a configurable threshold (e.g., 5 minutes) should be rejected rather than waiting forever.

  3. Per-tool-call timeout — in addition to the run-level timeout, each tool execution should have its own timeout to prevent a single slow tool from consuming the entire budget.
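Fix 3 could be sketched as follows (assumed wrapper, names hypothetical): race each tool execution against its own timer, independent of the run-level budget.

```javascript
// Hypothetical per-tool-call timeout: each tool execution gets its own
// deadline, so one slow tool cannot silently consume the whole run budget.
function withToolTimeout(toolFn, timeoutMs) {
    return async (...args) => {
        let timer;
        const deadline = new Promise((_, reject) => {
            timer = setTimeout(
                () => reject(new Error(`tool call exceeded ${timeoutMs}ms`)),
                timeoutMs
            );
        });
        try {
            return await Promise.race([toolFn(...args), deadline]);
        } finally {
            clearTimeout(timer);  // don't leave the timer holding the event loop
        }
    };
}
```

A rejected tool call then surfaces as an ordinary tool error the agent can report, instead of burning the remaining run budget.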

Environment

  • OpenClaw version: 2026.2.2-3
  • OS: macOS (Darwin 25.2.0)
  • API: Anthropic via proxy
  • Model: claude-opus-4-5
  • Config: maxConcurrent=4, contextPruning ttl=1h
