Session lane permanently blocked after embedded run timeout during tool calls #9405

@TheSoulGiver

Description

Bug

After an embedded run timeout fires during a tool call chain, the session lane remains permanently blocked. The session becomes completely unresponsive — no new messages are processed. The only recovery is creating a new session.

Root Cause Analysis

Traced through the source code in extensionAPI.js:

1. Lane concurrency is 1 per session

const created = {
    lane,
    queue: [],
    active: 0,
    maxConcurrent: 1,  // only one task at a time
    draining: false
};
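To make the failure mode concrete, here is a minimal model of how a single-slot lane behaves (hypothetical names, not the real extensionAPI.js): a task runs only when active < maxConcurrent, and the slot is returned only when the task's promise settles.

```javascript
// Hypothetical minimal lane model (names assumed, not the actual source).
function createLane(maxConcurrent = 1) {
    const state = { queue: [], active: 0, maxConcurrent };

    function pump() {
        while (state.active < state.maxConcurrent && state.queue.length > 0) {
            const entry = state.queue.shift();
            state.active += 1;  // take the single slot
            entry.task()
                .then(entry.resolve, entry.reject)
                .finally(() => {       // the slot is released ONLY here
                    state.active -= 1;
                    pump();
                });
        }
    }

    return {
        state,
        run(task) {
            return new Promise((resolve, reject) => {
                state.queue.push({ task, resolve, reject });
                pump();
            });
        }
    };
}
```

If a task's promise never settles (e.g. an abort that hangs), the finally handler never runs, active stays at 1 forever, and pump() never starts another task — exactly the blocked-lane symptom described here.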

2. The 600s timeout covers the entire run, not individual tool calls

The timeout wraps the full agent run (all API calls + all tool executions combined). With large context (~167k tokens) and a high-latency API proxy, a single API round-trip can take 3-4 minutes. Two or three tool calls easily exceed 600 seconds total.
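A sketch of why the budget is shared (assumed structure, not the real code): one timer races the entire run, so sequential round-trips of ~200s each blow through 600s even though no single call is slow.

```javascript
// Hypothetical run-level timeout: one timer covers every API call and tool
// execution in the run, so their latencies accumulate against a single budget.
async function runWithTimeout(steps, timeoutMs) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
        for (const step of steps) {
            if (controller.signal.aborted) throw new Error('embedded run timeout');
            await step(controller.signal);  // each await eats the shared budget
        }
    } finally {
        clearTimeout(timer);
    }
}
```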

3. Abort doesn't fully clean up session state

When timeout fires:

const abortRun = (isTimeout) => {
    aborted = true;
    runAbortController.abort();
    activeSession.abort();  // ← may not complete cleanup
};

If the tool call (e.g., exec waiting on a subprocess, or memory_search doing embeddings) doesn't properly respond to the abort signal, activeSession.abort() doesn't fully release the session. The session state remains "processing", and state.active in the lane is never decremented back to 0.

4. Lane queue has no timeout — tasks wait forever

// drainLane() only processes when active < maxConcurrent
// If active is stuck at 1, nothing ever drains
while (state.active < state.maxConcurrent && state.queue.length > 0) { ... }

Subsequent messages enqueue but never execute. This produces the lane wait exceeded diagnostic warnings seen in logs (up to 205 seconds observed).

Evidence from Logs

[agent/embedded] embedded run timeout: runId=... timeoutMs=600000
[diagnostic] lane wait exceeded: lane=session:agent:main:main waitedMs=205145 queueAhead=0
[diagnostic] lane wait exceeded: lane=session:agent:main:main waitedMs=97423 queueAhead=0

Pattern: timeout → lane wait grows unbounded → session never recovers.

Related: context at 167k/200k tokens contributes to slow API calls, but the core bug is the lane not being released after abort.

Suggested Fix

  1. Ensure state.active is decremented in all abort paths — add a finally block or abort handler in the lane task wrapper to guarantee cleanup:

    try {
        const result = await entry.task();
        entry.resolve(result);
    } catch (e) {
        entry.reject(e);
    } finally {
        state.active -= 1;  // always release the lane slot, even on abort
        pump();
    }
  2. Add a lane queue timeout — tasks waiting longer than a configurable threshold (e.g., 5 minutes) should be rejected rather than waiting forever.

  3. Per-tool-call timeout — in addition to the run-level timeout, each tool execution should have its own timeout to prevent a single slow tool from consuming the entire budget.
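Fix 3 could be sketched as follows (assumed wrapper, names hypothetical): race each tool execution against its own timer, independent of the run-level budget.

```javascript
// Hypothetical per-tool-call timeout: each tool execution gets its own
// deadline, so one slow tool cannot silently consume the whole run budget.
function withToolTimeout(toolFn, timeoutMs) {
    return async (...args) => {
        let timer;
        const deadline = new Promise((_, reject) => {
            timer = setTimeout(
                () => reject(new Error(`tool call exceeded ${timeoutMs}ms`)),
                timeoutMs
            );
        });
        try {
            return await Promise.race([toolFn(...args), deadline]);
        } finally {
            clearTimeout(timer);  // don't leave the timer holding the event loop
        }
    };
}
```

A rejected tool call then surfaces as an ordinary tool error the agent can report, instead of burning the remaining run budget.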

Environment

  • OpenClaw version: 2026.2.2-3
  • OS: macOS (Darwin 25.2.0)
  • API: Anthropic via proxy
  • Model: claude-opus-4-5
  • Config: maxConcurrent=4, contextPruning ttl=1h
