## Description
After an embedded run timeout fires during a tool call chain, the session lane remains permanently blocked. The session becomes completely unresponsive — no new messages are processed. The only recovery is creating a new session.
## Root Cause Analysis

Traced through the source code in `extensionAPI.js`:
### 1. Lane concurrency is 1 per session

```js
const created = {
  lane,
  queue: [],
  active: 0,
  maxConcurrent: 1, // only one task at a time
  draining: false
};
```

### 2. The 600s timeout covers the entire run, not individual tool calls
The timeout wraps the full agent run (all API calls + all tool executions combined). With large context (~167k tokens) and a high-latency API proxy, a single API round-trip can take 3-4 minutes. Two or three tool calls easily exceed 600 seconds total.
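The arithmetic can be made concrete. This is a back-of-envelope sketch using the latency figures reported above; the constant names are illustrative, not code from `extensionAPI.js`:

```javascript
// Why a single run-level budget is exhausted by an ordinary tool-call chain.
const RUN_TIMEOUT_MS = 600_000;          // run-level timeout (600 s)
const API_ROUND_TRIP_MS = 3.5 * 60_000;  // ~3-4 min per call at ~167k tokens
const TOOL_CALLS = 3;                    // a modest tool-call chain

// Every tool call costs at least one API round-trip; even with zero tool
// execution time, three round-trips exceed the entire run budget.
const apiTimeOnly = TOOL_CALLS * API_ROUND_TRIP_MS;
console.log(apiTimeOnly, apiTimeOnly > RUN_TIMEOUT_MS); // 630000 true
```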
### 3. Abort doesn't fully clean up session state

When the timeout fires:

```js
const abortRun = (isTimeout) => {
  aborted = true;
  runAbortController.abort();
  activeSession.abort(); // ← may not complete cleanup
};
```

If the tool call (e.g., `exec` waiting on a subprocess, or `memory_search` doing embeddings) doesn't properly respond to the abort signal, `activeSession.abort()` doesn't fully release the session. The session state remains `"processing"`, and `state.active` in the lane is never decremented back to 0.
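A minimal illustration of why calling `abort()` alone isn't enough (the tool shapes here are hypothetical, not the real tools): only a task that actually subscribes to the run's `AbortSignal` releases anything when the controller aborts.

```javascript
const released = { listening: false, ignoring: false };

function listeningTool(signal) {
  // Subscribes to the signal and releases its resources on abort.
  signal.addEventListener("abort", () => { released.listening = true; });
}

function ignoringTool(_signal) {
  // Models exec blocked on a subprocess: nothing here observes the
  // signal, so abort() cannot make it release anything.
}

const runAbortController = new AbortController();
listeningTool(runAbortController.signal);
ignoringTool(runAbortController.signal);

runAbortController.abort(); // 'abort' listeners fire synchronously
console.log(released); // { listening: true, ignoring: false }
```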
### 4. Lane queue has no timeout: tasks wait forever

```js
// drainLane() only processes when active < maxConcurrent
// If active is stuck at 1, nothing ever drains
while (state.active < state.maxConcurrent && state.queue.length > 0) { ... }
```

Subsequent messages enqueue but never execute. This produces the `lane wait exceeded` diagnostic warnings seen in the logs (waits of up to 205 seconds observed).
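The stuck state can be reproduced with a simplified model of the lane. The names mirror the snippet above, but this is a sketch, not the actual `extensionAPI.js` implementation:

```javascript
function makeLane() {
  return { queue: [], active: 0, maxConcurrent: 1 };
}

function enqueue(state, task) {
  state.queue.push(task);
  drainLane(state);
}

function drainLane(state) {
  // Only runs tasks while a slot is free; if `active` is stuck at 1,
  // the loop body never executes and the queue grows forever.
  while (state.active < state.maxConcurrent && state.queue.length > 0) {
    const task = state.queue.shift();
    state.active += 1;
    task(); // the completion path is responsible for decrementing active
  }
}

// Simulate the bug: a task whose abort path never releases its slot.
const lane = makeLane();
enqueue(lane, () => { /* aborted mid-run; never decrements active */ });
enqueue(lane, () => console.log("never runs"));
console.log(lane.active, lane.queue.length); // 1 1  (slot held, task stranded)
```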
## Evidence from Logs

```
[agent/embedded] embedded run timeout: runId=... timeoutMs=600000
[diagnostic] lane wait exceeded: lane=session:agent:main:main waitedMs=205145 queueAhead=0
[diagnostic] lane wait exceeded: lane=session:agent:main:main waitedMs=97423 queueAhead=0
```
Pattern: timeout → lane wait grows unbounded → session never recovers.
Related: context at 167k/200k tokens contributes to slow API calls, but the core bug is the lane not being released after abort.
## Suggested Fix

- **Ensure `state.active` is decremented in all abort paths.** Add a `finally` block or abort handler in the lane task wrapper to guarantee cleanup:

  ```js
  try {
    const result = await entry.task();
  } catch (e) {
    // handle error
  } finally {
    state.active -= 1; // always release the lane slot
    pump();
  }
  ```
- **Add a lane queue timeout.** Tasks waiting longer than a configurable threshold (e.g., 5 minutes) should be rejected rather than waiting forever.
- **Per-tool-call timeout.** In addition to the run-level timeout, each tool execution should have its own timeout to prevent a single slow tool from consuming the entire budget.
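A possible shape for the queue timeout, sketched against the same simplified lane model (the threshold name and the entry fields are hypothetical): entries record when they were enqueued, and the drain pass rejects anything that has waited past the threshold instead of running it.

```javascript
const LANE_WAIT_TIMEOUT_MS = 5 * 60_000; // proposed 5-minute threshold

function drainLane(state, now = Date.now()) {
  // Evict stale entries first so a stuck lane eventually fails fast
  // instead of accumulating work that will never run.
  state.queue = state.queue.filter((entry) => {
    if (now - entry.enqueuedAt > LANE_WAIT_TIMEOUT_MS) {
      entry.reject(new Error("lane wait exceeded"));
      return false;
    }
    return true;
  });
  while (state.active < state.maxConcurrent && state.queue.length > 0) {
    const entry = state.queue.shift();
    state.active += 1;
    entry.run();
  }
}

// With active stuck at 1, a stale entry is rejected rather than waiting forever.
const state = {
  active: 1,
  maxConcurrent: 1,
  queue: [{ enqueuedAt: 0, reject: (e) => console.log(e.message), run: () => {} }],
};
drainLane(state, 301_000); // 301 s after enqueue
console.log(state.queue.length); // 0: the stale entry was rejected
```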
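One way to sketch the per-tool-call timeout (the helper name and the 120 s default are assumptions, not existing code): race each tool execution against its own deadline so a single slow tool cannot consume the whole 600 s run budget.

```javascript
const TOOL_CALL_TIMEOUT_MS = 120_000; // per-tool budget, separate from the run budget

function withToolTimeout(runTool, timeoutMs = TOOL_CALL_TIMEOUT_MS) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`tool call timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
    // Settle with the tool's result and cancel the deadline either way.
    runTool().then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}

// Usage sketch: a tool that never settles is cut off by its own deadline.
withToolTimeout(() => new Promise(() => {}), 50)
  .catch((err) => console.log(err.message)); // "tool call timed out after 50ms"
```

A rejection here would surface as a tool error inside the run, letting the agent loop continue or abort cleanly instead of silently burning the remaining budget.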
## Environment
- OpenClaw version: 2026.2.2-3
- OS: macOS (Darwin 25.2.0)
- API: Anthropic via proxy
- Model: claude-opus-4-5
- Config: `maxConcurrent=4`, `contextPruning ttl=1h`
## Related Issues
- #5865: Crash: unhandled AbortError from `void activeSession.abort()` on embedded run timeout (closed; partially fixed the crash but not the lane blocking)
- #3092: Session lock timeout causes channel handler failures during long operations (related symptom, different root cause)