Description
Summary
When an embedded run times out during the post-reply compaction phase, the timeout handler calls `abortRun(true)`, but the abort signal never reaches `waitForCompactionRetry()` because that await is not wrapped in `abortable()`. As a result, the `finally` cleanup block never executes, permanently blocking the affected DM lane, leaking the session in the `processing` state, and leaving a zombie run in the active count.
Reproduction Steps
- Configure an agent with `compaction.mode: "safeguard"` and `timeoutSeconds: 300`
- Send a message via Telegram DM that triggers tool use (e.g., read + edit a file), so the agent takes ~45-60s to respond
- The agent completes its reply and starts post-reply compaction, which calls the OpenAI Batch API for embeddings
- If the Batch API is slow (stuck in `validating` → `in_progress` for 2+ minutes), the total run time (agent response + compaction) exceeds the 300s timeout
- The timeout fires, but `waitForCompactionRetry()` blocks forever
- All subsequent messages to that DM lane are permanently queued and never processed
- Only a gateway restart recovers the channel
Expected Behavior
When the embedded run timeout fires during compaction:
- The abort signal should propagate to the compaction wait
- `clearActiveEmbeddedRun()` should be called (session returns to `idle`)
- The lane task should complete (resolve or reject), unblocking the lane
- Subsequent messages should be processed normally
Actual Behavior
After timeout:
- `abortRun(true)` fires and logs `embedded run timeout` as WARNING
- But `waitForCompactionRetry()` is a bare `await` (not wrapped in `abortable()`), so the abort signal never interrupts it
- The `finally` block containing `clearActiveEmbeddedRun()` and `unsubscribe()` never executes
- Session state remains `processing` forever (no `session state: prev=processing new=idle` log)
- The run is never cleared (no `run cleared` log; `totalActive` keeps the zombie run)
- The lane task never completes (no `lane task done` log)
- The DM lane is permanently blocked: all new messages queue up but never get dequeued
Evidence from logs:
```
10:13:27 session state: sessionId=ae54138b prev=idle new=processing
10:13:27 run registered: totalActive=1
10:14:13 compaction start (agent already replied)
10:18:27 embedded run timeout (WARNING only)
❌ No "session state: prev=processing new=idle"
❌ No "run cleared"
❌ No "lane task done"
10:34:22 run registered: totalActive=2 ← zombie run still counted
```
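The failure mode can be demonstrated in isolation. The sketch below uses stand-in names (`laneTask`, a never-settling promise in place of `waitForCompactionRetry()`), not the actual OpenClaw code: a bare `await` on a promise that never settles simply never observes the `AbortSignal`, so the `finally` cleanup below it never runs.

```typescript
// Minimal sketch of the bug (stand-in names, not the real OpenClaw code):
// a bare `await` on a promise that only settles on compaction retry never
// observes the AbortSignal, so the cleanup in `finally` never executes.
async function laneTask(signal: AbortSignal): Promise<string> {
  const waitForCompactionRetry = () => new Promise<void>(() => {}); // never settles
  try {
    await waitForCompactionRetry(); // abort fires, but nothing rejects this await
    return "replied";
  } finally {
    // Unreachable while the await above is pending: the equivalent of
    // clearActiveEmbeddedRun() / unsubscribe() never runs.
  }
}

async function main(): Promise<void> {
  const ctrl = new AbortController();
  setTimeout(() => ctrl.abort(), 10); // stands in for the timeout calling abortRun(true)
  // Race the lane task against a deadline to show it never completes:
  const outcome = await Promise.race([
    laneTask(ctrl.signal).then(() => "lane task done"),
    new Promise<string>((r) => setTimeout(() => r("lane still blocked"), 50)),
  ]);
  console.log(outcome); // "lane still blocked"
}

main();
```

Even though the abort fires at 10ms, the race always resolves with `"lane still blocked"`: the abort has no way to reject the bare `await`.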
Root Cause
In src/agents/pi-embedded-runner/run/attempt.ts (bundled into loader-*.js, pi-embedded-*.js, reply-*.js, extensionAPI.js):
```ts
// Current code (line ~676 in the original attempt.ts):
try {
  await waitForCompactionRetry(); // ← NOT wrapped in abortable()
} catch (err) {
  if (isAbortError(err)) {
    if (!promptError) promptError = err;
  } else throw err;
}
```

The `abortable()` helper is already defined in the same scope and used for `activeSession.prompt()`. The fix is to wrap `waitForCompactionRetry()` the same way:
```ts
// Fixed:
try {
  await abortable(waitForCompactionRetry()); // ← now abort-aware
} catch (err) {
  if (isAbortError(err)) {
    if (!promptError) promptError = err;
  } else throw err;
}
```

This ensures:
- If abort has already fired, `abortable()` rejects immediately (signal already aborted)
- If abort fires during the wait, the abort listener rejects the wrapper promise
- Either way, the `finally` block runs, so `clearActiveEmbeddedRun()` and `unsubscribe()` execute and the session and lane are properly cleaned up
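For reference, a typical `abortable()` helper has the shape below. This is a sketch, not the actual helper from `attempt.ts`; judging by the one-argument call site, the real helper closes over the run's signal, whereas here it is passed explicitly so the example is self-contained.

```typescript
// Sketch of a typical abortable() helper (assumption: the real one in
// attempt.ts closes over the run's AbortSignal instead of taking a parameter).
function abortable<T>(promise: Promise<T>, signal: AbortSignal): Promise<T> {
  // If abort already fired, reject immediately instead of awaiting.
  if (signal.aborted) {
    return Promise.reject(new DOMException("Aborted", "AbortError"));
  }
  return new Promise<T>((resolve, reject) => {
    const onAbort = () => reject(new DOMException("Aborted", "AbortError"));
    signal.addEventListener("abort", onAbort, { once: true });
    promise.then(
      (value) => { signal.removeEventListener("abort", onAbort); resolve(value); },
      (err) => { signal.removeEventListener("abort", onAbort); reject(err); },
    );
  });
}
```

With the signal in closure, the call site reduces to `abortable(waitForCompactionRetry())`, and both abort paths (already-aborted and abort-during-wait) reject with an `AbortError` that `isAbortError()` can recognize.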
Environment
- OpenClaw version: reproduced on 2026.2.1, confirmed still present in 2026.2.9 (all 4 bundle files)
- OS: Ubuntu 24.04 on Azure VM (Linux 6.14.0-1017-azure)
- Installation: `npm install -g openclaw`
- Channel: Telegram DM
- Agent model: `github-copilot/claude-sonnet-4.5` (but the bug is model-independent; any model can trigger it)
- Compaction mode: `safeguard`
Workaround
Restart the gateway to clear the stuck in-memory state:
```
systemctl --user restart openclaw-gateway
```

Alternatively, manually patch `await waitForCompactionRetry()` → `await abortable(waitForCompactionRetry())` in all 4 dist bundle files.