[Bug]: Embedded run timeout fails to clean up session/lane state when compaction is in-flight (waitForCompactionRetry not abort-aware) #13341

@smartleos

Summary

When an embedded run times out during the post-reply compaction phase, the timeout handler calls abortRun(true) but the abort signal never reaches waitForCompactionRetry() because it is not wrapped in abortable(). This causes the finally cleanup block to never execute, permanently blocking the affected DM lane, leaking the session in processing state, and leaving a zombie run in the active count.

Reproduction Steps

  1. Configure an agent with compaction.mode: "safeguard" and timeoutSeconds: 300
  2. Send a message via Telegram DM that triggers tool use (e.g., read + edit a file), so the agent takes ~45-60s to respond
  3. The agent completes its reply and starts post-reply compaction, which calls OpenAI Batch API for embeddings
  4. If the Batch API is slow (stuck in validating → in_progress for 2+ minutes), the total run time (agent response + compaction) exceeds the 300s timeout
  5. The timeout fires, but waitForCompactionRetry() blocks forever
  6. All subsequent messages to that DM lane are permanently queued and never processed
  7. Only a gateway restart recovers the channel

Expected Behavior

When the embedded run timeout fires during compaction:

  1. The abort signal should propagate to the compaction wait
  2. clearActiveEmbeddedRun() should be called (session returns to idle)
  3. The lane task should complete (resolve or reject), unblocking the lane
  4. Subsequent messages should be processed normally

Actual Behavior

After timeout:

  • abortRun(true) fires and logs embedded run timeout as WARNING
  • But waitForCompactionRetry() is a bare await (not wrapped in abortable()) — the abort signal never interrupts it
  • The finally block containing clearActiveEmbeddedRun() and unsubscribe() never executes
  • Session state remains processing forever (no session state: prev=processing new=idle log)
  • Run is never cleared (no run cleared log, totalActive keeps the zombie run)
  • Lane task never completes (no lane task done log)
  • The DM lane is permanently blocked — all new messages queue up but never get dequeued

Evidence from logs:

10:13:27 session state: sessionId=ae54138b prev=idle new=processing
10:13:27 run registered: totalActive=1
10:14:13 compaction start (agent already replied)
10:18:27 embedded run timeout (WARNING only)
         ❌ No "session state: prev=processing new=idle"
         ❌ No "run cleared"
         ❌ No "lane task done"
10:34:22 run registered: totalActive=2  ← zombie run still counted
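
The hang can be reproduced in isolation. The following is a minimal sketch, not the actual OpenClaw code: stuckCompactionWait() and runAttemptWithBareAwait() are hypothetical stand-ins for waitForCompactionRetry() and the attempt body, and the cleanedUp flag stands in for clearActiveEmbeddedRun() + unsubscribe(). It shows that aborting a signal does nothing to a bare await that never observes the signal, so the finally block is never reached.

```typescript
// Hypothetical stand-in for waitForCompactionRetry(): a promise that never
// settles, mimicking a compaction wait stuck behind a slow Batch API call.
function stuckCompactionWait(): Promise<void> {
  return new Promise<void>(() => {});
}

interface RunState {
  cleanedUp: boolean;
}

// Mirrors the shape of the reported code path: a bare await, then cleanup in
// finally. The signal is accepted but deliberately never consulted.
async function runAttemptWithBareAwait(
  state: RunState,
  _signal: AbortSignal,
): Promise<void> {
  try {
    await stuckCompactionWait(); // nothing here observes the abort signal
  } finally {
    // Stands in for clearActiveEmbeddedRun() + unsubscribe().
    state.cleanedUp = true;
  }
}

// Starts the run, fires the "timeout" abort, waits briefly, and reports
// whether the cleanup in finally ever ran.
async function demo(): Promise<boolean> {
  const state: RunState = { cleanedUp: false };
  const controller = new AbortController();
  void runAttemptWithBareAwait(state, controller.signal);
  controller.abort();
  await new Promise<void>((resolve) => setTimeout(resolve, 50));
  return state.cleanedUp;
}
```

Under these assumptions demo() resolves with the cleanup flag still unset, which matches the missing "run cleared" and "lane task done" log lines above.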

Root Cause

In src/agents/pi-embedded-runner/run/attempt.ts (bundled into loader-*.js, pi-embedded-*.js, reply-*.js, extensionAPI.js):

// Current code (line ~676 in the original attempt.ts):
try {
    await waitForCompactionRetry();        // ← NOT wrapped in abortable()
} catch (err) {
    if (isAbortError(err)) {
        if (!promptError) promptError = err;
    } else throw err;
}

The abortable() helper is already defined in the same scope and used for activeSession.prompt(). The fix is to wrap waitForCompactionRetry() the same way:

// Fixed:
try {
    await abortable(waitForCompactionRetry());  // ← now abort-aware
} catch (err) {
    if (isAbortError(err)) {
        if (!promptError) promptError = err;
    } else throw err;
}

This ensures:

  • If abort already fired → abortable() immediately rejects (signal already aborted)
  • If abort fires during wait → the abort listener rejects the wrapper promise
  • Either way, the finally block runs → clearActiveEmbeddedRun() + unsubscribe() execute → session and lane are properly cleaned up
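
Both cases can be sketched with a small helper. This is an illustrative version only: the real abortable() in attempt.ts closes over the run's abort signal and may be implemented differently, whereas this sketch takes the signal as an explicit parameter. makeAbortError(), isAbortError() as written here, and runAttempt() are assumptions made for the example.

```typescript
// Hypothetical helpers; the real isAbortError() may use a different check.
function makeAbortError(): Error {
  const err = new Error("aborted");
  err.name = "AbortError";
  return err;
}

function isAbortError(err: unknown): boolean {
  return err instanceof Error && err.name === "AbortError";
}

// Sketch of an abort-aware wrapper: settles with the wrapped promise, or
// rejects with an AbortError as soon as the signal is (or becomes) aborted.
function abortable<T>(signal: AbortSignal, promise: Promise<T>): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const onAbort = () => reject(makeAbortError());
    const cleanup = () => signal.removeEventListener("abort", onAbort);
    // Always observe the wrapped promise so a late rejection is never unhandled.
    promise.then(
      (value) => { cleanup(); resolve(value); },
      (err) => { cleanup(); reject(err); },
    );
    if (signal.aborted) {
      onAbort(); // abort already fired: reject immediately
    } else {
      signal.addEventListener("abort", onAbort, { once: true }); // abort during wait
    }
  });
}

// Same shape as the fixed code path: the wait is wrapped, abort errors are
// swallowed, and finally performs the cleanup.
async function runAttempt(
  signal: AbortSignal,
  wait: () => Promise<void>,
): Promise<{ cleanedUp: boolean; aborted: boolean }> {
  const result = { cleanedUp: false, aborted: false };
  try {
    await abortable(signal, wait());
  } catch (err) {
    if (!isAbortError(err)) throw err;
    result.aborted = true;
  } finally {
    result.cleanedUp = true; // stands in for clearActiveEmbeddedRun() + unsubscribe()
  }
  return result;
}
```

With a wait that never settles, aborting the signal before or during runAttempt() should let the function return with both flags set, instead of hanging.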

Environment

  • OpenClaw version: Reproduced on 2026.2.1, confirmed still present in 2026.2.9 (all 4 bundle files)
  • OS: Ubuntu 24.04 on Azure VM (Linux 6.14.0-1017-azure)
  • Installation: npm install -g openclaw
  • Channel: Telegram DM
  • Agent model: github-copilot/claude-sonnet-4.5 (but model-independent — any model can trigger this)
  • Compaction mode: safeguard

Workaround

Restart the gateway to clear the stuck in-memory state:

systemctl --user restart openclaw-gateway

Alternatively, manually patch await waitForCompactionRetry() → await abortable(waitForCompactionRetry()) in all 4 dist bundle files.
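
For anyone scripting the manual patch, the text substitution itself could look like the sketch below. patchBundleSource() is a hypothetical helper, not part of OpenClaw. It assumes the bundles contain the call exactly as written in the source snippet above, and it leaves locating, reading, and writing the bundle files to the caller, since their install paths are not given here.

```typescript
const BARE_AWAIT = "await waitForCompactionRetry()";
const WRAPPED_AWAIT = "await abortable(waitForCompactionRetry())";

// Replaces every bare await of waitForCompactionRetry() with the wrapped form.
// Running it again on already patched text changes nothing, because the
// wrapped form does not contain the bare form as a substring.
function patchBundleSource(source: string): { patched: string; replacements: number } {
  const parts = source.split(BARE_AWAIT);
  return {
    patched: parts.join(WRAPPED_AWAIT),
    replacements: parts.length - 1,
  };
}
```

A replacements count of 0 for a given bundle means either the file was already patched or the call appears in a different form there, in which case it needs to be inspected by hand.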
