[Bug]: Embedded agent session silently hangs after auto-compaction with no error logging or recovery #89051

@ArthurusDent

Description

Bug type

Crash (process/app exits or hangs)

Beta release blocker

No

Summary

After an embedded Python agent triggers auto-compaction due to context overflow, the post-compaction retry enters a silent failure state: no subsequent API call is made, no tool execution occurs, no error is logged, and the session remains stuck indefinitely until manual re-triggering. The gateway log records "auto-compaction succeeded; retrying prompt" but nothing follows — not even an incomplete turn or failed process poll.

Steps to reproduce

  1. Start a Python script that runs as an embedded OpenClaw agent via an agentTurn cron job, with a 72k context window and a reserveTokensFloor of 20000 tokens.
  2. The script performs many iterative tool calls (exec, process.poll, file I/O) over multiple turns, accumulating tool output into the session history.
  3. When estimated prompt tokens exceed the effective budget (logged as promptBudgetBeforeReserve=51680, slightly below contextWindow - reserveTokensFloor = 52000), auto-compaction triggers and truncates old messages.
  4. After compaction completes successfully (logged: auto-compaction succeeded; retrying prompt), the agent resumes normal operation but re-accumulates tool output rapidly.
  5. Within ~14 minutes, a subprocess poll returns exit code 0 with zero output, the agent retries on a different subprocess which also hangs ("still running"), and the next model API response is truncated mid-generation (incomplete thinking block ending with </parameter>).
  6. From that point forward: no JSONL entries, no gateway logs, no tool calls, no messages sent — the session remains in this state until manually re-triggered by the user.

In the observed incident, the Python script was a RemNote-to-Obsidian migration tool processing children of root nodes; the subprocesses were executing python3 migrate_v4.py ... commands against a 456 MB database dump. The hang appears reproducible whenever an embedded agent with a moderate context budget accumulates tool output rapidly and then encounters at least one subprocess that returns empty output or hangs.

Expected behavior

  • After auto-compaction succeeds and the retry loop begins, if a subsequent API response is truncated mid-generation, the embedded-agent-runner should detect the incomplete turn, log the error with full context (including the last valid JSONL entry), and attempt recovery or set a clear failure state.
  • Session state should transition through clearly identifiable stages (active → compacting → retrying → success/failed) with mandatory logging at each transition.
  • The session file JSONL should never contain structurally incomplete message entries — either validate before writing or discard partial responses.
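
The staged transitions requested above could be sketched as a small state machine that logs every change and rejects undeclared transitions. This is only an illustration in Python, not OpenClaw's implementation: the state names come from the list above, while `SessionState`, `ALLOWED`, `SessionLifecycle`, and the transition table itself are assumptions.

```python
import logging
from enum import Enum

log = logging.getLogger("session-lifecycle")  # hypothetical logger name


class SessionState(Enum):
    ACTIVE = "active"
    COMPACTING = "compacting"
    RETRYING = "retrying"
    SUCCESS = "success"
    FAILED = "failed"


# Allowed transitions: an assumption for illustration, not taken from OpenClaw.
ALLOWED = {
    SessionState.ACTIVE: {SessionState.COMPACTING, SessionState.FAILED},
    SessionState.COMPACTING: {SessionState.RETRYING, SessionState.FAILED},
    SessionState.RETRYING: {SessionState.SUCCESS, SessionState.FAILED},
    SessionState.SUCCESS: {SessionState.ACTIVE},
    SessionState.FAILED: set(),
}


class SessionLifecycle:
    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self.state = SessionState.ACTIVE
        self.history = [SessionState.ACTIVE]

    def transition(self, new_state: SessionState, reason: str) -> None:
        """Move to new_state and log it; raise on an undeclared transition."""
        if new_state not in ALLOWED[self.state]:
            log.error(
                "session %s: illegal transition %s -> %s (%s)",
                self.session_id, self.state.value, new_state.value, reason,
            )
            raise ValueError(f"illegal transition {self.state.value} -> {new_state.value}")
        log.info(
            "session %s: %s -> %s (%s)",
            self.session_id, self.state.value, new_state.value, reason,
        )
        self.state = new_state
        self.history.append(new_state)
```

With a structure like this, a truncated response after compaction would have to end in an explicit `failed` transition with a log line rather than in silence.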

Actual behavior

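  1. Gateway logs show "auto-compaction succeeded; retrying prompt" but no subsequent log lines at all until the next user message: about 16 minutes of silence, covering both the ~14 minutes in which the agent was still writing to the session file and the stuck period after it.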
  2. The last JSONL entry is an incomplete thinking block (</parameter> without matching opening tag, no text content, no tool call, no stopReason field) — written despite being structurally invalid.
  3. Session file timestamps show entries up to the truncation point but nothing after; no new model API calls appear in gateway logs.
  4. Matrix messages that had been queued for delivery are flushed in a burst when the session gets stuck.
  5. The session remains stuck until explicitly revived — there is no automatic recovery, no failure notification, and no clear error indication in the dashboard.
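
A validate-before-write check of the kind requested under "Expected behavior" might look like the sketch below. The entry shape is an assumption: only `stopReason` is named in this report, and the `content`, `type`, `text`, and `toolCall` keys are hypothetical stand-ins for however the session file actually encodes text and tool-call blocks.

```python
import json
from typing import Any, Dict


def is_complete_assistant_entry(entry: Dict[str, Any]) -> bool:
    """True if the entry has a stopReason plus some text or a tool call.

    Key names other than stopReason are hypothetical.
    """
    if not entry.get("stopReason"):
        return False
    for block in entry.get("content") or []:
        if not isinstance(block, dict):
            continue
        if block.get("type") == "text" and str(block.get("text", "")).strip():
            return True
        if block.get("type") == "toolCall":
            return True
    return False


def append_if_complete(path: str, entry: Dict[str, Any]) -> bool:
    """Append the entry as one JSON line only if it passes validation."""
    if not is_complete_assistant_entry(entry):
        return False  # caller should log and mark the turn as failed
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return True
```

An entry like the one observed here (thinking block only, no `stopReason`) would be rejected by this check instead of becoming the last line of the session file.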

OpenClaw version

2026.5.28

Operating system

Ubuntu Server 24.04 LTS (kernel 6.8.0-124-generic x64)

Install method

git clone — updated via cd ~/openclaw && git pull && git fetch origin --tags && git checkout <tag> && pnpm build && openclaw gateway start/stop

Model

llama/Qwen3.6-35B-A3B-UD-MTP-Q4_K_M

Provider / routing chain

local (Ollama / llama.cpp backend — model served locally, no external provider)

Additional provider/model setup details

The embedded agent runs as an agentTurn cron job (isolated session target). The gateway uses default embedded agent configuration with reserveTokensFloor: 20000, giving an effective prompt budget of approximately 51680 tokens for a 72k context window. Auto-compaction is configured to trigger on first overflow attempt (attempt 1/3).
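
The budget arithmetic can be checked with a few lines of Python. The numbers are taken from the gateway log in this report; how the gateway derives 51680 from the 72000 window and the 20000 reserve is not stated anywhere here, so the 320-token difference is left unexplained, and `exceeds_prompt_budget` is a hypothetical helper name.

```python
CONTEXT_WINDOW = 72_000
RESERVE_TOKENS_FLOOR = 20_000
LOGGED_PROMPT_BUDGET = 51_680  # promptBudgetBeforeReserve as logged


def exceeds_prompt_budget(estimated_prompt_tokens: int, prompt_budget: int) -> bool:
    """Overflow check as described in this report: estimate strictly above budget."""
    return estimated_prompt_tokens > prompt_budget


naive_budget = CONTEXT_WINDOW - RESERVE_TOKENS_FLOOR   # 52000
unexplained_margin = naive_budget - LOGGED_PROMPT_BUDGET  # 320

before_compaction = 68_612  # estimatedPromptTokens at overflow
after_compaction = 28_985   # estimatedPromptTokens after compaction

print(exceeds_prompt_budget(before_compaction, LOGGED_PROMPT_BUDGET))  # True
print(exceeds_prompt_budget(after_compaction, LOGGED_PROMPT_BUDGET))   # False
```

The roughly 41k tokens reached by the time of the failure is still below the logged budget, so by these figures the hang did not coincide with a second overflow.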

Logs, screenshots, and evidence

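Gateway journal entries for `openclaw-gateway` on 2026-05-31 (timestamps are CEST despite the `Z` suffix; compare the `startAtTs` value in the last line, which is two hours earlier and in UTC):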


17:31:59.648Z [SESSION] Context overflow detected - session <room_id>... estimatedPromptTokens=68612 > promptBudgetBeforeReserve=51680 ... attempt 1/3 contextWindow=72000 reserveTokensFloor=20000
17:35:21.400Z [CONTEXT COMPACTOR] Context compaction completed (attempt 1/3) - truncated 48 tool results... remainingMessages=56 estimatedPromptTokens=28985 (reduced by 60%) ... contextWindow=72000 reserveTokensFloor=20000
17:35:21.424Z [AGENT_RUNNER] Auto-compaction succeeded; retrying prompt (attempts left: 2) ... nextAttemptPromptOverride set, suppressNextUserMessagePersistence=true
17:35:21.438Z [POST_COMPACTION_GUARD] Armed for post-compaction loop detection - windowSize=3 maxAttempts=3 maxFailures=5 ... startAtTs=2026-05-31T15:35:21.437Z


**No log lines appear between 17:35:21 and the next user message at 17:51:41.** This is a silent window of roughly 16 minutes in the gateway log, including the ~14 minutes (17:35–17:49) during which the session JSONL shows the agent still writing messages.

Session JSONL file analysis (`0745bbf6-1873-41d5-9cf5-e372112862a8.jsonl`):
- 200 messages written between 17:35–17:49 CEST (post-compaction, pre-failure), approximately 14 messages/minute, with the token count growing from ~28k to ~41k.
- Last entry at 17:49:35 CEST is an incomplete thinking block with no `stopReason`, no text content, and no tool call — the response was truncated mid-generation.
- No entries exist after this timestamp in the session file.
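
For reference, the intervals implied by the timestamps above can be computed directly. This snippet uses only the three times quoted in this report (all CEST, same day); the variable names are mine.

```python
from datetime import datetime, timedelta


def at(hms: str) -> datetime:
    """Parse a wall-clock time on the day of the incident."""
    return datetime.strptime(f"2026-05-31 {hms}", "%Y-%m-%d %H:%M:%S")


compaction_done = at("17:35:21")    # last gateway log line
last_jsonl_entry = at("17:49:35")   # truncated thinking block
next_user_message = at("17:51:41")  # manual re-trigger

active_after_compaction = last_jsonl_entry - compaction_done  # 0:14:14
stuck_until_user = next_user_message - last_jsonl_entry       # 0:02:06
gateway_log_gap = next_user_message - compaction_done         # 0:16:20

print(active_after_compaction, stuck_until_user, gateway_log_gap)
```

So the agent kept working for about 14 minutes after compaction without producing any gateway log lines, then sat idle for about 2 minutes until the next user message arrived.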

Session key: `<agent>:<main>:<chat_channel>:<room_id>`

Relevant source code locations:
- `src/agents/embedded-agent-runner/run.ts` (post-compaction retry loop, lines ~2200–2250)
- `src/agents/embedded-agent-runner/post-compaction-loop-guard.ts` (loop detection — only checks content repetition, does not detect silent API failures)

Impact and severity

Affected users/systems: All embedded agent sessions using auto-compaction with moderate context budgets where tool output accumulates rapidly.
Severity: High — sessions become unrecoverable without manual user intervention. Any work-in-progress is lost (the 200 messages between compaction and failure are not persisted to the session state in a recoverable way).
Frequency: Likely occurs whenever an embedded agent performs rapid iterative tool calls after compaction and encounters at least one subprocess hang or API timeout. The combination of these conditions appears non-trivial but plausible for any long-running workflow (code execution, database queries, file processing).
Consequence: Complete loss of agent state between compaction completion and next user message; no error visibility; no automatic recovery; requires manual session restart.

Additional information

The PostCompactionLoopGuard (windowSize=3, maxFailures=5) does not detect this failure because the agent was making progress across different turns — generating new messages each time with varying content. It never entered a content repetition pattern. The guard only detects repetitive output loops, not silent API failures or subprocess hangs.
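
A guard that complements repetition detection would watch for the absence of activity instead. The sketch below is a generic idle watchdog in Python, not a patch for `post-compaction-loop-guard.ts`; `StallWatchdog` and its methods are hypothetical names, and the 120-second timeout in the usage note is an arbitrary example value.

```python
import time
from typing import Callable


class StallWatchdog:
    """Flags a session as stalled when nothing has been recorded for timeout_s."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the watchdog can be tested without sleeping
        self._last_activity = clock()

    def record_activity(self) -> None:
        """Call on every API chunk, tool result, or persisted message."""
        self._last_activity = self._clock()

    def idle_seconds(self) -> float:
        return self._clock() - self._last_activity

    def is_stalled(self) -> bool:
        return self.idle_seconds() > self.timeout_s
```

A runner could create `StallWatchdog(120.0)` when the post-compaction retry starts, call `record_activity()` on each event, and poll `is_stalled()` from a timer so that a silent session is logged and failed rather than left waiting.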

The MID_TURN_PRECHECK_CONTINUATION_PROMPT mechanism (setting nextAttemptPromptOverride) initiates the retry but provides no error handling path for when that retry itself fails without logging.
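
One possible shape for the missing error path is a retry wrapper that bounds each attempt with a deadline and always ends in either a result or a logged exception. This is a Python sketch of the general technique, not the code in `run.ts`; `retry_with_deadline`, its parameters, and the logger name are hypothetical.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

log = logging.getLogger("post-compaction-retry")  # hypothetical logger name


async def retry_with_deadline(
    attempt: Callable[[], Awaitable[str]],
    *,
    timeout_s: float,
    attempts_left: int,
) -> str:
    """Run attempt() up to attempts_left times, each bounded by timeout_s.

    Every failure is logged; if all attempts fail, RuntimeError is raised
    so the caller can move the session into an explicit failed state.
    """
    last_error: Optional[BaseException] = None
    for n in range(1, attempts_left + 1):
        try:
            return await asyncio.wait_for(attempt(), timeout=timeout_s)
        except asyncio.TimeoutError as exc:
            last_error = exc
            log.error("retry %d/%d timed out after %.1fs", n, attempts_left, timeout_s)
        except Exception as exc:
            last_error = exc
            log.exception("retry %d/%d failed", n, attempts_left)
    raise RuntimeError(
        f"post-compaction retry failed after {attempts_left} attempts"
    ) from last_error
```

The point of the design is that there is no exit from the function that is both unsuccessful and unlogged, which is the property the current retry path appears to lack.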

Last known good state: auto-compaction completes at 17:35:21 with clear log confirmation. First known bad state: truncated response at 17:49:35 followed by indefinite silence with zero logging. The gap between these points (~14 minutes) suggests the failure is not immediate but accumulates over multiple turns, making it difficult to correlate with a specific gateway event.

Related issues

These were identified during pre-submission review of the OpenClaw issue tracker.

| # | Title | Similarity | Why it's related | Key difference |
|---|---|---|---|---|
| #70744 | Telegram direct session can become unrecoverable after context overflow and auto-compaction hangs | Very high: same post-compaction freeze, requires manual session reset | Nearly identical symptom: context overflow → compaction → session becomes non-responsive with no recovery | In #70744 compaction itself failed silently; in our case compaction succeeded, the agent worked for 14 min, then hung |
| #84777 | Compaction causes Pi runtime deadlock — agent freezes across all channels after summary generation | High: post-compaction silent failure, requires session rebuild (/new) | Same pattern: compaction succeeds → agent goes silent → no error logged → only /new fixes it | Different trigger: transcript write failure during in-place rewrite; we succeed through compaction and work for 14 min |
| #87692 | GitHub Copilot anthropic-messages transport silently hangs ~365s in isolated cron sessions | High on logging side: exact same "silent abort without error log" pattern, watchdog fires after timeout | The core mystery: how does a run disappear from logs entirely? #87692 documents the identical observation (stream opens, no chunks, abort at ~365s, cron reports status: ok) | Different mechanism: stream-level watchdog timeout vs. our API response truncation |
| #87711 | Telegram routing/footer issue with empty assistant response → 12 min silence | Medium: post-turn silent failure | Similar symptom of agent going quiet without explanation, user sees "— out" footer after delay | Different phase (normal turn, not post-compaction) and different transport |
| #86567 | Feldbericht: agent executes tools but does not send final answer + context grows too fast → agent hangs | Medium: general pattern match without specific timeline | Describes the broader "context pressure → slow down/hang" family of bugs without our post-compaction detail | No structured evidence or timeline; general community report |
| #79350 | Ollama model stalls after tool results in agent runs while direct chat works | Low-Medium: post-tool-call hang with timeouts and aborts | Same general symptom (agent hangs mid-workflow) but has explicit error logs and timeout behaviour | Opposite of our issue: #79350 has verbose errors; ours is silent |

Labels

- `P1`: High-priority user-facing bug, regression, or broken workflow.
- `bug`: Something isn't working
- `bug:crash`: Process/app exits unexpectedly or hangs
- `clawsweeper:needs-live-repro`: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
- `clawsweeper:needs-maintainer-review`: ClawSweeper marked this issue as needing maintainer review before automation.
- `clawsweeper:needs-product-decision`: ClawSweeper marked this issue as needing a product or behavior decision.
- `clawsweeper:no-new-fix-pr`: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
- `impact:crash-loop`: Crash, hang, restart loop, or process-level availability failure.
- `impact:message-loss`: Channel message delivery can be lost, duplicated, or misrouted.
- `impact:session-state`: Session, memory, transcript, context, or agent state can drift or corrupt.
- `issue-rating: 🐚 platinum hermit`: Good issue quality with a plausible reproduction path needing some confirmation.
