[Bug]: Embedded agent session silently hangs after auto-compaction with no error logging or recovery

### Bug type

Crash (process/app exits or hangs)

### Beta release blocker

No

### Summary

After an embedded Python agent triggers auto-compaction due to context overflow, the gateway logs "auto-compaction succeeded; retrying prompt" and then nothing else. The session file shows the agent continuing for about 14 minutes, after which a truncated model response leaves the session in a silent failure state: no further API call is made, no tool execution occurs, no error is logged, and not even an incomplete turn or failed process poll is recorded. The session remains stuck indefinitely until it is manually re-triggered.

### Steps to reproduce

1. Start a Python script that runs as an embedded OpenClaw agent via an `agentTurn` cron job with a 72k context window and a `reserveTokensFloor` of 20000 tokens.
2. The script performs many iterative tool calls (exec, process.poll, file I/O) over multiple turns, accumulating tool output into the session history.
3. When estimated prompt tokens exceed the effective budget (nominally `contextWindow - reserveTokensFloor`; the gateway logged `promptBudgetBeforeReserve=51680`), auto-compaction triggers and truncates old messages.
4. After compaction completes successfully (logged: `auto-compaction succeeded; retrying prompt`), the agent resumes normal operation but re-accumulates tool output rapidly.
5. Within ~14 minutes, a subprocess poll returns `exit code 0` with zero output, the agent retries with a different subprocess that hangs ("still running"), and the next model API response is truncated mid-generation (an incomplete thinking block ending with `</parameter>`).
6. From that point forward: no JSONL entries, no gateway logs, no tool calls, no messages sent — the session remains in this state until manually re-triggered by the user.

In the observed incident, the Python script was a RemNote-to-Obsidian migration tool processing children of root nodes; the subprocesses were executing `python3 migrate_v4.py ...` commands against a 456 MB database dump. The hang is expected to recur whenever an embedded agent with a moderate context budget accumulates tool output rapidly and then encounters at least one subprocess that returns empty output or hangs.

### Expected behavior

- After auto-compaction succeeds and the retry loop begins, if a subsequent API response is truncated mid-generation, the `embedded-agent-runner` should detect the incomplete turn, log the error with full context (including the last valid JSONL entry), and attempt recovery or set a clear failure state.
- Session state should transition through clearly identifiable stages (`active` → `compacting` → `retrying` → `success/failed`) with mandatory logging at each transition.
- The session file JSONL should never contain structurally incomplete message entries — either validate before writing or discard partial responses.
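
The staged transitions with mandatory logging could be enforced along these lines. This is a minimal sketch, not OpenClaw's implementation: `SessionStateMachine`, `TransitionLog`, and the transition table are hypothetical names, and the permitted transitions are an assumption derived from the stages listed in this report.

```typescript
type SessionState = "active" | "compacting" | "retrying" | "success" | "failed";

// Assumed transition table, following active -> compacting -> retrying -> success/failed.
// Terminal states allow no further transitions.
const ALLOWED: Record<SessionState, SessionState[]> = {
  active: ["compacting", "failed"],
  compacting: ["retrying", "failed"],
  retrying: ["success", "failed"],
  success: [],
  failed: [],
};

interface TransitionLog {
  sessionKey: string;
  from: SessionState;
  to: SessionState;
  reason: string;
}

class SessionStateMachine {
  private current: SessionState = "active";
  private readonly sessionKey: string;
  readonly log: TransitionLog[] = [];

  constructor(sessionKey: string) {
    this.sessionKey = sessionKey;
  }

  get state(): SessionState {
    return this.current;
  }

  transition(to: SessionState, reason: string): void {
    if (ALLOWED[this.current].indexOf(to) === -1) {
      throw new Error(`illegal transition ${this.current} -> ${to}`);
    }
    // Record the entry before changing state, so no transition goes unlogged.
    this.log.push({ sessionKey: this.sessionKey, from: this.current, to, reason });
    this.current = to;
  }
}

// Usage: the sequence from this incident, ending in an explicit failure state.
const sm = new SessionStateMachine("demo-session");
sm.transition("compacting", "context overflow");
sm.transition("retrying", "auto-compaction succeeded");
sm.transition("failed", "model response truncated mid-generation");
```

With a structure like this, the incident would end in a visible `failed` state with three log entries instead of an indefinite, unlabelled stall.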

### Actual behavior

1. Gateway logs show "auto-compaction succeeded; retrying prompt" but absolutely no subsequent log lines until the next user message at 17:51:41 (about 16 minutes of gateway silence, of which the session file shows about 14 minutes of activity).
2. The last JSONL entry is an incomplete thinking block (`</parameter>` without matching opening tag, no text content, no tool call, no `stopReason` field) — written despite being structurally invalid.
3. Session file timestamps show entries up to the truncation point but nothing after; no new model API calls appear in gateway logs.
4. Matrix messages that had been queued for delivery are flushed in a burst when the session gets stuck.
5. The session remains stuck until explicitly revived — there is no automatic recovery, no failure notification, and no clear error indication in the dashboard.
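
The structurally invalid entry described in item 2 could be rejected before it reaches the session file. The following is a minimal sketch under a simplified, assumed entry shape: `AssistantEntry`, `validateAssistantEntry`, and `serializeIfValid` are hypothetical, and the real JSONL schema may differ.

```typescript
// Hypothetical, simplified entry shape; only `stopReason` is named in this report.
interface AssistantEntry {
  role: "assistant";
  thinking?: string;
  text?: string;
  toolCalls?: unknown[];
  stopReason?: string;
}

// Returns a list of structural problems; an empty list means the entry looks complete.
function validateAssistantEntry(entry: AssistantEntry): string[] {
  const problems: string[] = [];
  if (!entry.stopReason) {
    problems.push("missing stopReason");
  }
  const hasText = typeof entry.text === "string" && entry.text.trim().length > 0;
  const hasToolCall = Array.isArray(entry.toolCalls) && entry.toolCalls.length > 0;
  if (!hasText && !hasToolCall) {
    problems.push("no text content and no tool call");
  }
  // Count opening and closing <parameter> tags inside the thinking block.
  const thinking = entry.thinking || "";
  const opens = (thinking.match(/<parameter\b[^>]*>/g) || []).length;
  const closes = (thinking.match(/<\/parameter>/g) || []).length;
  if (opens !== closes) {
    problems.push("unbalanced <parameter> tags in thinking block");
  }
  return problems;
}

// Serializes an entry as one JSONL line, or returns null so the caller can log and discard it.
function serializeIfValid(entry: AssistantEntry): string | null {
  return validateAssistantEntry(entry).length === 0 ? JSON.stringify(entry) : null;
}

// Usage: an entry resembling the truncated response, and a complete one.
const truncated: AssistantEntry = { role: "assistant", thinking: "partial reasoning</parameter>" };
const complete: AssistantEntry = { role: "assistant", text: "done", stopReason: "stop" };
const truncatedProblems = validateAssistantEntry(truncated);
const completeLine = serializeIfValid(complete);
```

Returning the list of problems rather than a boolean lets the caller log exactly why a partial response was discarded.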

### OpenClaw version

2026.5.28

### Operating system

Ubuntu Server 24.04 LTS (kernel 6.8.0-124-generic x64)

### Install method

git clone — updated via `cd ~/openclaw && git pull && git fetch origin --tags && git checkout <tag> && pnpm build && openclaw gateway start/stop`

### Model

llama/Qwen3.6-35B-A3B-UD-MTP-Q4_K_M

### Provider / routing chain

local (Ollama / llama.cpp backend — model served locally, no external provider)

### Additional provider/model setup details

The embedded agent runs as an `agentTurn` cron job (isolated session target). The gateway uses the default embedded agent configuration with `reserveTokensFloor: 20000`, giving an effective prompt budget of approximately 51680 tokens for a 72k context window. Auto-compaction is configured to trigger on the first overflow attempt (attempt 1/3).
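
The budget arithmetic can be checked with a short sketch. The plain difference `contextWindow - reserveTokensFloor` is 52000, while the gateway logged `promptBudgetBeforeReserve=51680`; where the remaining 320 tokens go is not known from the logs, so the sketch only reports the difference. The helper names are hypothetical and merely mirror field names from the gateway log.

```typescript
// Hypothetical helpers; field names mirror the gateway log, not OpenClaw's source.
interface BudgetInput {
  contextWindow: number;
  reserveTokensFloor: number;
}

// Nominal budget: the context window minus the reserved floor, never negative.
function nominalPromptBudget(input: BudgetInput): number {
  return Math.max(0, input.contextWindow - input.reserveTokensFloor);
}

// Overflow check as it appears in the log: estimate strictly greater than budget.
function overflows(estimatedPromptTokens: number, promptBudget: number): boolean {
  return estimatedPromptTokens > promptBudget;
}

const nominal = nominalPromptBudget({ contextWindow: 72000, reserveTokensFloor: 20000 }); // 52000
const loggedBudget = 51680; // promptBudgetBeforeReserve from the gateway log
const unexplainedMargin = nominal - loggedBudget; // 320, origin not visible in the logs

const overflowBeforeCompaction = overflows(68612, loggedBudget); // true: compaction triggers
const overflowAfterCompaction = overflows(28985, loggedBudget); // false: retry may proceed
```

The post-compaction growth reported in the session file (from about 28k to about 41k tokens) stays below either budget figure, which is consistent with no second compaction being logged before the hang.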

### Logs, screenshots, and evidence

```shell
Gateway journal entries for `openclaw-gateway` on 2026-05-31 (timestamps are CEST despite the `Z` suffix; `startAtTs` is UTC):


17:31:59.648Z [SESSION] Context overflow detected - session <room_id>... estimatedPromptTokens=68612 > promptBudgetBeforeReserve=51680 ... attempt 1/3 contextWindow=72000 reserveTokensFloor=20000
17:35:21.400Z [CONTEXT COMPACTOR] Context compaction completed (attempt 1/3) - truncated 48 tool results... remainingMessages=56 estimatedPromptTokens=28985 (reduced by 60%) ... contextWindow=72000 reserveTokensFloor=20000
17:35:21.424Z [AGENT_RUNNER] Auto-compaction succeeded; retrying prompt (attempts left: 2) ... nextAttemptPromptOverride set, suppressNextUserMessagePersistence=true
17:35:21.438Z [POST_COMPACTION_GUARD] Armed for post-compaction loop detection - windowSize=3 maxAttempts=3 maxFailures=5 ... startAtTs=2026-05-31T15:35:21.437Z


No gateway log lines appear between 17:35:21 and the next user message at 17:51:41. That is roughly 16 minutes with zero gateway logging, even though the session file shows activity until 17:49:35.

Session JSONL file analysis (`0745bbf6-1873-41d5-9cf5-e372112862a8.jsonl`):
- 200 messages written between 17:35–17:49 CEST (post-compaction, pre-failure): approximately 14 messages/minute, with the token count growing from ~28k to ~41k.
- Last entry at 17:49:35 CEST is an incomplete thinking block with no `stopReason`, no text content, and no tool call — the response was truncated mid-generation.
- No entries exist after this timestamp in the session file.

Session key: `<agent>:<main>:<chat_channel>:<room_id>`

Relevant source code locations:
- `src/agents/embedded-agent-runner/run.ts` (post-compaction retry loop, lines ~2200–2250)
- `src/agents/embedded-agent-runner/post-compaction-loop-guard.ts` (loop detection — only checks content repetition, does not detect silent API failures)
```
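
The timeline in the evidence above can be worked out with a short sketch. It uses only the timestamps quoted in this report and assumes they all fall on the same day; the helper names are illustrative.

```typescript
// Converts a same-day "HH:MM:SS" timestamp to seconds since midnight.
function toSeconds(hhmmss: string): number {
  const [h = 0, m = 0, s = 0] = hhmmss.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

function gapSeconds(from: string, to: string): number {
  return toSeconds(to) - toSeconds(from);
}

const lastGatewayLog = "17:35:21"; // "Auto-compaction succeeded; retrying prompt"
const lastJsonlEntry = "17:49:35"; // truncated thinking block
const nextUserMessage = "17:51:41"; // manual re-trigger

// Session file active, gateway silent: 854 s (14 min 14 s).
const activeButUnlogged = gapSeconds(lastGatewayLog, lastJsonlEntry);
// Neither session file nor gateway shows activity: 126 s (2 min 6 s).
const fullyStalled = gapSeconds(lastJsonlEntry, nextUserMessage);
// Total gateway silence: 980 s (16 min 20 s).
const gatewaySilence = gapSeconds(lastGatewayLog, nextUserMessage);

// 200 messages over the active window is about 14 messages per minute.
const messagesPerMinute = Math.round(200 / (activeButUnlogged / 60));
```

This separates the two windows that are easy to conflate: about 14 minutes in which the agent worked without gateway logging, followed by about 2 minutes of complete inactivity before the user intervened.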

### Impact and severity

Affected users/systems: All embedded agent sessions using auto-compaction with moderate context budgets where tool output accumulates rapidly.
Severity: High — sessions become unrecoverable without manual user intervention. Any work-in-progress is lost (the 200 messages between compaction and failure are not persisted to the session state in a recoverable way).
Frequency: Likely occurs whenever an embedded agent performs rapid iterative tool calls after compaction and encounters at least one subprocess hang or API timeout. The combination of these conditions appears non-trivial but plausible for any long-running workflow (code execution, database queries, file processing).
Consequence: Complete loss of agent state between compaction completion and next user message; no error visibility; no automatic recovery; requires manual session restart.

### Additional information

The `PostCompactionLoopGuard` (windowSize=3, maxFailures=5) does not detect this failure because the agent was making progress across different turns — generating new messages each time with varying content. It never entered a content repetition pattern. The guard only detects repetitive output loops, not silent API failures or subprocess hangs.
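
A check that complements the repetition-based guard would look at elapsed time rather than content. The following is a minimal sketch of such an inactivity watchdog; `InactivityWatchdog` is a hypothetical class, not part of OpenClaw, and the 2-minute timeout is an arbitrary value chosen for illustration. Time is passed in explicitly so the logic can be exercised without real timers.

```typescript
// Hypothetical complement to the loop guard: flags a session when no activity
// (JSONL entry, tool call, API response) has been recorded for too long.
class InactivityWatchdog {
  private lastActivityMs: number;
  private readonly timeoutMs: number;

  constructor(timeoutMs: number, startMs: number) {
    this.timeoutMs = timeoutMs;
    this.lastActivityMs = startMs;
  }

  // Call whenever the session makes observable progress.
  recordActivity(nowMs: number): void {
    this.lastActivityMs = nowMs;
  }

  silentForMs(nowMs: number): number {
    return nowMs - this.lastActivityMs;
  }

  // True once the silence reaches the timeout; the caller should then log and fail the session.
  isStalled(nowMs: number): boolean {
    return this.silentForMs(nowMs) >= this.timeoutMs;
  }
}

// Usage with the incident timeline, measured in ms from compaction success (17:35:21).
const watchdog = new InactivityWatchdog(120000, 0);
watchdog.recordActivity(854000); // last JSONL entry at 17:49:35
const stalledAfterOneMinute = watchdog.isStalled(914000); // false: 60 s of silence
const stalledAtUserMessage = watchdog.isStalled(980000); // true: 126 s of silence
```

Under these assumed values the stall would have been flagged before the user's manual re-trigger, which is the point at which an error log and a failure state could have been emitted.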

The `MID_TURN_PRECHECK_CONTINUATION_PROMPT` mechanism (setting `nextAttemptPromptOverride`) initiates the retry but provides no error handling path for when that retry itself fails without logging.

Last known good state: auto-compaction completes at 17:35:21 with clear log confirmation. First known bad state: truncated response at 17:49:35 followed by indefinite silence with zero logging. The gap between these points (~14 minutes) suggests the failure is not immediate but accumulates over multiple turns, making it difficult to correlate with a specific gateway event.

## Related issues

These were identified during pre-submission review of the OpenClaw issue tracker. 

| # | Title | Similarity | Why it's related | Key difference |
|---|-------|------------|------------------|----------------|
| **#70744** | Telegram direct session can become unrecoverable after context overflow and auto-compaction hangs | **Very high** — same post-compaction freeze, requires manual session reset | Nearly identical symptom: context overflow → compaction → session becomes non-responsive with no recovery | In #70744 compaction *itself* failed silently; in our case compaction succeeded, the agent worked for 14 min, then crashed |
| **#84777** | Compaction causes Pi runtime deadlock — agent freezes across all channels after summary generation | **High** — post-compaction silent failure, requires session rebuild (`/new`) | Same pattern: compaction succeeds → agent goes silent → no error logged → only `/new` fixes it | Different trigger: transcript write failure *during* in-place rewrite; we succeed through compaction and work for 14 min |
| **#87692** | GitHub Copilot anthropic-messages transport silently hangs ~365s in isolated cron sessions | **High on logging side** — exact same "silent abort without error log" pattern, watchdog fires after timeout | The core mystery: how does a run disappear from logs entirely? #87692 documents the identical observation (stream opens, no chunks, abort at ~365s, cron reports `status: ok`) | Different mechanism: stream-level watchdog timeout vs. our API response truncation |
| **#87711** | Telegram routing/footer issue with empty assistant response → 12 min silence | **Medium** — post-turn silent failure | Similar symptom of agent going quiet without explanation, user sees "— out" footer after delay | Different phase (normal turn, not post-compaction) and different transport |
| **#86567** | Feldbericht: agent executes tools but does not send final answer + context grows too fast → agent hangs | **Medium** — general pattern match without specific timeline | Describes the broader "context pressure → slow down/hang" family of bugs without our post-compaction detail | No structured evidence or timeline; general community report |
| **#79350** | Ollama model stalls after tool results in agent runs while direct chat works | **Low-Medium** — post-tool-call hang with timeouts and aborts | Same general symptom (agent hangs mid-workflow) but has explicit error logs and timeout behaviour | Opposite of our issue: #79350 has verbose errors; ours is silent |

## Notes for reference

- **#70744** was closed in version 2026.4.14 — our incident is on 2026.5.28, suggesting either a regression or a different sub-path that wasn't covered.
- **#84777** and **#87692** are both still open, indicating this family of problems remains unresolved upstream.
- The combination we observed — *post-compaction success → agent operates normally for minutes → silent failure with zero logging* — does not appear to be covered by any existing issue. Our case may represent a distinct sub-path within the compaction/retry mechanism that none of these address completely.

[Bug]: Embedded agent session silently hangs after auto-compaction with no error logging or recovery #89051
