Bug: Compaction causes Pi runtime deadlock — agent freezes across all channels after summary generation #84777

@villa-feng

Summary

Compaction causes a Pi runtime deadlock after summary generation, freezing ALL channels of the affected agent. The gateway remains healthy (no crash, no error log), but the agent stops responding on every channel until the session is rebuilt.

OpenClaw 2026.5.18 (50a2481) · macOS 15.6 · Node 23.11.0 · Apple Silicon

Reproduction Pattern (3 days, 4 occurrences)

  1. Agent session accumulates ~182k of 262k tokens (~70%), the compaction threshold implied by reserveTokensFloor: 80000 (262k - 80k = 182k)
  2. Auto-compaction triggers (or manual /compact)
  3. Compaction summary is generated successfully (verified in transcript)
  4. .reset backup created
  5. Post-compaction: no new messages written to transcript — session goes silent
  6. All channels of the same agent freeze (confirmed: Feishu + WeCom both unresponsive)
  7. Other agents on same gateway unaffected
  8. /compact returns "skipped: session was already compacted recently"
  9. Gateway restart does NOT recover — agent still unresponsive
  10. Only /new (fresh session) restores function

Timeline (Latest Incident)

All times UTC+8 (Beijing):

Time    Event
10:54   Gateway auto-restarted by launchd (kickstart)
10:58   User message processed normally
11:00   Auto-compaction triggered (182,767 tokens)
11:00   Compaction summary generated (comprehensive, well-structured)
11:00   .reset backup created (1.5MB, 570 lines)
11:00+  No new messages in transcript: deadlock
11:05   Session marked as reset
11:08   User rebuilt session → new session works

Evidence

Compaction entry in transcript (last entry before deadlock)

{
  "type": "compaction",
  "timestamp": "2026-05-21T03:00:56.816Z",
  "summary": "## Goal\n- Investigate...",
  "tokensBefore": 182767,
  "fromHook": false
}

Summary was well-formed with Goal, Progress, Next Steps, read-files, modified-files — quality is fine.
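The symptom above (a compaction record as the final transcript entry, followed by silence) could be checked mechanically. The following is a minimal sketch, not OpenClaw code: `looksStalledAfterCompaction`, the `TranscriptEntry` shape, and the `"message"` entry type in the sample are assumptions for illustration. Only the `type` and `timestamp` fields of the compaction entry are taken from this report, and the sample's first line is made up.

```typescript
// Assumed minimal shape: only "type" and "timestamp" are relied on.
interface TranscriptEntry {
  type: string;
  timestamp: string; // ISO 8601, as in the compaction entry above
}

// Returns true when the newest transcript line is a compaction record and
// more than `maxSilenceMs` has passed since it was written.
function looksStalledAfterCompaction(jsonl: string, now: Date, maxSilenceMs: number): boolean {
  const lines = jsonl.split("\n").filter((line) => line.trim() !== "");
  const lastLine = lines.pop();
  if (lastLine === undefined) return false; // empty transcript
  const last = JSON.parse(lastLine) as TranscriptEntry;
  if (last.type !== "compaction") return false;
  const silenceMs = now.getTime() - Date.parse(last.timestamp);
  return silenceMs > maxSilenceMs;
}

// Illustrative sample: a hypothetical earlier entry plus the reported compaction entry.
const sampleTranscript = [
  JSON.stringify({ type: "message", timestamp: "2026-05-21T02:58:10.000Z" }),
  JSON.stringify({
    type: "compaction",
    timestamp: "2026-05-21T03:00:56.816Z",
    tokensBefore: 182767,
    fromHook: false,
  }),
].join("\n");

const twoMinutesMs = 2 * 60 * 1000;
const stalledAtReset = looksStalledAfterCompaction(
  sampleTranscript,
  new Date("2026-05-21T03:05:00.000Z"), // 11:05 UTC+8, when the session was marked as reset
  twoMinutesMs,
);
```

The two-minute silence threshold is arbitrary; a real check would need a value that tolerates normal idle periods.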

File state after deadlock

Session directory contains:

  • xxx.jsonl.reset.<timestamp> — backup created at reset (1.5MB)
  • xxx.checkpoint.<uuid>.jsonl — pre-compaction checkpoint (611KB)
  • xxx.trajectory.jsonl — full trajectory (10MB)
  • xxx.trajectory-path.json — pointer

Missing: No compacted .json successor file was ever created.
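To make the file inventory above repeatable across incidents, a directory listing can be grouped by the four naming patterns observed. This is a sketch under assumptions: `classifySessionFile` and `groupSessionFiles` are hypothetical helpers, the patterns are inferred only from the names listed in this report, and the placeholders (`xxx`, `TIMESTAMP`, `UUID`) stand in for values that are elided here. The name of the expected compacted successor is not known, so anything that does not match is simply surfaced as unrecognized.

```typescript
type SessionFileKind =
  | "reset-backup"
  | "checkpoint"
  | "trajectory"
  | "trajectory-pointer"
  | "unrecognized";

// Classify one file name using the patterns seen in the report.
function classifySessionFile(name: string, sessionId: string): SessionFileKind {
  if (name.startsWith(`${sessionId}.jsonl.reset.`)) return "reset-backup";
  if (name.startsWith(`${sessionId}.checkpoint.`) && name.endsWith(".jsonl")) return "checkpoint";
  if (name === `${sessionId}.trajectory.jsonl`) return "trajectory";
  if (name === `${sessionId}.trajectory-path.json`) return "trajectory-pointer";
  return "unrecognized";
}

// Group a directory listing by kind so a missing or unexpected file stands out.
function groupSessionFiles(names: string[], sessionId: string): Record<SessionFileKind, string[]> {
  const groups: Record<SessionFileKind, string[]> = {
    "reset-backup": [],
    "checkpoint": [],
    "trajectory": [],
    "trajectory-pointer": [],
    "unrecognized": [],
  };
  for (const name of names) {
    groups[classifySessionFile(name, sessionId)].push(name);
  }
  return groups;
}

// Placeholder names mirroring the listing above; real names carry a timestamp and a UUID.
const observedListing = [
  "xxx.jsonl.reset.TIMESTAMP",
  "xxx.checkpoint.UUID.jsonl",
  "xxx.trajectory.jsonl",
  "xxx.trajectory-path.json",
];
const groupedFiles = groupSessionFiles(observedListing, "xxx");
```

For the listing in this report, every file falls into one of the four known groups and the unrecognized group stays empty, which matches the observation that no successor file was written.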

Multi-channel confirmation

When Feishu froze, the WeCom channel of the same agent also stopped responding within minutes. A different agent on the same gateway continued working normally, confirming the deadlock is agent-scoped, not gateway-scoped.

Gateway health

  • gateway.err.log: zero errors for the incident day
  • gateway.log: stopped writing on May 19 (2 days before the incident), possibly due to log rotation or a logging bug
  • Gateway process: healthy, no crash
  • Other agents: fully functional

Configuration Context

{
  "agents": {
    "defaults": {
      "compaction": {
        "reserveTokensFloor": 80000,
        "midTurnPrecheck": { "enabled": true }
      }
    }
  }
}

Key point: with reserveTokensFloor: 80000 on a 262k context window, compaction triggers at ~70% (182k tokens), much earlier than with the default 24k reserve (~91%).

truncateAfterCompaction was not set (default false) — in-place rewrite mode.
notifyUser was not set (default false).
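The trigger points quoted in this section can be reproduced with a short calculation. This sketch assumes compaction fires once usage reaches the context window minus reserveTokensFloor, and treats "262k" as 262,000 tokens. Both assumptions match the numbers in this report but have not been checked against OpenClaw's source; `compactionTrigger` is a hypothetical helper.

```typescript
// Assumption: trigger point = contextWindow - reserveTokensFloor.
function compactionTrigger(
  contextWindow: number,
  reserveTokensFloor: number,
): { tokens: number; percent: number } {
  const tokens = contextWindow - reserveTokensFloor;
  // Percentage of the context window, rounded to one decimal place.
  const percent = Math.round((tokens / contextWindow) * 1000) / 10;
  return { tokens, percent };
}

const reportedSetting = compactionTrigger(262000, 80000); // { tokens: 182000, percent: 69.5 }
const defaultSetting = compactionTrigger(262000, 24000); // { tokens: 238000, percent: 90.8 }
```

The observed trigger at 182,767 tokens sits just above the computed 182,000, which is consistent with the check firing on the first turn that crosses the threshold.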

Hypothesis

Compaction summary generation succeeds, but the subsequent transcript write/rotation step appears to fail silently. Since truncateAfterCompaction is false, OpenClaw uses in-place transcript rewrite. The suspected failure mode is an async file operation that never settles, which leaves the Pi runtime's event loop waiting and blocks the agent's entire message processing queue. This would explain:

  1. Agent-level deadlock (Pi runtime blocked, not gateway)
  2. No gateway errors (the event loop is stuck, not crashed)
  3. Gateway restart doesn't help (the broken session state persists on disk)
  4. /new fixes it (creates fresh Pi runtime + fresh transcript)

Two factors may contribute: reserveTokensFloor: 80000, which causes frequent early compactions, and the gateway restart shortly before compaction, which may leave session state slightly inconsistent when the next auto-compaction fires.
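If the hypothesis is right, one way to turn a silent hang into a visible error is to bound the post-compaction write with a timeout. The sketch below illustrates that general pattern only: `withTimeout`, `TimeoutError`, and `demoHungWrite` are hypothetical names, the never-settling promise is a stand-in for the suspected stuck file operation, and nothing here reflects how OpenClaw or the Pi runtime is actually structured.

```typescript
class TimeoutError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TimeoutError";
  }
}

// Settles with the operation's result, or rejects with TimeoutError if the
// operation has not settled within `ms` milliseconds.
function withTimeout<T>(operation: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new TimeoutError(`${label} did not settle within ${ms} ms`)),
      ms,
    );
    operation.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Stand-in for the suspected failure: a write whose promise never settles.
async function demoHungWrite(): Promise<string> {
  const hungWrite = new Promise<void>(() => {});
  try {
    await withTimeout(hungWrite, 50, "transcript rewrite");
    return "completed";
  } catch (err) {
    // A caller could log this and release the agent's queue instead of waiting forever.
    return err instanceof Error ? err.name : "unknown error";
  }
}
```

The guard does not cancel the underlying operation; it only stops callers from waiting on it indefinitely, so an error reaches the log and the queue can move on. It would also not address the on-disk state that survives a gateway restart.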

Workaround Applied

  • reserveTokensFloor: 80000 → 24000 (default)
  • truncateAfterCompaction: false → true
  • notifyUser: false → true

Related

  • Model: deepseek-v4-pro (262k context)
  • Previous occurrences: same pattern observed on May 19 and May 20
  • Session reset files preserved for debugging if needed

Metadata

    Labels

      • P1: High-priority user-facing bug, regression, or broken workflow.
      • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
      • clawsweeper:needs-maintainer-review: ClawSweeper marked this issue as needing maintainer review before automation.
      • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
      • impact:crash-loop: Crash, hang, restart loop, or process-level availability failure.
      • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
      • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
      • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
