Compaction causes gateway to hang, requiring manual restart #13379

@waldo-claw

Bug Description

After compaction triggers, the session gets stuck and the gateway becomes unresponsive. All channels (Control UI, Feishu, new sessions) stop responding. Only a gateway restart resolves it.

Environment

  • OpenClaw Version: 2026.2.9
  • Platform: WSL2 (Ubuntu)
  • Gateway Mode: Local (loopback)
  • Model: MiniMax-M2.1
  • Compaction Mode: safeguard

Reproduction Steps

  1. A session runs normally for an extended period (multiple hours or days)
  2. Compaction triggers automatically (when the context approaches its limit)
  3. Gateway becomes completely unresponsive:
    • Control UI: Shows connection error / "(no output)"
    • Feishu: Messages fail to send
    • New sessions: Cannot establish connection
  4. Only workaround: Manual gateway restart (openclaw gateway restart)

Technical Details

Timeline from Gateway Logs

# Compaction started
{"subsystem":"agent/embedded","1":"embedded run compaction start: runId=33d6ae56-5065-45c1-8eae-13b8d669bf8e"}
[Timestamp: 2026-02-10T10:56:53.317Z]

# Compaction retry triggered
{"subsystem":"agent/embedded","1":"embedded run compaction retry: runId=33d6ae56-5065-45c1-8eae-13b8d669bf8e"}
[Timestamp: 2026-02-10T10:57:30.652Z]

# Run timeout fires (timeoutMs=600000, i.e. 600 seconds)
{"subsystem":"agent/embedded","1":"embedded run timeout: runId=33d6ae56-5065-45c1-8eae-13b8d669bf8e timeoutMs=600000"}
[Timestamp: 2026-02-10T11:04:39.975Z]
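
The gaps between these three events can be checked with a few lines of Python. This is a minimal sketch that uses only the timestamps quoted above; the variable and function names are illustrative and not part of OpenClaw.

```python
from datetime import datetime

# Timestamps copied from the gateway log excerpts above (UTC).
EVENTS = {
    "compaction_start": "2026-02-10T10:56:53.317Z",
    "compaction_retry": "2026-02-10T10:57:30.652Z",
    "run_timeout": "2026-02-10T11:04:39.975Z",
}

def parse_utc(ts: str) -> datetime:
    # Replace the trailing "Z" so fromisoformat also works on Python < 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

times = {name: parse_utc(ts) for name, ts in EVENTS.items()}
start_to_retry = (times["compaction_retry"] - times["compaction_start"]).total_seconds()
start_to_timeout = (times["run_timeout"] - times["compaction_start"]).total_seconds()

print(f"start -> retry:   {start_to_retry:.3f} s")    # 37.335 s
print(f"start -> timeout: {start_to_timeout:.3f} s")  # 466.658 s
```

The timeout is logged about 467 seconds after compaction start, not 600. If the timer is accurate, the 600-second clock probably started when the embedded run began (around 10:54:40Z) rather than when compaction began.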

Additional Symptoms

After the timeout, the next gateway startup logged stale cron job running markers:

{"module":"cron","1":{"jobId":"9c06c09b-e9b4-40df-8626-22c43ec0cd37","runningAtMs":1770726600004},"2":"cron: clearing stale running marker on startup"}
{"module":"cron","1":{"jobId":"7d2b3ecd-5e78-4fc3-aeb2-f4d559a033f0","runningAtMs":1770726600004},"2":"cron: clearing stale running marker on startup"}

These markers are cleared on the subsequent gateway restart, which indicates that the previous shutdown was unclean.
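
To line the cron entries up with the gateway timeline, the runningAtMs value can be converted to a UTC timestamp. The sketch below assumes runningAtMs is Unix epoch time in milliseconds, which the field name suggests but the logs do not confirm; `ms_to_utc` is an illustrative helper.

```python
from datetime import datetime, timedelta, timezone

# runningAtMs value shared by both cron log entries above.
# Assumption: Unix epoch time in milliseconds.
RUNNING_AT_MS = 1770726600004

def ms_to_utc(ms: int) -> datetime:
    # Split into whole seconds and milliseconds to avoid float rounding.
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)

print(ms_to_utc(RUNNING_AT_MS).isoformat())  # 2026-02-10T12:30:00.004000+00:00
```

Under that assumption, both jobs were marked as running at 12:30:00 UTC on 2026-02-10, roughly 85 minutes after the run timeout in the gateway log.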

Impact

  • Severity: High - Complete service disruption
  • User Experience: Gateway completely frozen, requires manual intervention
  • Recovery: Manual gateway restart is the only known workaround
  • Frequency: Reproduced multiple times in the same session

Suggested Investigation Areas

  1. Compaction timeout handling: When the 600-second timeout fires, the gateway appears to stay hung rather than fail gracefully
  2. State cleanup: Stale running markers not being cleared during/after timeout
  3. Message queue: Incoming messages not being processed during compaction
  4. Channel reconnection: WebSocket connections not recovering after compaction failure
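
For areas 1 and 2, the behavior one would hope for is that a timed-out operation is cancelled and its running marker is cleared in the same code path. The following is a generic asyncio sketch of that pattern, not OpenClaw code: `run_with_timeout`, `stuck_compaction`, the in-memory `running` set, and the job ID are all hypothetical.

```python
import asyncio

async def run_with_timeout(work, timeout_s: float, running: set, job_id: str) -> str:
    """Run `work()` under a deadline and always clear the running marker."""
    running.add(job_id)  # hypothetical in-memory "running" marker
    try:
        await asyncio.wait_for(work(), timeout=timeout_s)
        return "ok"
    except asyncio.TimeoutError:
        # wait_for cancels the awaited task before raising, so it does not linger.
        return "timed_out"
    finally:
        running.discard(job_id)  # cleanup runs on success, timeout, or error

async def stuck_compaction() -> None:
    await asyncio.sleep(3600)  # stands in for a compaction call that never returns

async def main() -> tuple:
    running = set()
    status = await run_with_timeout(stuck_compaction, 0.05, running, "job-1")
    return status, running

status, running = asyncio.run(main())
print(status, running)  # timed_out set()
```

The point of the `finally` block is that the marker cannot outlive the operation, so no stale state is left for the next startup to clean up.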

Workaround

Manual gateway restart:

openclaw gateway restart

Logs

Full gateway logs available at: /tmp/openclaw/openclaw-2026-02-10.log

Note: Logs are overwritten on gateway restart, so capture immediately after reproduction.
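
Since the log is lost on restart, it helps to copy it aside before running the workaround. A minimal sketch: `preserve_log` and the backup directory are hypothetical, and only the source path comes from this report.

```python
import shutil
import time
from pathlib import Path

def preserve_log(log_path: Path, dest_dir: Path) -> Path:
    """Copy the log to dest_dir under a timestamped name and return the copy's path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = dest_dir / f"{log_path.stem}.{stamp}{log_path.suffix}"
    shutil.copy2(log_path, target)  # copy2 also keeps the modification time
    return target

if __name__ == "__main__":
    source = Path("/tmp/openclaw/openclaw-2026-02-10.log")  # path from this report
    backup_dir = Path.home() / "openclaw-log-backups"       # hypothetical destination
    if source.exists():
        print(preserve_log(source, backup_dir))
```

Run it after reproducing the hang and before openclaw gateway restart.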

Additional Context

This appears to be related to issue #11140 (HEARTBEAT_OK accumulates), since compaction is part of session context management.

Labels: bug, stale