Gateway becomes completely unresponsive after compaction triggers #76467

@njuboy11

Bug Report: Gateway becomes completely unresponsive after compaction triggers

Issue Description

After a compaction event fires on the main session, the Gateway stops responding to all webchat messages on that session. Messages are sent but never receive a reply, and only a Gateway restart restores normal operation.

Environment:

  • Platform: Ubuntu (VM on Tencent Cloud)
  • OpenClaw version: v2026.5.2
  • Node version: v24.14.1
  • Memory: 3.7GB total
  • Webchat channel (direct conversation, not Feishu)

Compaction config:

"compaction": {
  "truncateAfterCompaction": true,
  "maxActiveTranscriptBytes": "10mb"
}

Steps to Reproduce

  1. Main session runs for an extended period with significant conversation history
  2. Compaction triggers (context overflow or byte threshold reached)
  3. User sends a message via webchat after the system prompt about compaction appears
  4. Gateway becomes completely unresponsive — no reply, no error, no feedback
  5. Only Gateway restart (systemctl --user restart openclaw-gateway) restores functionality

Actual Behavior

After compaction triggers:

  • Compaction itself completes successfully in about 30 seconds (log line: [compaction] rotated active transcript after compaction)
  • But subsequent messages get no response
  • Session shows state=processing queueDepth=1 reason=queued_behind_active_work for extended periods
  • The log shows agent cleanup timed out events (step=pi-trajectory-flush)
  • Liveness warnings show high eventLoopDelayMaxMs values (up to 3454ms)
  • WebSocket requests (sessions.list, chat.history) continue to work for other sessions, but the main session stays stuck

Expected Behavior

Messages sent during/after compaction should either:

  • Be processed after compaction completes, or
  • Return an error message indicating the session is busy with compaction
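
The two acceptable outcomes above can be sketched with a small serial lane. This is only an illustration of the expected behavior, not OpenClaw's implementation: SessionLane, SessionBusyError, enqueue, snapshot and the rejectWhilePhase parameter are all hypothetical names, and the only identifier taken from this report is the phase string "preflight_compacting".

```typescript
// Hypothetical sketch: a per-session serial lane. None of these names come
// from the OpenClaw codebase.
class SessionBusyError extends Error {
  readonly phase: string;
  constructor(phase: string) {
    super(`session is busy: ${phase}`);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "SessionBusyError";
    this.phase = phase;
  }
}

class SessionLane {
  private tail: Promise<void> = Promise.resolve();
  private phase = "idle";
  private queued = 0;

  // Current phase and number of tasks waiting behind the active one.
  snapshot(): { phase: string; queueDepth: number } {
    return { phase: this.phase, queueDepth: this.queued };
  }

  // Runs tasks one at a time, in arrival order. If rejectWhilePhase matches
  // the phase that is active right now, the caller gets an immediate error
  // instead of waiting silently.
  enqueue<T>(
    phase: string,
    task: () => Promise<T>,
    rejectWhilePhase?: string,
  ): Promise<T> {
    if (rejectWhilePhase !== undefined && this.phase === rejectWhilePhase) {
      return Promise.reject(new SessionBusyError(this.phase));
    }
    this.queued += 1;
    const result = this.tail.then(async () => {
      this.queued -= 1;
      this.phase = phase;
      try {
        return await task();
      } finally {
        // Always leave the active phase, even when the task throws,
        // so the next queued task can start.
        this.phase = "idle";
      }
    });
    // A failed task must not break the chain for tasks queued behind it.
    this.tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }
}
```

With this shape, a message enqueued while "preflight_compacting" is active either runs as soon as compaction settles or, when the caller opts in through rejectWhilePhase, fails right away with SessionBusyError. The finally block is the part that matters for this report: the lane returns to "idle" no matter how the active task ends.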

Log Evidence

# Stuck session during compaction window
12:36:04 long-running session: sessionId=main sessionKey=agent:main:main state=processing age=125s queueDepth=1 reason=queued_behind_active_work classification=long_running
12:36:34 long-running session: sessionId=main sessionKey=agent:main:main state=processing age=155s queueDepth=1 reason=queued_behind_active_work
12:37:21 [compaction] rotated active transcript after compaction (sessionKey=agent:main:main)
12:39:37 long-running session: sessionId=main sessionKey=agent:main:main state=processing age=135s queueDepth=0 reason=active_work classification=long_running

# Cleanup timeouts during the stuck period
12:30:36 agent cleanup timed out: runId=... sessionId=... step=pi-trajectory-flush timeoutMs=
12:33:05 agent cleanup timed out: runId=... sessionId=... step=pi-trajectory-flush timeoutMs=

# Event loop delays
12:21:55 liveness warning: reasons=event_loop_delay interval=31s eventLoopDelayP99Ms=897.1 eventLoopDelayMaxMs=1784.7
12:22:55 liveness warning: reasons=event_loop_delay,cpu interval=30s eventLoopDelayP99Ms=1891.6 eventLoopDelayMaxMs=3454 eventLoopUtilization=0.852
12:24:56 liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=30s eventLoopDelayP99Ms=3259 eventLoopDelayMaxMs=3259 eventLoopUtilization=0.999
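
For triage, the liveness lines above can be turned into structured records so the worst intervals are easy to spot. The following is a minimal sketch that assumes the line layout shown in this report (a time, the text "liveness warning:", then space-separated key=value pairs); parseLivenessWarning and the LivenessWarning type are hypothetical helpers, not part of OpenClaw.

```typescript
// Hypothetical triage helper; assumes the log layout shown in this report.
interface LivenessWarning {
  time: string;
  reasons: string[];
  // Numeric fields such as eventLoopDelayMaxMs. A unit suffix like the
  // "s" in interval=30s is dropped by parseFloat.
  fields: Record<string, number>;
}

function parseLivenessWarning(line: string): LivenessWarning | null {
  const match = /^(\d{2}:\d{2}:\d{2}) liveness warning: (.*)$/.exec(line.trim());
  if (match === null) return null;
  const time = match[1] ?? "";
  const rest = match[2] ?? "";

  let reasons: string[] = [];
  const fields: Record<string, number> = {};
  for (const token of rest.split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq <= 0) continue; // not a key=value pair
    const key = token.slice(0, eq);
    const value = token.slice(eq + 1);
    if (key === "reasons") {
      reasons = value.split(",");
      continue;
    }
    const parsed = parseFloat(value);
    if (!Number.isNaN(parsed)) fields[key] = parsed;
  }
  return { time, reasons, fields };
}
```

Filtering the parsed records on eventLoopUtilization or eventLoopDelayMaxMs gives the time window in which the event loop was saturated, which can then be compared with the compaction and cleanup timestamps.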

Preliminary Root Cause Analysis

The issue appears to be in how messages are handled during/after preflightCompaction:

  1. preflightCompaction executes synchronously inside the session lane (via enqueueCommandInLane in agent-runner.runtime-DCKwkWFL.js line ~1742)
  2. setPhase("preflight_compacting") is called, but the phase has no timeout protection
  3. During compaction, new messages from webchat arrive and are queued (queueDepth=1)
  4. After compaction completes, the queued messages do not appear to be dequeued or processed
  5. The session remains in state=processing indefinitely

The replyOperation.abortSignal passed to compaction does not have a timeout that would interrupt a slow compaction.
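
One possible mitigation for the missing timeout is to derive a deadline-bound signal from the existing abort signal and hand that to compaction. The sketch below is an assumption about what such a wrapper could look like, not a patch against the real code: runWithDeadline and CompactionTimeoutError are hypothetical names, and it assumes the compaction routine accepts an AbortSignal and returns a promise.

```typescript
// Hypothetical wrapper; names are illustrative and not from OpenClaw.
class CompactionTimeoutError extends Error {
  readonly timeoutMs: number;
  constructor(timeoutMs: number) {
    super(`compaction exceeded ${timeoutMs}ms`);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "CompactionTimeoutError";
    this.timeoutMs = timeoutMs;
  }
}

async function runWithDeadline<T>(
  work: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
  parent?: AbortSignal,
): Promise<T> {
  const controller = new AbortController();

  // Forward an abort from the caller's signal (for example the one the
  // report calls replyOperation.abortSignal) to the derived signal.
  const onParentAbort = (): void => controller.abort(parent?.reason);
  if (parent?.aborted) {
    controller.abort(parent.reason);
  } else {
    parent?.addEventListener("abort", onParentAbort, { once: true });
  }

  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      const err = new CompactionTimeoutError(timeoutMs);
      controller.abort(err); // ask the work to stop
      reject(err); // and stop waiting for it either way
    }, timeoutMs);
  });

  try {
    return await Promise.race([work(controller.signal), deadline]);
  } finally {
    clearTimeout(timer);
    parent?.removeEventListener("abort", onParentAbort);
  }
}
```

A caller would wrap the compaction step, for example runWithDeadline((signal) => compact(signal), 60000, replyOperation.abortSignal), where compact and the 60000 ms budget are placeholders. A timer-based deadline can only fire when the event loop is free, so it would not help on its own during intervals like the one logged with eventLoopUtilization=0.999; the compaction work would also need to yield or run off the main thread.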

Possible Related Factors

  • Session file: /root/.openclaw/agents/main/sessions/b84aa148-4a29-4f2f-94e5-9b7296aabbf3.jsonl
  • Context overflow events: "Context overflow: estimated context size exceeds safe threshold during tool loop", reported with compaction attempts: 0, meaning preflight compaction did not run before the overflow
  • Compaction succeeded but didn't prevent the stuck state
  • Recovery required full Gateway restart, not just session reset

Tags

bug compaction session-lane webchat v2026.5.2
