Bug Report: Stuck sessions cause permanent gateway hang with no auto-recovery (v2026.4.26) #73510

@WS-Q0758

name: Bug Report
about: Stuck sessions cause gateway to become permanently unresponsive
labels: bug, stability, feishu

Summary

OpenClaw v2026.4.26 with the Feishu WebSocket channel becomes permanently unresponsive when a session enters the stuck state. The diagnostic system detects the problem but takes no recovery action, resulting in a complete channel outage that requires manual intervention.

Environment

  • OpenClaw: v2026.4.26 (be8c246)
  • Channel: Feishu (WebSocket mode)
  • Model: bailian/qwen3.6-plus (Aliyun DashScope Coding Plan)
  • OS: Linux x64, 16GB RAM (Ubuntu)
  • Node.js: v22.22.2

Timeline of Events

| Time (UTC+8) | Event |
| --- | --- |
| 16:25 | First stuck session detected (stuck session age=150s) on Feishu DM session |
| 16:25–16:32 | Stuck session alarm repeats every 30s: 150s → 180s → 210s → 240s → 270s → 300s → 330s → 360s → 390s → 420s → 450s → 480s → 510s |
| 16:33–16:38 | Feishu messages received but dispatch returns replies=0 (responses silently dropped) |
| 16:42–18:36 | Gateway restarted 6+ times; each time briefly recovered then stuck again |
| 18:36 | Gateway running but Feishu channel still non-responsive |

Total outage duration: >3 hours of repeated failures across multiple restarts.


Bug 1: Stuck Session Detection Has No Auto-Recovery (Critical)

Severity: 🔴 Critical — complete service outage

Description: The diagnostic subsystem correctly detects stuck sessions and logs warnings, but takes zero recovery action. Sessions remain permanently in state=processing with queueDepth=1, blocking all subsequent messages to that session.

Log Evidence:

{"subsystem":"diagnostic","message":"stuck session: sessionId=main sessionKey=agent:main:feishu:direct:ou_xxx state=processing age=150s queueDepth=1"}
{"subsystem":"diagnostic","message":"stuck session: sessionId=main sessionKey=agent:main:feishu:direct:ou_xxx state=processing age=510s queueDepth=1"}
{"subsystem":"gateway/channels/feishu","message":"feishu[default]: dispatch complete (queuedFinal=false, replies=0)"}

The diagnostic fires every 30 seconds with increasing age (150s → 510s), but no kill, reset, or restart is triggered. The process becomes permanently unresponsive.
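
To count consecutive detections per session, the diagnostic message can be parsed back into structured fields. The following is a minimal sketch that assumes the message format is exactly as shown in the log evidence above; `StuckSessionReport` and `parseStuckSessionMessage` are hypothetical names, not OpenClaw APIs.

```typescript
// Hypothetical shape for one "stuck session" diagnostic line.
// Field names mirror the keys that appear in the log message.
interface StuckSessionReport {
  sessionId: string;
  sessionKey: string;
  state: string;
  ageSeconds: number;
  queueDepth: number;
}

// Assumption: the message always follows the format seen in the logs above.
const STUCK_SESSION_PATTERN =
  /^stuck session: sessionId=(\S+) sessionKey=(\S+) state=(\S+) age=(\d+)s queueDepth=(\d+)$/;

// Returns the parsed report, or null for any other log message.
function parseStuckSessionMessage(message: string): StuckSessionReport | null {
  const match = STUCK_SESSION_PATTERN.exec(message);
  if (match === null) return null;
  return {
    sessionId: match[1],
    sessionKey: match[2],
    state: match[3],
    ageSeconds: Number(match[4]),
    queueDepth: Number(match[5]),
  };
}
```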

Impact: The gateway is functionally dead and no messages can be processed. Only a manual restart helps, and even that is temporary if the root cause persists.

Expected behavior:

  • Stuck session timeout → kill the hung request
  • Or: auto-reset the affected session
  • Or: trigger gateway restart after N consecutive stuck detections
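
The three options above could be combined into one escalation policy. This is a sketch under assumptions: the action names, the `RecoveryPolicy` fields, and the thresholds in the usage note are all hypothetical and only illustrate the requested behavior.

```typescript
// Hypothetical recovery actions, ordered from least to most disruptive.
type RecoveryAction = "none" | "abort-request" | "reset-session" | "restart-gateway";

// Hypothetical, configurable thresholds (not existing OpenClaw settings).
interface RecoveryPolicy {
  abortAfterSeconds: number;      // kill the hung request once the session is this old
  resetAfterSeconds: number;      // reset the session if aborting did not help
  restartAfterDetections: number; // restart the gateway after N consecutive detections
}

// Picks the most severe action whose threshold has been reached.
function chooseRecoveryAction(
  ageSeconds: number,
  consecutiveDetections: number,
  policy: RecoveryPolicy,
): RecoveryAction {
  if (consecutiveDetections >= policy.restartAfterDetections) return "restart-gateway";
  if (ageSeconds >= policy.resetAfterSeconds) return "reset-session";
  if (ageSeconds >= policy.abortAfterSeconds) return "abort-request";
  return "none";
}
```

With a policy of `{ abortAfterSeconds: 180, resetAfterSeconds: 300, restartAfterDetections: 10 }` and the 30-second detection cadence from the timeline, the hung request would be aborted at age=180s, the session reset at age=300s, and the gateway restarted at the tenth detection (age=420s).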

Bug 2: Compaction Does Not Trigger at Gateway Startup (Critical)

Severity: 🔴 Critical — prevents self-healing after restart

Description: When compaction.mode: safeguard and compaction.maxActiveTranscriptBytes: "20mb" are configured, compaction only runs as a preflight check before a new run. At gateway startup, existing large transcript files are loaded without compaction.

Our Feishu session transcript was 2.7MB / 1008 lines (with a trajectory file that grew to 14MB before being deleted by OpenClaw). On every restart, this 2.7MB file is fully loaded into memory, contributing to the event loop overload that causes the stuck session in Bug 1.

Because the 2.7MB transcript is below the 20MB threshold, the preflight compaction check never triggers either, so the transcript accumulates indefinitely.

Impact: Large transcripts survive restarts and immediately re-create the conditions that caused the original crash. The compaction config is effectively useless for existing sessions.

Expected behavior:

  • Gateway startup should check transcript size and run compaction if needed
  • Or: add a line-based compaction threshold (e.g., >500 lines) in addition to the byte-based one
  • Or: compact all sessions on startup regardless of size
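
A startup check that combines the byte-based and line-based thresholds might look like the sketch below. It assumes that a size string such as "20mb" uses 1024-based units; how OpenClaw actually parses this setting is not known to us. `parseByteSize`, `needsStartupCompaction`, and `maxLines` are hypothetical names.

```typescript
interface TranscriptStats {
  bytes: number; // file size on disk
  lines: number; // number of transcript lines
}

interface CompactionThresholds {
  maxBytes: number;
  maxLines: number; // hypothetical line-based threshold
}

// Parses strings such as "20mb" into bytes.
// Assumption: units are 1024-based (1mb = 1024 * 1024 bytes).
function parseByteSize(text: string): number {
  const match = /^(\d+(?:\.\d+)?)\s*(b|kb|mb|gb)$/i.exec(text.trim());
  if (match === null) throw new Error("unrecognized size: " + text);
  const units: Record<string, number> = {
    b: 1,
    kb: 1024,
    mb: 1024 * 1024,
    gb: 1024 * 1024 * 1024,
  };
  return Math.round(Number(match[1]) * units[match[2].toLowerCase()]);
}

// True when either threshold is exceeded, so a small-in-bytes but
// long-in-lines transcript still gets compacted at startup.
function needsStartupCompaction(stats: TranscriptStats, thresholds: CompactionThresholds): boolean {
  return stats.bytes > thresholds.maxBytes || stats.lines > thresholds.maxLines;
}
```

With `maxBytes` taken from "20mb" and `maxLines` set to 500, a transcript like ours (about 2.7MB, 1008 lines) would be compacted at startup because of the line count alone.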

Bug 3: Trajectory Files Grow Without Bound

Severity: 🟡 High — memory bomb

Description: The Feishu session trajectory file grew to 14MB before OpenClaw eventually deleted it (leaving a *.trajectory.jsonl.deleted.* artifact). There is no configured size limit for trajectory files.

Evidence:

-rw------- 1 openclaw 14M agents/main/sessions/3e9dd919.trajectory.jsonl.deleted.2026-04-28T10-05-50.972Z

Impact: Unbounded trajectory growth contributes to memory pressure and eventual crash.

Expected behavior: Trajectory files should have independent size limits with automatic truncation.
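
One way to truncate a trajectory file is to keep only the most recent lines that fit under a byte cap. The sketch below assumes the trajectory is newline-delimited JSON (as the `.jsonl` extension suggests) and that dropping the oldest entries is acceptable; `truncateJsonl` is a hypothetical helper that works on file contents already read into a string.

```typescript
// Keeps the newest lines whose combined UTF-8 size (including one newline
// per line) stays within maxBytes. Older lines are dropped whole, so every
// remaining line is still a complete JSON record.
function truncateJsonl(content: string, maxBytes: number): string {
  const encoder = new TextEncoder();
  const lines = content.split("\n").filter((line) => line.length > 0);
  const kept: string[] = [];
  let total = 0;
  // Walk backwards from the newest line until the cap would be exceeded.
  for (let i = lines.length - 1; i >= 0; i--) {
    const size = encoder.encode(lines[i]).length + 1; // +1 for the newline
    if (total + size > maxBytes) break;
    kept.push(lines[i]);
    total += size;
  }
  kept.reverse(); // restore chronological order
  return kept.length > 0 ? kept.join("\n") + "\n" : "";
}
```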


Bug 4: No Memory Cap on Gateway Process

Severity: 🟡 Medium — can affect other services

Description: Gateway RSS memory grew continuously: 718MB (post-upgrade) → 865MB → 1.0GB peak. No --max-old-space-size parameter is set on the Node.js process, and no systemd MemoryMax= limit exists.

Memory Timeline:

| Time | RSS Memory | Notes |
| --- | --- | --- |
| Upgrade (04/28) | 718MB | Fresh start |
| 16:42 restart | ~818MB | Before manual cleanup |
| 17:31 restart | 728MB | After config changes |
| 18:10 restart | 671MB → 706MB | 7 min growth |
| 18:36 restart | 865MB → peak 1.0GB | Current |

Expected behavior: Gateway should have configurable memory limits with graceful degradation or restart.
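
As an illustration of "graceful degradation or restart", a periodic check could classify the current RSS against a soft and a hard limit. This is only a sketch: `classifyRss`, the action names, and the limits in the usage note are hypothetical.

```typescript
// Hypothetical outcomes of a memory check.
type MemoryAction = "ok" | "shed-load" | "restart";

// Below the soft limit nothing happens; between the limits the gateway
// could stop accepting new work; at or above the hard limit it restarts.
function classifyRss(rssBytes: number, softLimitBytes: number, hardLimitBytes: number): MemoryAction {
  if (rssBytes >= hardLimitBytes) return "restart";
  if (rssBytes >= softLimitBytes) return "shed-load";
  return "ok";
}
```

In a real gateway the `rssBytes` input would come from `process.memoryUsage().rss`. With a soft limit of 768MB and a hard limit of 1024MB, the 718MB reading from the table above is "ok", 865MB is "shed-load", and the 1.0GB peak is "restart". A check like this would complement, not replace, `--max-old-space-size` and a systemd `MemoryMax=` limit.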


Bug 5: Feishu Channel Heartbeat Ping Blocks Event Loop

Severity: 🟡 Medium — cascading failures

Description: The Feishu channel sends a periodic ping to https://open.feishu.cn/open-apis/bot/v1/openclaw_bot/ping. When this request times out (10 seconds), it appears to block the Node.js event loop, and all other HTTP requests queue behind it, including model API calls and other channel operations.

Log Evidence:

AxiosError: timeout of 10000ms exceeded
url: 'https://open.feishu.cn/open-apis/bot/v1/openclaw_bot/ping'

Impact: A single slow outbound request degrades the entire gateway. In our case, this created a cascading failure where the ping timeout contributed to model API timeouts, which in turn caused the stuck session.

Expected behavior: Heartbeat pings should run in a non-blocking manner with independent timeout handling.
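
A circuit breaker is one common way to stop a failing outbound request from being retried on every cycle. The sketch below is a generic implementation, not OpenClaw code; the class name, thresholds, and injected clock are assumptions made for illustration.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Minimal circuit breaker: opens after N consecutive failures, then allows
// a single trial attempt once the cooldown has elapsed.
class PingCircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  // The clock is injectable so the breaker can be tested without waiting.
  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  state(): BreakerState {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  // Callers skip the ping entirely while the breaker is open.
  canAttempt(): boolean {
    return this.state() !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // A failure at or past the threshold (re)starts the cooldown window.
  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = this.now();
  }
}
```

The heartbeat loop would call `canAttempt()` before each ping, then `recordSuccess()` or `recordFailure()` depending on the outcome. After three consecutive 10-second timeouts with a threshold of 3, further pings would be skipped until the cooldown expires.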


Bug 6: Gateway Service PATH Configuration Incomplete

Severity: 🔵 Low — cosmetic but confusing

Description: openclaw gateway status reports:

Gateway service PATH missing required dirs: /home/openclaw/.local/share/fnm/aliases/default/bin
Recommendation: run "openclaw doctor --repair"

The PATH is not auto-fixed despite openclaw doctor --repair being available.


Root Cause Analysis

The failure chain is:

1. Feishu session transcript grows unbounded (Bug 2, Bug 3)
     ↓
2. Large transcript (2.7MB) loaded on every dispatch
     ↓
3. Model API request takes too long → timeout
     ↓
4. Node.js event loop blocked (Bug 5 amplifies this)
     ↓
5. Session enters stuck state (queueDepth=1, state=processing)
     ↓
6. Stuck session detected but NO action taken (Bug 1)
     ↓
7. All subsequent messages dropped (replies=0)
     ↓
8. Complete channel outage until manual restart

Restarting does not fix the problem because the large transcript is reloaded without compaction (Bug 2).


Requested Actions

  1. Implement auto-recovery for stuck sessions (kill/reset/restart after configurable timeout)
  2. Trigger compaction at gateway startup for sessions exceeding configurable thresholds
  3. Add trajectory file size limits with automatic truncation
  4. Document recommended memory limits for Node.js gateway process
  5. Make Feishu ping non-blocking or add circuit breaker for failing outbound requests
