---
name: Bug Report
about: Stuck sessions cause gateway to become permanently unresponsive
labels: bug, stability, feishu
---
## Summary
OpenClaw v2026.4.26 with the Feishu WebSocket channel becomes permanently unresponsive when a session enters a stuck state. The diagnostic system detects the problem but takes no recovery action, resulting in a complete channel outage that requires manual intervention.
## Environment
- OpenClaw: v2026.4.26 (be8c246)
- Channel: Feishu (WebSocket mode)
- Model: bailian/qwen3.6-plus (Aliyun DashScope Coding Plan)
- OS: Linux x64, 16GB RAM (Ubuntu)
- Node.js: v22.22.2
## Timeline of Events

| Time (UTC+8) | Event |
| --- | --- |
| 16:25 | First stuck session detected (stuck session age=150s) on Feishu DM session |
| 16:25–16:32 | Stuck session alarm repeats every 30s: 150s → 180s → 210s → 240s → 270s → 300s → 330s → 360s → 390s → 420s → 450s → 480s → 510s |
| 16:33–16:38 | Feishu messages received, but dispatch returns replies=0 (responses silently dropped) |
| 16:42–18:36 | Gateway restarted 6+ times; each time it briefly recovered, then got stuck again |
| 18:36 | Gateway running but Feishu channel still non-responsive |
Total outage duration: >3 hours of repeated failures across multiple restarts.
## Bug 1: Stuck Session Detection Has No Auto-Recovery (Critical)
Severity: 🔴 Critical — complete service outage
Description: The diagnostic subsystem correctly detects stuck sessions and logs warnings, but takes zero recovery action. Sessions remain permanently in state=processing with queueDepth=1, blocking all subsequent messages to that session.
Log Evidence:

```json
{"subsystem":"diagnostic","message":"stuck session: sessionId=main sessionKey=agent:main:feishu:direct:ou_xxx state=processing age=150s queueDepth=1"}
{"subsystem":"diagnostic","message":"stuck session: sessionId=main sessionKey=agent:main:feishu:direct:ou_xxx state=processing age=510s queueDepth=1"}
{"subsystem":"gateway/channels/feishu","message":"feishu[default]: dispatch complete (queuedFinal=false, replies=0)"}
```
The diagnostic fires every 30 seconds with increasing age (150s → 510s), but no kill, reset, or restart is triggered. The process becomes permanently unresponsive.
Impact: Gateway is functionally dead. No messages can be processed. Only manual restart helps, and even that is temporary if the root cause persists.
Expected behavior:
- Stuck session timeout → kill the hung request
- Or: auto-reset the affected session
- Or: trigger gateway restart after N consecutive stuck detections
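Any of these options could hang off the existing 30-second diagnostic tick. Below is a minimal TypeScript sketch of the auto-reset variant; the `Session` shape, the 600s limit, and the `reset` callback are illustrative assumptions, not OpenClaw's actual internals:

```typescript
// Hypothetical session shape; OpenClaw's real structure is not public here.
interface Session {
  key: string;
  state: "idle" | "processing";
  startedAt: number; // epoch ms when the current run began
}

const STUCK_LIMIT_MS = 600_000; // e.g. force recovery after 10 minutes

// Called from the same periodic tick that today only logs the warning.
function recoverStuckSessions(
  sessions: Session[],
  now: number,
  reset: (s: Session) => void,
): string[] {
  const recovered: string[] = [];
  for (const s of sessions) {
    if (s.state === "processing" && now - s.startedAt > STUCK_LIMIT_MS) {
      reset(s); // kill the hung request, drain the queue
      s.state = "idle";
      recovered.push(s.key);
    }
  }
  return recovered;
}
```

The point is that detection and recovery share one code path, so a stuck session can never be observed without also being acted on.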
## Bug 2: Compaction Does Not Trigger at Gateway Startup (Critical)
Severity: 🔴 Critical — prevents self-healing after restart
Description: When compaction.mode: safeguard and compaction.maxActiveTranscriptBytes: "20mb" are configured, compaction only runs as a preflight check before a new run. At gateway startup, existing large transcript files are loaded without compaction.
Our Feishu session transcript was 2.7MB / 1008 lines (with a trajectory file that grew to 14MB before being deleted by OpenClaw). On every restart, this 2.7MB file is fully loaded into memory, contributing to the event loop overload that causes the stuck session in Bug 1.
Since 2.7MB < 20MB threshold, the preflight compaction check never triggers either. The transcript accumulates indefinitely.
Impact: Large transcripts survive restarts and immediately re-create the conditions that caused the original crash. The compaction config is effectively useless for existing sessions.
Expected behavior:
- Gateway startup should check transcript size and run compaction if needed
- Or: add a row-based compaction threshold (e.g., >500 lines) in addition to byte-based
- Or: compact all sessions on startup regardless of size
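A sketch of the proposed startup check, combining the existing byte threshold with the suggested row-based one; the function name and the line threshold are assumptions, not real OpenClaw config keys:

```typescript
// Thresholds: the byte limit mirrors maxActiveTranscriptBytes: "20mb";
// the line limit is the proposed (hypothetical) row-based addition.
const MAX_BYTES = 20 * 1024 * 1024;
const MAX_LINES = 500;

// Run once per session at gateway startup, before the transcript is
// loaded into memory, rather than only as a preflight before a new run.
function needsCompaction(sizeBytes: number, lineCount: number): boolean {
  return sizeBytes > MAX_BYTES || lineCount > MAX_LINES;
}
```

Under this check, the 2.7MB / 1008-line transcript from this incident would have been compacted on restart (1008 > 500 lines) even though it sits under the 20MB byte threshold.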
## Bug 3: Trajectory Files Grow Without Bound
Severity: 🟡 High — memory bomb
Description: The Feishu session trajectory file grew to 14MB before OpenClaw eventually deleted it (leaving a *.trajectory.jsonl.deleted.* artifact). There is no configured size limit for trajectory files.
Evidence:

```
-rw------- 1 openclaw 14M agents/main/sessions/3e9dd919.trajectory.jsonl.deleted.2026-04-28T10-05-50.972Z
```
Impact: Unbounded trajectory growth contributes to memory pressure and eventual crash.
Expected behavior: Trajectory files should have independent size limits with automatic truncation.
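A sketch of what byte-capped truncation could look like, dropping the oldest JSONL entries first; the cap and helper name are hypothetical, since OpenClaw currently has no such setting (that is the bug):

```typescript
// Trim a trajectory JSONL buffer to a byte cap by discarding the oldest
// lines. In practice this would run on the file before append, not on a
// string, but the policy is the same.
function truncateJsonl(content: string, maxBytes: number): string {
  if (Buffer.byteLength(content, "utf8") <= maxBytes) return content;
  const lines = content.split("\n").filter((l) => l.length > 0);
  // Drop oldest entries until the remainder fits under the cap.
  while (lines.length > 1) {
    const candidate = lines.join("\n") + "\n";
    if (Buffer.byteLength(candidate, "utf8") <= maxBytes) break;
    lines.shift();
  }
  return lines.join("\n") + "\n";
}
```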
## Bug 4: No Memory Cap on Gateway Process
Severity: 🟡 Medium — can affect other services
Description: Gateway RSS memory grew continuously: 718MB (post-upgrade) → 865MB → 1.0GB peak. No --max-old-space-size parameter is set on the Node.js process, and no systemd MemoryMax= limit exists.
Memory Timeline:

| Time | RSS Memory | Notes |
| --- | --- | --- |
| Upgrade (04/28) | 718MB | Fresh start |
| 16:42 restart | ~818MB | Before manual cleanup |
| 17:31 restart | 728MB | After config changes |
| 18:10 restart | 671MB → 706MB | 7 min growth |
| 18:36 restart | 865MB → peak 1.0GB | Current |
Expected behavior: Gateway should have configurable memory limits with graceful degradation or restart.
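Until a built-in limit exists, the process can be capped externally. A sketch of a systemd drop-in, assuming the unit is named `openclaw-gateway.service` (the unit name, path, and values are illustrative, not project defaults):

```ini
# Hypothetical drop-in:
# /etc/systemd/system/openclaw-gateway.service.d/memory.conf
[Service]
# Cap the V8 heap so Node.js collects garbage before hitting the hard limit
Environment=NODE_OPTIONS=--max-old-space-size=768
# Kernel-enforced ceiling; the unit is killed (and restarted) beyond it
MemoryMax=1G
Restart=on-failure
```

The heap cap should sit comfortably below `MemoryMax=` so V8 degrades gracefully instead of being OOM-killed mid-request.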
## Bug 5: Feishu Channel Heartbeat Ping Blocks Event Loop
Severity: 🟡 Medium — cascading failures
Description: The Feishu channel sends a periodic ping to https://open.feishu.cn/open-apis/bot/v1/openclaw_bot/ping. When this request times out (10 seconds), it blocks the Node.js event loop, causing all other HTTP requests to queue — including model API calls and other channel operations.
Log Evidence:

```
AxiosError: timeout of 10000ms exceeded
url: 'https://open.feishu.cn/open-apis/bot/v1/openclaw_bot/ping'
```
Impact: A single slow outbound request degrades the entire gateway. In our case, this created a cascading failure where the ping timeout contributed to model API timeouts, which in turn caused the stuck session.
Expected behavior: Heartbeat pings should run in a non-blocking manner with independent timeout handling.
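One way to achieve this is to give each ping its own `AbortController` timeout and never await it in the scheduling path, so a slow or dead endpoint only produces a log line. A sketch; the wrapper names are assumptions (the URL is the one from the logs):

```typescript
// Ping once with an independent timeout; failures degrade to `false`
// instead of propagating into unrelated request handling.
async function pingOnce(url: string, timeoutMs = 5_000): Promise<boolean> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: ctrl.signal });
    return res.ok;
  } catch {
    return false; // timeout or network error: report, don't cascade
  } finally {
    clearTimeout(timer);
  }
}

// Fire-and-forget scheduling: the interval callback never awaits the ping.
function startHeartbeat(url: string, intervalMs = 30_000) {
  return setInterval(() => {
    void pingOnce(url).then((ok) => {
      if (!ok) console.warn("feishu heartbeat ping failed");
    });
  }, intervalMs);
}
```

With this shape, a hung `https://open.feishu.cn/open-apis/bot/v1/openclaw_bot/ping` endpoint costs one warning per interval instead of queuing behind model API calls.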
## Bug 6: Gateway Service PATH Configuration Incomplete
Severity: 🔵 Low — cosmetic but confusing
Description: `openclaw gateway status` reports:

```
Gateway service PATH missing required dirs: /home/openclaw/.local/share/fnm/aliases/default/bin
Recommendation: run "openclaw doctor --repair"
```

The PATH is not auto-fixed despite `openclaw doctor --repair` being available.
## Root Cause Analysis
The failure chain is:

```
1. Feishu session transcript grows unbounded (Bug 2, Bug 3)
         ↓
2. Large transcript (2.7MB) loaded on every dispatch
         ↓
3. Model API request takes too long → timeout
         ↓
4. Node.js event loop blocked (Bug 5 amplifies this)
         ↓
5. Session enters stuck state (queueDepth=1, state=processing)
         ↓
6. Stuck session detected but NO action taken (Bug 1)
         ↓
7. All subsequent messages dropped (replies=0)
         ↓
8. Complete channel outage until manual restart
```
Restarting does not fix the problem because the large transcript is reloaded without compaction (Bug 2).
## Requested Actions
- Implement auto-recovery for stuck sessions (kill/reset/restart after configurable timeout)
- Trigger compaction at gateway startup for sessions exceeding configurable thresholds
- Add trajectory file size limits with automatic truncation
- Document recommended memory limits for Node.js gateway process
- Make Feishu ping non-blocking or add circuit breaker for failing outbound requests