Bug Report: Stuck sessions cause permanent gateway hang with no auto-recovery (v2026.4.26)

---
name: Bug Report
about: Stuck sessions cause gateway to become permanently unresponsive
labels: bug, stability, feishu
---

## Summary

OpenClaw v2026.4.26 with the Feishu WebSocket channel becomes **permanently unresponsive** when a session enters the `stuck` state. The diagnostic system detects the problem but takes no recovery action, resulting in a complete channel outage that requires manual intervention.

## Environment

- **OpenClaw**: v2026.4.26 (be8c246)
- **Channel**: Feishu (WebSocket mode)
- **Model**: bailian/qwen3.6-plus (Aliyun DashScope Coding Plan)
- **OS**: Linux x64, 16GB RAM (Ubuntu)
- **Node.js**: v22.22.2

## Timeline of Events

| Time (UTC+8) | Event |
|---|---|
| 16:25 | First stuck session detected (`stuck session age=150s`) on Feishu DM session |
| 16:25–16:32 | Stuck session alarm repeats every 30s: 150s → 180s → 210s → 240s → 270s → 300s → 330s → 360s → 390s → 420s → 450s → 480s → 510s |
| 16:33–16:38 | Feishu messages received but dispatch returns `replies=0` (responses silently dropped) |
| 16:42–18:36 | Gateway restarted 6+ times; each time it recovered briefly, then became stuck again |
| 18:36 | Gateway running but Feishu channel still non-responsive |

**Total outage duration**: >3 hours of repeated failures across multiple restarts.

---

## Bug 1: Stuck Session Detection Has No Auto-Recovery (Critical)

**Severity**: 🔴 Critical — complete service outage

**Description**: The `diagnostic` subsystem correctly detects stuck sessions and logs warnings, but **takes zero recovery action**. Sessions remain permanently in `state=processing` with `queueDepth=1`, blocking all subsequent messages to that session.

**Log Evidence**:
```
{"subsystem":"diagnostic","message":"stuck session: sessionId=main sessionKey=agent:main:feishu:direct:ou_xxx state=processing age=150s queueDepth=1"}
{"subsystem":"diagnostic","message":"stuck session: sessionId=main sessionKey=agent:main:feishu:direct:ou_xxx state=processing age=510s queueDepth=1"}
{"subsystem":"gateway/channels/feishu","message":"feishu[default]: dispatch complete (queuedFinal=false, replies=0)"}
```

The diagnostic fires every 30 seconds with increasing age (150s → 510s), but no kill, reset, or restart is triggered. The process becomes permanently unresponsive.
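
For triage, the `key=value` fields in these diagnostic messages can be extracted mechanically. The sketch below is a hypothetical helper (`parseStuckSession` is not an OpenClaw API) and assumes the message format is exactly what the log lines above show.

```typescript
// Hypothetical triage helper; not part of OpenClaw.
// Assumes the "stuck session: key=value ..." format seen in the log evidence.
interface StuckSessionEvent {
  sessionId: string;
  sessionKey: string;
  state: string;
  ageSeconds: number;
  queueDepth: number;
}

function parseStuckSession(message: string): StuckSessionEvent | null {
  if (!message.startsWith("stuck session:")) return null;

  // Collect every key=value pair; values contain no whitespace.
  const fields = new Map<string, string>();
  const pair = /(\w+)=(\S+)/g;
  let m: RegExpExecArray | null;
  while ((m = pair.exec(message)) !== null) {
    fields.set(m[1], m[2]);
  }

  const sessionId = fields.get("sessionId");
  const sessionKey = fields.get("sessionKey");
  const state = fields.get("state");
  const age = /^(\d+)s$/.exec(fields.get("age") ?? "");
  const depth = /^\d+$/.exec(fields.get("queueDepth") ?? "");
  if (!sessionId || !sessionKey || !state || !age || !depth) return null;

  return {
    sessionId,
    sessionKey,
    state,
    ageSeconds: Number(age[1]),
    queueDepth: Number(depth[0]),
  };
}

// Usage with the first diagnostic line from this report.
const line =
  '{"subsystem":"diagnostic","message":"stuck session: sessionId=main sessionKey=agent:main:feishu:direct:ou_xxx state=processing age=150s queueDepth=1"}';
const parsed = parseStuckSession(JSON.parse(line).message);
console.log(parsed?.ageSeconds, parsed?.queueDepth);
```

Feeding consecutive events through this helper makes it easy to confirm the 30-second cadence and that `queueDepth` never drains.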

**Impact**: The gateway is functionally dead and cannot process any messages. Only a manual restart helps, and even that is temporary if the root cause persists.

**Expected behavior**: 
- Stuck session timeout → kill the hung request
- Or: auto-reset the affected session
- Or: trigger gateway restart after N consecutive stuck detections
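
The three options above could be combined into one escalation policy. The following is a minimal sketch of such a decision function, not OpenClaw's implementation: `RecoveryPolicy`, `decideRecovery`, and the threshold values are all hypothetical and only illustrate the escalation order.

```typescript
// Hypothetical escalation policy; names and thresholds are illustrative only.
type RecoveryAction = "none" | "abort-request" | "reset-session" | "restart-gateway";

interface RecoveryPolicy {
  abortAfterSeconds: number; // abort the hung request once the session is this old
  resetAfterSeconds: number; // reset the session if it is still stuck at this age
  restartAfterDetections: number; // restart the gateway after N consecutive detections
}

function decideRecovery(
  ageSeconds: number,
  consecutiveDetections: number,
  policy: RecoveryPolicy,
): RecoveryAction {
  // Check the most drastic action first so escalation always wins.
  if (consecutiveDetections >= policy.restartAfterDetections) return "restart-gateway";
  if (ageSeconds >= policy.resetAfterSeconds) return "reset-session";
  if (ageSeconds >= policy.abortAfterSeconds) return "abort-request";
  return "none";
}

// Replaying ages from this report (one detection every 30s, starting at 150s).
const policy: RecoveryPolicy = {
  abortAfterSeconds: 180,
  resetAfterSeconds: 300,
  restartAfterDetections: 10,
};
console.log(decideRecovery(150, 1, policy)); // none
console.log(decideRecovery(180, 2, policy)); // abort-request
console.log(decideRecovery(300, 6, policy)); // reset-session
console.log(decideRecovery(420, 10, policy)); // restart-gateway
```

With thresholds like these, the session in this report would have been reset by the sixth warning instead of staying in `state=processing` indefinitely.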

---

## Bug 2: Compaction Does Not Trigger at Gateway Startup (Critical)

**Severity**: 🔴 Critical — prevents self-healing after restart

**Description**: When `compaction.mode: safeguard` and `compaction.maxActiveTranscriptBytes: "20mb"` are configured, compaction only runs as a preflight check before a new run. **At gateway startup, existing large transcript files are loaded without compaction.**

Our Feishu session transcript was **2.7MB / 1008 lines** (with a trajectory file that grew to 14MB before being deleted by OpenClaw). On every restart, this 2.7MB file is fully loaded into memory, contributing to the event loop overload that causes the stuck session in Bug 1.

Because 2.7MB is below the 20MB threshold, the preflight compaction check never triggers either, so the transcript keeps growing indefinitely.

**Impact**: Large transcripts survive restarts and immediately re-create the conditions that caused the original crash. The compaction config is effectively useless for existing sessions.

**Expected behavior**:
- Gateway startup should check transcript size and run compaction if needed
- Or: add a line-based compaction threshold (e.g., >500 lines) in addition to the byte-based one
- Or: compact all sessions on startup regardless of size
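
To illustrate why the byte threshold alone misses this transcript, here is a rough sketch of a startup check that also accepts a line threshold. Only the `"20mb"` value comes from our config; `parseSize`, `needsCompaction`, and `maxLines` are hypothetical, and the assumption that `mb` means 1024 × 1024 bytes may not match OpenClaw's own parsing.

```typescript
// Hypothetical names; only the "20mb" setting comes from the report.
interface TranscriptStats {
  bytes: number;
  lines: number;
}

interface CompactionThresholds {
  maxBytes: number;
  maxLines?: number; // proposed line-based threshold; not an existing option
}

// Assumption: "mb" means 1024 * 1024 bytes. OpenClaw's real parser may differ.
function parseSize(value: string): number {
  const m = /^(\d+(?:\.\d+)?)\s*(b|kb|mb|gb)$/i.exec(value.trim());
  if (m === null) throw new Error("unrecognized size: " + value);
  const unit = m[2].toLowerCase();
  const factor =
    unit === "gb" ? 1024 * 1024 * 1024 :
    unit === "mb" ? 1024 * 1024 :
    unit === "kb" ? 1024 : 1;
  return Math.round(Number(m[1]) * factor);
}

function needsCompaction(stats: TranscriptStats, t: CompactionThresholds): boolean {
  if (stats.bytes > t.maxBytes) return true;
  return t.maxLines !== undefined && stats.lines > t.maxLines;
}

// The transcript from this report: about 2.7MB and 1008 lines.
const stats: TranscriptStats = { bytes: Math.round(2.7 * 1024 * 1024), lines: 1008 };
const maxBytes = parseSize("20mb");
console.log(needsCompaction(stats, { maxBytes })); // false
console.log(needsCompaction(stats, { maxBytes, maxLines: 500 })); // true
```

Running a check like this once per session at gateway startup, before the transcript is loaded, would cover both the first and second options above.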

---

## Bug 3: Trajectory Files Grow Without Bound

**Severity**: 🟡 High — memory bomb

**Description**: The Feishu session trajectory file grew to **14MB** before OpenClaw eventually deleted it (leaving a `*.trajectory.jsonl.deleted.*` artifact). There is no configured size limit for trajectory files.

**Evidence**:
```
-rw------- 1 openclaw 14M agents/main/sessions/3e9dd919.trajectory.jsonl.deleted.2026-04-28T10-05-50.972Z
```

**Impact**: Unbounded trajectory growth contributes to memory pressure and eventual crash.

**Expected behavior**: Trajectory files should have independent size limits with automatic truncation.
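
One possible truncation strategy for a JSONL file is to keep only the newest records that fit within a byte budget. The sketch below shows the idea on an in-memory array of lines; `truncateJsonlTail` is a hypothetical helper, and we do not know how OpenClaw actually writes or rotates trajectory files.

```typescript
// Hypothetical helper: keep the newest JSONL records that fit in maxBytes.
// Each record is counted as its UTF-8 length plus one byte for the newline.
function truncateJsonlTail(lines: string[], maxBytes: number): string[] {
  const encoder = new TextEncoder();
  const kept: string[] = [];
  let total = 0;

  // Walk backwards so the most recent records are kept first.
  for (let i = lines.length - 1; i >= 0; i--) {
    const size = encoder.encode(lines[i]).length + 1;
    if (total + size > maxBytes) break;
    total += size;
    kept.push(lines[i]);
  }

  // Restore chronological order.
  return kept.reverse();
}

// Three 7-byte records (8 bytes each with the newline) and a 16-byte budget.
const records = ['{"n":1}', '{"n":2}', '{"n":3}'];
console.log(truncateJsonlTail(records, 16)); // keeps the last two records
```

Dropping whole records from the head keeps every remaining line valid JSON, which a plain byte-offset truncation would not guarantee.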

---

## Bug 4: No Memory Cap on Gateway Process

**Severity**: 🟡 Medium — can affect other services

**Description**: Gateway RSS memory grew continuously: 718MB (post-upgrade) → 865MB → 1.0GB peak. No `--max-old-space-size` parameter is set on the Node.js process, and no systemd `MemoryMax=` limit exists.

**Memory Timeline**:
| Time | RSS Memory | Notes |
|---|---|---|
| Upgrade (04/28) | 718MB | Fresh start |
| 16:42 restart | ~818MB | Before manual cleanup |
| 17:31 restart | 728MB | After config changes |
| 18:10 restart | 671MB → 706MB | 7 min growth |
| 18:36 restart | 865MB → peak 1.0GB | Current |

**Expected behavior**: Gateway should have configurable memory limits with graceful degradation or restart.
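
As a sketch of what "graceful degradation or restart" could mean, the function below maps an RSS reading to an action using a soft and a hard limit. The names and limit values are hypothetical. In Node.js the reading itself would come from `process.memoryUsage().rss`; note that `--max-old-space-size` caps only the V8 old-space heap, not total RSS, so an RSS check or a systemd `MemoryMax=` limit would still be needed alongside it.

```typescript
// Hypothetical soft/hard limit check; names and thresholds are illustrative.
type MemoryAction = "ok" | "degrade" | "restart";

interface MemoryLimits {
  softLimitBytes: number; // start shedding load (e.g. defer new sessions)
  hardLimitBytes: number; // request a graceful restart
}

function checkMemory(rssBytes: number, limits: MemoryLimits): MemoryAction {
  if (rssBytes >= limits.hardLimitBytes) return "restart";
  if (rssBytes >= limits.softLimitBytes) return "degrade";
  return "ok";
}

// RSS readings taken from the memory timeline in this report.
const MB = 1024 * 1024;
const limits: MemoryLimits = { softLimitBytes: 800 * MB, hardLimitBytes: 1024 * MB };
console.log(checkMemory(718 * MB, limits)); // ok
console.log(checkMemory(865 * MB, limits)); // degrade
console.log(checkMemory(1024 * MB, limits)); // restart
```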

---

## Bug 5: Feishu Channel Heartbeat Ping Blocks Event Loop

**Severity**: 🟡 Medium — cascading failures

**Description**: The Feishu channel sends a periodic ping to `https://open.feishu.cn/open-apis/bot/v1/openclaw_bot/ping`. When this request times out (10 seconds), it blocks the Node.js event loop, causing **all other HTTP requests to queue** — including model API calls and other channel operations.

**Log Evidence**:
```
AxiosError: timeout of 10000ms exceeded
url: 'https://open.feishu.cn/open-apis/bot/v1/openclaw_bot/ping'
```

**Impact**: A single slow outbound request degrades the entire gateway. In our case, this created a cascading failure where the ping timeout contributed to model API timeouts, which in turn caused the stuck session.

**Expected behavior**: Heartbeat pings should run in a non-blocking manner with independent timeout handling.
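
One way to stop a failing ping from being retried on every heartbeat is a small circuit breaker around the outbound call. This is a minimal sketch under our own assumptions: the `CircuitBreaker` class, its threshold, and its cooldown are hypothetical and do not describe how the Feishu channel is implemented. The clock is injectable so the behavior can be checked without waiting.

```typescript
// Hypothetical circuit breaker for an outbound heartbeat; not OpenClaw code.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True while closed, or once the cooldown has elapsed (half-open probe).
  canAttempt(): boolean {
    if (this.openedAt === null) return true;
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  // A successful call closes the breaker and clears the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // At or above the threshold, each failure (re)opens the breaker from now.
  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = this.now();
  }
}

// Usage: skip the ping while the breaker is open.
const pingBreaker = new CircuitBreaker(3, 30000);
if (pingBreaker.canAttempt()) {
  // Send the ping here, then call recordSuccess() or recordFailure().
}
```

After three consecutive timeouts this breaker would suppress pings for 30 seconds, so a slow Feishu endpoint costs at most one 10-second timeout per cooldown window rather than one per heartbeat.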

---

## Bug 6: Gateway Service PATH Configuration Incomplete

**Severity**: 🔵 Low — cosmetic but confusing

**Description**: `openclaw gateway status` reports:
```
Gateway service PATH missing required dirs: /home/openclaw/.local/share/fnm/aliases/default/bin
Recommendation: run "openclaw doctor --repair"
```

The gateway does not repair the service PATH automatically, even though `openclaw doctor --repair` is available.

---

## Root Cause Analysis

The failure chain is:

```
1. Feishu session transcript grows unbounded (Bug 2, Bug 3)
     ↓
2. Large transcript (2.7MB) loaded on every dispatch
     ↓
3. Model API request takes too long → timeout
     ↓
4. Node.js event loop blocked (Bug 5 amplifies this)
     ↓
5. Session enters stuck state (queueDepth=1, state=processing)
     ↓
6. Stuck session detected but NO action taken (Bug 1)
     ↓
7. All subsequent messages dropped (replies=0)
     ↓
8. Complete channel outage until manual restart
```

Restarting does not fix the problem because the large transcript is reloaded without compaction (Bug 2).

---

## Requested Actions

1. **Implement auto-recovery for stuck sessions** (kill/reset/restart after configurable timeout)
2. **Trigger compaction at gateway startup** for sessions exceeding configurable thresholds
3. **Add trajectory file size limits** with automatic truncation
4. **Document recommended memory limits** for Node.js gateway process
5. **Make Feishu ping non-blocking** or add circuit breaker for failing outbound requests

