Gateway process alive but event loop frozen: all HTTP requests silently time out #56733

## Bug Report

**Version:** 2026.3.24 (cff6dc9)
**Platform:** WSL2 on Windows 11 (Linux 6.6.87.2-microsoft-standard-WSL2, x64)
**Node.js:** v24.14.0
**Deployment:** systemd user service (`openclaw-gateway.service`)

## Summary

The Gateway process remains alive and systemd reports it as `active (running)`, but the Node.js event loop becomes unresponsive. All outbound HTTP requests — both LLM providers (zai, openrouter) and channel APIs (Feishu) — time out. The Gateway enters a "zombie" state: it appears healthy to the service manager but is completely unable to process any messages. Recovery requires a manual restart.

This occurs consistently during low-activity hours (00:00–06:00) and has reproduced across 4 consecutive nights.

## Observed Pattern (4 consecutive nights)

| Date | Feishu Plugin Registration | Silent Gap | First Timeout | Timeout Type |
|------|---------------------------|------------|---------------|--------------|
| Mar 26 | 00:00:16 (4 groups) | 19 min | 00:19:38 | LLM (10s) |
| Mar 27 | 03:09:52 (4 groups) | 10 min | 03:20:03 | LLM (10s) |
| Mar 28 | 05:15:10 (5 groups) | 19 min | 05:34:12 | Feishu bot/info (10s) |
| Mar 29 | 04:26:33 (4 groups) | 21 min | 04:47:09 | Feishu token refresh (30s) |
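
The Silent Gap column can be recomputed from the two timestamps in each row. A minimal sketch, assuming both timestamps fall on the same calendar day; the helper names `toSeconds` and `silentGapMinutes` are illustrative and not part of the Gateway:

```javascript
'use strict';

// Parse "HH:MM:SS" into seconds since midnight.
function toSeconds(hms) {
  const [h, m, s] = hms.split(':').map(Number);
  return h * 3600 + m * 60 + s;
}

// Minutes between the last plugin registration and the first logged timeout.
// Assumes both timestamps are on the same day (true for all four nights).
function silentGapMinutes(registeredAt, firstTimeoutAt) {
  return (toSeconds(firstTimeoutAt) - toSeconds(registeredAt)) / 60;
}

// Timestamps copied from the table above.
const nights = [
  { date: 'Mar 26', registeredAt: '00:00:16', firstTimeoutAt: '00:19:38' },
  { date: 'Mar 27', registeredAt: '03:09:52', firstTimeoutAt: '03:20:03' },
  { date: 'Mar 28', registeredAt: '05:15:10', firstTimeoutAt: '05:34:12' },
  { date: 'Mar 29', registeredAt: '04:26:33', firstTimeoutAt: '04:47:09' },
];

for (const n of nights) {
  const gap = silentGapMinutes(n.registeredAt, n.firstTimeoutAt);
  console.log(`${n.date}: ${Math.round(gap)} min`);
}
```

The rounded results are 19, 10, 19 and 21 minutes, matching the table. Each First Timeout value is the moment the client-side timeout fired, so the failing request was issued 10s or 30s before the logged time.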

### Timeline (typical night)

1. Multiple agent sessions initialize simultaneously → Feishu plugin tools re-register (see #56695)
2. 10–21 minute period with **zero log output**
3. A single HTTP request times out (first visible error)
4. **All subsequent HTTP requests** cascade into timeouts — both LLM and Feishu API
5. Gateway remains in this state indefinitely until manually restarted

### Key Observations

- **The silent gap is the critical period**: Plugin registration completes successfully, so the event loop was healthy at that moment. The freeze begins somewhere inside the gap, yet **nothing is logged** during it: none of the internal operations (WebSocket heartbeats, DNS resolution, TCP keepalive, token expiry checks, memory watch) produces any output.
- **The first API to time out varies**: LLM (10s) and Feishu (10s/30s) requests have each been the first failure, so the problem is not tied to a specific endpoint.
- **systemd does NOT detect the failure**: The process is alive, CPU/memory look normal, so systemd considers it healthy.
- **Not reproduced during the day**: With active user interactions and frequent log output, the issue has never occurred. Long idle periods, as at night, appear to be a precondition.
- **Diagnostic script shows no system-level issues**: Memory (195.7M peak), file descriptors, TCP connections, DNS resolution, and network connectivity are all normal during the frozen state.

## Log Evidence

### Feishu plugin re-registration (normal, event loop alive)
```
Mar 29 04:26:33 node[408]: [plugins] feishu_doc: Registered feishu_doc, feishu_app_scopes
Mar 29 04:26:33 node[408]: [plugins] feishu_chat: Registered feishu_chat tool
Mar 29 04:26:33 node[408]: [plugins] feishu_wiki: Registered feishu_wiki tool
Mar 29 04:26:33 node[408]: [plugins] feishu_drive: Registered feishu_drive tool
Mar 29 04:26:33 node[408]: [plugins] feishu_bitable: Registered bitable tools
```

### 21-minute gap — no log output at all
```
(no entries between 04:26:33 and 04:47:09)
```

### First timeout (cascade begins)
```
Mar 29 04:47:09 node[408]: [error]: AxiosError: timeout of 30000ms exceeded
  code: 'ECONNABORTED'
  config.url: 'https://open.feishu.cn/open-apis/auth/v3/tenant_access_token/internal'
```

### Subsequent cascade — everything times out
```
Mar 29 07:15:38 node[408]: [agent/embedded] embedded run agent end: isError=true error=LLM request timed out.
Mar 29 07:26:09 node[408]: [error]: AxiosError: timeout of 30000ms exceeded
Mar 29 08:30:00 node[408]: [error]: AxiosError: timeout of 30000ms exceeded
```

### Gateway still appears healthy to systemd
```
$ systemctl --user status openclaw-gateway
● openclaw-gateway.service - OpenClaw Gateway (v2026.3.24)
     Active: active (running) since Sun 2026-03-29 01:18:06 CST
   Main PID: 408 (node)
      Tasks: 15 (limit: 9342)
     Memory: 195.7M
```

## Environment

- **OS:** WSL2 (Ubuntu) on Windows 11, `networkingMode=mirrored`
- **Memory:** 6GB allocated to WSL, swap 2GB
- **systemd:** enabled (`wsl.conf`), linger enabled
- **Channels:** Feishu (WebSocket long connection), webchat
- **LLM Providers:** zai (GLM-5-turbo), openrouter (Qwen3)
- **Cron jobs:** rclone sync (every hour, now changed to every 4 hours), heartbeat polling
- **Multiple agents configured:** 4 agents with Feishu plugin tools

## Workarounds Applied

1. Reduced rclone sync frequency (hourly → every 4 hours) to reduce I/O pressure
2. Monitoring via heartbeat checks and HEARTBEAT.md
3. Manual restart when detected

## Suggested Fixes

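1. **Event loop health monitoring**: Add a periodic `setInterval` that logs event loop latency (e.g., `Date.now() - expectedTick`) and emits a warning when it exceeds a threshold (e.g., 30s). Because the timer callback only runs once the loop is free again, a stack trace taken there will not show the blocking call; identifying the blocking operation would likely need a sampler outside the main thread (e.g., a worker thread or an inspector CPU profile).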

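2. **Watchdog with auto-restart**: Support systemd's watchdog protocol so that a frozen Gateway is detected and restarted automatically. `WatchdogSec=` does not poll an endpoint or run a command; it expects the service to send `WATCHDOG=1` keep-alives via `sd_notify` within the configured interval. If those are sent from an event-loop timer, a stalled loop misses the deadline and `Restart=on-watchdog` (or `Restart=on-failure`) restarts the unit. A configurable health check endpoint or command, probed externally, would complement this.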

3. **Reduce silent gap**: Log periodic heartbeat/internal activity so the freeze period is visible in logs and can be correlated with internal operations.

4. **Related**: #56695 (Feishu plugin re-registration on every dispatch adds unnecessary overhead during idle periods)
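
Suggestions 1 and 3 could be covered by one small timer. The following is a sketch only: `createLagDetector` and `startLoopMonitor` are hypothetical names, not existing Gateway functions, and the interval and threshold defaults are placeholders rather than tuned values.

```javascript
'use strict';

// Hypothetical helper: measures how late each tick runs relative to its schedule.
// `now` is injectable so the drift logic can be exercised without real timers.
function createLagDetector({ intervalMs = 1000, thresholdMs = 5000, now = Date.now } = {}) {
  let expected = now() + intervalMs;
  return function tick() {
    const current = now();
    const lagMs = Math.max(0, current - expected);
    expected = current + intervalMs; // re-arm relative to the actual tick time
    return { lagMs, stalled: lagMs >= thresholdMs };
  };
}

// Hypothetical wiring: runs the detector on a real timer.
// The warning is logged after the stall ends, because the callback
// cannot run while the loop is blocked.
function startLoopMonitor({
  intervalMs = 1000,
  thresholdMs = 5000,
  log = console.warn,
  onTick = () => {},
} = {}) {
  const tick = createLagDetector({ intervalMs, thresholdMs });
  const timer = setInterval(() => {
    const { lagMs, stalled } = tick();
    if (stalled) log(`[loop-monitor] event loop was blocked for ~${lagMs}ms`);
    onTick(lagMs); // e.g. periodic heartbeat log line
  }, intervalMs);
  timer.unref(); // monitoring alone should not keep the process alive
  return () => clearInterval(timer);
}
```

Calling `startLoopMonitor()` once at startup would be enough. The `onTick` hook is one place to emit the periodic heartbeat line from suggestion 3 or to send a watchdog keep-alive. Node's built-in `perf_hooks.monitorEventLoopDelay()` offers a histogram-based alternative to the hand-rolled drift calculation.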

## Possible Root Causes (speculative)

- The Feishu WebSocket long connection silently disconnects, and the reconnection blocks the event loop
- DNS resolution via `dns.lookup` stalls: it runs the blocking `getaddrinfo` call on the libuv thread pool (4 threads by default), so if every thread is stuck, new lookups queue and outbound requests never start connecting
- Dead TCP connections are not cleaned up, so subsequent requests hang on stale sockets
- A long garbage-collection pause during the idle period

Unable to confirm due to zero log output during the freeze period.
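
The DNS hypothesis could be checked with a small probe run periodically inside the Gateway process. This is a sketch under assumptions: `timedLookup` is a hypothetical helper, and the 5s default is arbitrary. Because a timer can only fire while the event loop is turning, a `timedOut: true` result would point to a stuck lookup rather than a fully frozen loop.

```javascript
'use strict';
const dns = require('node:dns');

// Hypothetical probe: resolves `hostname` and reports the elapsed time,
// or reports that the lookup did not finish within `timeoutMs`.
// `lookup` is injectable for testing; it defaults to dns.lookup.
function timedLookup(hostname, timeoutMs = 5000, lookup = dns.lookup) {
  return new Promise((resolve) => {
    const startedAt = Date.now();
    let settled = false;

    const finish = (result) => {
      if (settled) return; // ignore a late callback after the timeout
      settled = true;
      clearTimeout(timer);
      resolve({ hostname, elapsedMs: Date.now() - startedAt, ...result });
    };

    const timer = setTimeout(() => finish({ timedOut: true }), timeoutMs);

    lookup(hostname, (err, address) => {
      finish({
        timedOut: false,
        error: err ? err.code : null,
        address: address ?? null,
      });
    });
  });
}
```

For example, `timedLookup('open.feishu.cn').then((r) => console.log('[dns-probe]', JSON.stringify(r)))` on an interval would leave a trace in the journal during the otherwise silent gap.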
