Gateway process alive but event loop frozen: all HTTP requests silently time out #56733

@Kaiji-Z

Bug Report

Version: 2026.3.24 (cff6dc9)
Platform: WSL2 on Windows 11 (Linux 6.6.87.2-microsoft-standard-WSL2, x64)
Node.js: v24.14.0
Deployment: systemd user service (openclaw-gateway.service)

Summary

The Gateway process remains alive and systemd reports it as active (running), but the Node.js event loop becomes unresponsive. All outbound HTTP requests — both LLM providers (zai, openrouter) and channel APIs (Feishu) — time out. The Gateway enters a "zombie" state: it appears healthy to the service manager but is completely unable to process any messages. Recovery requires a manual restart.

This occurs consistently during low-activity hours (00:00–06:00) and has reproduced across 4 consecutive nights.

Observed Pattern (4 consecutive nights)

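| Date   | Feishu plugin registration | Silent gap | First timeout | Timeout type               |
|--------|----------------------------|------------|---------------|----------------------------|
| Mar 26 | 00:00:16 (4 groups)        | 19 min     | 00:19:38      | LLM (10s)                  |
| Mar 27 | 03:09:52 (4 groups)        | 10 min     | 03:20:03      | LLM (10s)                  |
| Mar 28 | 05:15:10 (5 groups)        | 19 min     | 05:34:12      | Feishu bot/info (10s)      |
| Mar 29 | 04:26:33 (4 groups)        | 21 min     | 04:47:09      | Feishu token refresh (30s) |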
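
As a cross-check on the silent gap column, the gap for each night can be recomputed from the registration and first-timeout timestamps above. This is a minimal sketch; the helper names are illustrative and not part of OpenClaw.

```javascript
// Recompute the "silent gap" for each night from the reported timestamps.
// Helper names are illustrative; nothing here is OpenClaw code.

// Convert "HH:MM:SS" to seconds since midnight.
function toSeconds(hms) {
  const [h, m, s] = hms.split(':').map(Number);
  return h * 3600 + m * 60 + s;
}

// Gap in minutes (rounded) between plugin registration and first timeout.
// Assumes both timestamps fall on the same calendar day, as they do here.
function gapMinutes(registeredAt, firstTimeoutAt) {
  return Math.round((toSeconds(firstTimeoutAt) - toSeconds(registeredAt)) / 60);
}

const nights = [
  { date: 'Mar 26', registeredAt: '00:00:16', firstTimeoutAt: '00:19:38' },
  { date: 'Mar 27', registeredAt: '03:09:52', firstTimeoutAt: '03:20:03' },
  { date: 'Mar 28', registeredAt: '05:15:10', firstTimeoutAt: '05:34:12' },
  { date: 'Mar 29', registeredAt: '04:26:33', firstTimeoutAt: '04:47:09' },
];

const gaps = nights.map((n) => gapMinutes(n.registeredAt, n.firstTimeoutAt));
console.log(gaps); // [ 19, 10, 19, 21 ]
```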

Timeline (typical night)

  1. Multiple agent sessions initialize simultaneously → Feishu plugin tools re-register (see #56695, "Feishu plugin tools re-register on every agent dispatch, producing excessive log noise")
  2. 10–21 minute period with zero log output
  3. A single HTTP request times out (first visible error)
  4. All subsequent HTTP requests cascade into timeouts — both LLM and Feishu API
  5. Gateway remains in this state indefinitely until manually restarted

Key Observations

  • The silent gap is the critical period: Plugin registration completes successfully, which shows the event loop was healthy at that moment. The freeze begins somewhere inside the gap, but nothing is logged during it: internal operations (WebSocket heartbeats, DNS resolution, TCP keepalive, token expiry checks, memory watch) produce no output at all.
  • The first API to time out varies: LLM (10s) and Feishu (10s/30s) requests have each been the first failure, so the problem is not tied to a specific endpoint.
  • systemd does NOT detect the failure: The process is alive and CPU/memory look normal. Without a watchdog (WatchdogSec=), systemd only tracks whether the main process exists, so it keeps reporting the service as active (running).
  • Does not reproduce during the day: With active user interactions and frequent log output, the issue has never occurred. The long idle periods at night appear to be required.
  • Diagnostic script shows no system-level issues: Memory (195.7M peak), file descriptors, TCP connections, DNS resolution, and network connectivity are all normal during the frozen state.

Log Evidence

Feishu plugin re-registration (normal, event loop alive)

Mar 29 04:26:33 node[408]: [plugins] feishu_doc: Registered feishu_doc, feishu_app_scopes
Mar 29 04:26:33 node[408]: [plugins] feishu_chat: Registered feishu_chat tool
Mar 29 04:26:33 node[408]: [plugins] feishu_wiki: Registered feishu_wiki tool
Mar 29 04:26:33 node[408]: [plugins] feishu_drive: Registered feishu_drive tool
Mar 29 04:26:33 node[408]: [plugins] feishu_bitable: Registered bitable tools

21-minute gap — no log output at all

(no entries between 04:26:33 and 04:47:09)

First timeout (cascade begins)

Mar 29 04:47:09 node[408]: [error]: AxiosError: timeout of 30000ms exceeded
  code: 'ECONNABORTED'
  config.url: 'https://open.feishu.cn/open-apis/auth/v3/tenant_access_token/internal'

Subsequent cascade — everything times out

Mar 29 07:15:38 node[408]: [agent/embedded] embedded run agent end: isError=true error=LLM request timed out.
Mar 29 07:26:09 node[408]: [error]: AxiosError: timeout of 30000ms exceeded
Mar 29 08:30:00 node[408]: [error]: AxiosError: timeout of 30000ms exceeded

Gateway still appears healthy to systemd

$ systemctl --user status openclaw-gateway
● openclaw-gateway.service - OpenClaw Gateway (v2026.3.24)
     Active: active (running) since Sun 2026-03-29 01:18:06 CST
   Main PID: 408 (node)
      Tasks: 15 (limit: 9342)
     Memory: 195.7M
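
For systemd to notice this state, the service has to send it keepalives. The following is a rough sketch of how a Node.js service could do that, not OpenClaw's implementation. It assumes the unit sets WatchdogSec= and NotifyAccess=all plus a restart policy such as Restart=on-watchdog, and that the systemd-notify binary is on PATH. The isHealthy probe is hypothetical; it stands in for whatever check the Gateway would consider proof that outbound requests still work. Spawning systemd-notify from a child process is a convenience with known reliability caveats, so treat this as a starting point only.

```javascript
// Sketch of a systemd watchdog pinger for a Node.js service.
// Assumptions (not from the issue): the unit sets WatchdogSec= and
// NotifyAccess=all, and systemd-notify is on PATH. isHealthy is hypothetical.
const { execFile } = require('node:child_process');

// systemd exports WATCHDOG_USEC (microseconds) when WatchdogSec= is set.
// Ping at half that period; return null when no watchdog is configured.
function watchdogPingIntervalMs(env) {
  const usec = Number(env.WATCHDOG_USEC);
  if (!Number.isFinite(usec) || usec <= 0) return null;
  return Math.floor(usec / 1000 / 2);
}

// Start pinging systemd, but only while the health probe passes.
// If the probe fails or the event loop stalls, pings stop and systemd
// restarts the service once WatchdogSec= elapses.
function startWatchdog(isHealthy, env = process.env) {
  const intervalMs = watchdogPingIntervalMs(env);
  if (intervalMs === null) return null;
  const timer = setInterval(async () => {
    try {
      if (await isHealthy()) {
        execFile('systemd-notify', ['WATCHDOG=1'], (err) => {
          if (err) console.error('[watchdog] notify failed:', err.message);
        });
      }
    } catch (err) {
      console.error('[watchdog] health probe threw:', err);
    }
  }, intervalMs);
  timer.unref(); // the watchdog alone should not keep the process alive
  return timer;
}
```

Gating the ping on a health probe matters here: a ping sent unconditionally from a timer would only catch a fully blocked event loop, not a Gateway whose outbound requests all time out while its timers still fire.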

Environment

  • OS: WSL2 (Ubuntu) on Windows 11, networkingMode=mirrored
  • Memory: 6GB allocated to WSL, swap 2GB
  • systemd: enabled (wsl.conf), linger enabled
  • Channels: Feishu (WebSocket long connection), webchat
  • LLM Providers: zai (GLM-5-turbo), openrouter (Qwen3)
  • Cron jobs: rclone sync (every hour, now changed to every 4 hours), heartbeat polling
  • Multiple agents configured: 4 agents with Feishu plugin tools

Workarounds Applied

  1. Reduced rclone sync frequency (hourly → every 4 hours) to reduce I/O pressure
  2. Monitoring via heartbeat checks and HEARTBEAT.md
  3. Manual restart when detected

Suggested Fixes

  1. Event loop health monitoring: Add a periodic setInterval that logs event loop latency (e.g., Date.now() - expectedTick). If latency exceeds a threshold (e.g., 30s), log a warning with a stack trace to identify the blocking operation.

  2. Watchdog with auto-restart: Allow configuring a health check endpoint or command, and have the Gateway send sd_notify keepalives (WATCHDOG=1) only while that check passes, so that systemd can use WatchdogSec= to detect and automatically restart a frozen Gateway.

  3. Reduce silent gap: Log periodic heartbeat/internal activity so the freeze period is visible in logs and can be correlated with internal operations.

  4. Related: #56695, "Feishu plugin tools re-register on every agent dispatch, producing excessive log noise". Re-registering the Feishu plugin tools on every dispatch adds unnecessary overhead during idle periods.
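
The event loop health monitoring in fix 1 could look roughly like the sketch below. It is an illustration under assumptions, not OpenClaw code: createLagMonitor, its option names, and the injected clock are all hypothetical. A timer is scheduled every intervalMs, and the amount by which it fires late is reported as lag. Note that a late timer only shows that a stall happened, not what caused it. Node's built-in perf_hooks.monitorEventLoopDelay() is an alternative that yields a delay histogram.

```javascript
// Minimal event-loop lag monitor (a sketch, not OpenClaw's implementation).
// A timer is scheduled every intervalMs; if it fires late, the difference
// is how long the loop was unable to run timers.
function createLagMonitor({ intervalMs = 1000, thresholdMs = 30000, onLag, now = Date.now }) {
  let expected = now() + intervalMs;

  // Runs on every timer tick; exposed so it can also be driven manually.
  function tick() {
    const current = now();
    const lagMs = Math.max(0, current - expected);
    expected = current + intervalMs;
    if (lagMs >= thresholdMs) onLag(lagMs);
    return lagMs;
  }

  // Begin monitoring; returns a function that stops it.
  function start() {
    expected = now() + intervalMs;
    const timer = setInterval(tick, intervalMs);
    timer.unref(); // do not keep the process alive just for monitoring
    return () => clearInterval(timer);
  }

  return { tick, start };
}

// Example wiring (hypothetical log format):
// const stop = createLagMonitor({
//   onLag: (ms) => console.warn(`[health] event loop lag ${ms}ms`),
// }).start();
```

If the reported lag stays near zero while requests keep timing out, the stall is more likely in the network path (sockets, DNS, proxy) than in the event loop itself, which would narrow down the root causes listed below.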

Possible Root Causes (speculative)

  • The Feishu WebSocket long connection silently disconnects and reconnection blocks the event loop
  • DNS resolution via dns.lookup runs blocking getaddrinfo calls on the libuv thread pool (4 threads by default); if every pool thread is busy, lookups queue up and requests stall even though the event loop itself is not blocked
  • Dead TCP connections are not cleaned up, so subsequent requests hang on stale sockets
  • A long garbage collection pause during the idle period

None of these could be confirmed because nothing is logged during the freeze period.
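
One way to test the DNS hypothesis during the next freeze is a timed lookup run inside the Gateway process. The sketch below is an assumption-laden illustration: withTimeout and probeDns are hypothetical helpers, and the 5s default is arbitrary. A lookup that times out here while the same hostname resolves quickly from a shell would point at the libuv thread pool rather than the resolver. The pool size can be raised with the UV_THREADPOOL_SIZE environment variable.

```javascript
// Sketch: time a dns.lookup call and fail fast if it stalls.
// withTimeout and probeDns are hypothetical helpers, not OpenClaw APIs.
const dns = require('node:dns');

// Reject with code ETIMEDOUT if the promise does not settle within ms.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`${label} did not settle within ${ms}ms`);
      err.code = 'ETIMEDOUT';
      reject(err);
    }, ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Resolve a hostname through dns.lookup (getaddrinfo on the libuv pool)
// and report how long it took.
async function probeDns(hostname, timeoutMs = 5000) {
  const startedAt = Date.now();
  const { address } = await withTimeout(
    dns.promises.lookup(hostname),
    timeoutMs,
    `dns.lookup(${hostname})`
  );
  return { hostname, address, elapsedMs: Date.now() - startedAt };
}

// Example (hostname taken from the log evidence above):
// probeDns('open.feishu.cn').then(console.log, console.error);
```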
