Gateway process alive but event loop frozen — all HTTP requests silently timeout #56733
Open
Labels
P1 - High-priority user-facing bug, regression, or broken workflow.
clawsweeper:fix-shape-clear - ClawSweeper found a clear likely implementation shape for this issue.
clawsweeper:needs-live-repro - ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
clawsweeper:needs-maintainer-review - ClawSweeper marked this issue as needing maintainer review before automation.
clawsweeper:needs-product-decision - ClawSweeper marked this issue as needing a product or behavior decision.
clawsweeper:no-new-fix-pr - ClawSweeper does not recommend queueing a new automated fix PR for this issue.
impact:crash-loop - Crash, hang, restart loop, or process-level availability failure.
impact:message-loss - Channel message delivery can be lost, duplicated, or misrouted.
issue-rating: 🐚 platinum hermit - Good issue quality with a plausible reproduction path needing some confirmation.
stale - Marked as stale due to inactivity
Bug Report
Version: 2026.3.24 (cff6dc9)
Platform: WSL2 on Windows 11 (Linux 6.6.87.2-microsoft-standard-WSL2, x64)
Node.js: v24.14.0
Deployment: systemd user service (openclaw-gateway.service)
Summary
The Gateway process remains alive and systemd reports it as active (running), but the Node.js event loop becomes unresponsive. All outbound HTTP requests, to both LLM providers (zai, openrouter) and channel APIs (Feishu), time out. The Gateway enters a "zombie" state: it appears healthy to the service manager but is completely unable to process any messages. Recovery requires a manual restart.
This occurs consistently during low-activity hours (00:00–06:00) and has reproduced across 4 consecutive nights.
Observed Pattern (4 consecutive nights)
Timeline (typical night)
Key Observations
Log Evidence
Feishu plugin re-registration (normal, event loop alive)
21-minute gap — no log output at all
First timeout (cascade begins)
Subsequent cascade — everything times out
Gateway still appears healthy to systemd
Environment
networkingMode=mirrored
wsl.conf), linger enabled
Workarounds Applied
Suggested Fixes
Event loop health monitoring: Add a periodic setInterval that logs event loop latency (e.g., Date.now() - expectedTick). If latency exceeds a threshold (e.g., 30s), log a warning with a stack trace to identify the blocking operation.
Watchdog with auto-restart: Allow configuring a health check endpoint or command that systemd can use with WatchdogSec= to detect and automatically restart a frozen Gateway.
Reduce silent gap: Log periodic heartbeat/internal activity so the freeze period is visible in logs and can be correlated with internal operations.
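A minimal sketch of the event loop health monitoring idea, in case it helps the discussion. It follows the Date.now() - expectedTick approach described in the suggested fixes. The names tickLagMs, startLagMonitor, and LagMonitorOptions, and the interval and threshold values, are illustrative assumptions and not existing Gateway APIs. One caveat: a timer cannot fire while the loop is blocked, so the warning is emitted only once the loop resumes. If the loop never recovers, nothing is logged.

```typescript
// Sketch only: names and thresholds are assumptions, not OpenClaw code.

/** How late a tick fired relative to when it was due, in ms (never negative). */
function tickLagMs(expectedTick: number, now: number): number {
  return Math.max(0, now - expectedTick);
}

interface LagMonitorOptions {
  intervalMs: number;            // how often the monitor ticks
  warnThresholdMs: number;       // lag at or above this triggers onLag
  onLag: (lagMs: number) => void;
}

/** Starts the monitor and returns a function that stops it. */
function startLagMonitor(opts: LagMonitorOptions): () => void {
  let expectedTick = Date.now() + opts.intervalMs;
  const timer = setInterval(() => {
    const now = Date.now();
    const lag = tickLagMs(expectedTick, now);
    if (lag >= opts.warnThresholdMs) {
      opts.onLag(lag);
    }
    // Re-base on the actual time so one long stall is reported once.
    expectedTick = now + opts.intervalMs;
  }, opts.intervalMs);
  return () => clearInterval(timer);
}
```

Usage could look like startLagMonitor({ intervalMs: 5000, warnThresholdMs: 30000, onLag: (ms) => console.warn(`event loop stalled for ~${ms} ms`) }). Because Date.now() is wall-clock time, a clock adjustment would also show up as lag, so a monotonic clock may be a better choice in practice.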
Related: Feishu plugin tools re-register on every agent dispatch, producing excessive log noise #56695 (Feishu plugin re-registration on every dispatch adds unnecessary overhead during idle periods)
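For the watchdog suggestion, one possible shape is to send the systemd keep-alive from a timer on the event loop itself, so that a frozen loop stops pinging and systemd can act on it. This is a sketch under assumptions: watchdogPingIntervalMs and startWatchdogPings are hypothetical names, and it shells out to the systemd-notify CLI rather than using a native sd_notify binding. The unit would presumably need WatchdogSec= set, Restart=on-watchdog (or similar), and NotifyAccess=all, since the notification comes from a child process rather than the main PID. Notifications from a short-lived systemd-notify process are not always attributed reliably, so this would need live validation.

```typescript
import { execFile } from "node:child_process";

// systemd exports WATCHDOG_USEC (microseconds) to services that have
// WatchdogSec= configured. Pinging at half that interval is the usual guidance.
function watchdogPingIntervalMs(
  env: Record<string, string | undefined>,
): number | null {
  const usec = Number(env.WATCHDOG_USEC);
  if (!Number.isFinite(usec) || usec <= 0) return null; // watchdog not enabled
  return Math.floor(usec / 1000 / 2);
}

// Hypothetical helper: returns a stop function, or null if no watchdog is set.
function startWatchdogPings(
  env: Record<string, string | undefined> = process.env,
): (() => void) | null {
  const intervalMs = watchdogPingIntervalMs(env);
  if (intervalMs === null) return null;
  const timer = setInterval(() => {
    // This callback only runs while the event loop is responsive.
    execFile("systemd-notify", ["WATCHDOG=1"], (err) => {
      if (err) console.warn("watchdog ping failed:", err.message);
    });
  }, intervalMs);
  return () => clearInterval(timer);
}
```

With WatchdogSec=60, for example, WATCHDOG_USEC would be 60000000 and the sketch would ping every 30 seconds.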
Possible Root Causes (speculative)
dns.lookup (synchronous in libuv thread pool) blocks when thread pool is exhausted
Unable to confirm due to zero log output during the freeze period.
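To help confirm or rule out the dns.lookup theory, a periodic probe that times a lookup could be logged alongside normal output. This is a sketch, not a diagnosis: probeLookup, isSlow, and the 2000 ms threshold are assumptions. dns.lookup runs getaddrinfo on the libuv thread pool (4 threads by default, adjustable via UV_THREADPOOL_SIZE), so a lookup that queues behind other pool work would show up as a long elapsed time. If the event loop itself is frozen, this probe will not run either, which would itself point away from pure thread pool exhaustion.

```typescript
import { lookup } from "node:dns/promises";

// Hypothetical helper: classify a probe duration against a threshold.
function isSlow(elapsedMs: number, thresholdMs: number): boolean {
  return elapsedMs >= thresholdMs;
}

interface LookupProbeResult {
  hostname: string;
  elapsedMs: number;
  ok: boolean;   // false if the lookup rejected
  slow: boolean; // true if elapsedMs reached the threshold
}

// Times a single lookup; thresholdMs is an illustrative default.
async function probeLookup(
  hostname: string,
  thresholdMs = 2000,
): Promise<LookupProbeResult> {
  const start = Date.now();
  let ok = true;
  try {
    await lookup(hostname); // resolved via getaddrinfo on a thread pool worker
  } catch {
    ok = false;
  }
  const elapsedMs = Date.now() - start;
  return { hostname, elapsedMs, ok, slow: isSlow(elapsedMs, thresholdMs) };
}
```

Logging the result every minute or so during the 00:00–06:00 window would show whether lookups slow down before the first timeout.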