Skip to content

Stuck session ghost blocks event loop, causes CLI timeout and sustained high CPU (11.5hr+) #77115

@mmhzlrj

Description

@mmhzlrj

Summary

A zombie/stuck session (sessionId=41befe2c-) persists in the Gateway's in-memory state machine for 11.5+ hours while absent from the session store. The diagnostic system repeatedly detects it as stuck, causing sustained 48% CPU, event loop P99 delays up to 12.8s, and CLI commands receiving SIGKILL.

Environment

  • OpenClaw version: 2026.4.27 (cbc2ba0)
  • Gateway: macOS (Darwin 21.6.0, x64), node v22.22.0, port 18789
  • Model: deepseek/deepseek-v4-pro (main), minimax-portal/MiniMax-M2.7 (subagents)
  • Setup: MCP servers (playwright, qwen, kimi, glm, doubao x5 all stdio), Ubuntu node (192.168.1.18)

Symptoms

1. Stuck Session Ghost

[diagnostic] stuck session: sessionId=41befe2c-8c01-4f75-a662-2d47eb34d7bc
  sessionKey=agent:main:main state=processing age=41514s queueDepth=0
  • First detected: ~10:00 AM 2026-05-04
  • Age at detection: 11,514 seconds (~3.2 hours into a 11.5+ hour run)
  • state=processing, queueDepth oscillates 0→1→0
  • NOT in session store~/.openclaw/agents/main/sessions/ has no files matching *41befe2c*
  • NOT in sessions_listsessions_list({search:"41befe2c"}) returns count=0
  • NOT in sessions.json — no entry in store metadata

2. Event Loop Degradation

[diagnostic] liveness warning: eventLoopDelayP99Ms=12876.5 eventLoopUtilization=0.998
  cpuCoreRatio=1.289 active=1 waiting=0 queued=5
  • P99 event loop delay spiked to 12,876ms (12.8 seconds)
  • Event loop utilization hitting 0.998 (99.8%)
  • CPU core ratio at 1.289 with 5 queued requests
  • Gateway process (PID 23979): 48.7% sustained CPU, 1.5GB RSS

3. CLI Universal Timeout

  • openclaw cron list → SIGKILL
  • openclaw cron rm → SIGKILL
  • openclaw gateway status → SIGKILL
  • openclaw sessions list → hanging
  • All CLI commands became non-functional

4. Exec Timeout Cascades

[tools] exec failed: TIMEOUT: node invoke timed out
  raw_params={"command":"sleep 300 && cat /tmp/...progress.json","host":"node","timeout":310}

Subagents calling exec({host:"node"}) with sleep-based polling hit repeated timeouts, further blocking the event loop.

Root Cause Analysis

The stuck session likely originates from repeated openclaw cron CLI invocations that timed out during cron list/rm operations. Each CLI invocation creates a Gateway operator session. When multiple CLI calls timeout simultaneously, the session lifecycle cleanup may fail to remove the session from the Gateway's internal state tracking, even though the session store is properly cleaned.

Key observations:

  • Session exists only in memory (diagnostic system) but not on disk (store)
  • The diagnostic stuck session checker runs every 30 seconds and detects it, likely doing a tight-loop scan each time
  • The Gateway restart (clearing in-memory state) resolved the issue immediately

Reproduction Steps

  1. Run multiple openclaw cron list/rm CLI commands rapidly during high Gateway load
  2. Wait for CLI timeout (SIGKILL)
  3. Observe diagnostic stuck session entries persisting in gateway.err.log
  4. Observe declining event loop health and rising CPU

Expected Behavior

  • Stale session references should be cleaned from the in-memory state tracker within a reasonable timeout (e.g., 5 minutes)
  • Session lifecycle cleanup should handle SIGKILL/timeout scenarios gracefully
  • The diagnostic stuck session detector should not tight-loop on the same ghost session indefinitely

Actual Behavior

  • Ghost session persisted 11.5+ hours until Gateway restart
  • Sustained 48% CPU for 11+ hours
  • CLI became non-functional
  • No automatic recovery

Fix

Gateway restart cleared all in-memory state and restored normal operation (CPU dropped to normal, CLI responsive).

Suggested code fix:

  1. Add a max-age threshold for stuck session tracking (e.g., if a session has been "stuck" in processing state for > 10 minutes and doesn't exist in the session store, evict it from the state tracker)
  2. Add exponential backoff to the stuck-session checker to avoid tight-looping on the same stuck reference
  3. Ensure session cleanup on CLI timeout/SIGKILL scenarios covers both store AND in-memory state

Impact

  • Severity: High (causes Gateway degradation, blocks agent operation)
  • Frequency: Triggered 1x so far (correlated with CLI timeout during cron operations)
  • Recovery: Only Gateway restart resolves

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions