Summary
A zombie/stuck session (sessionId=41befe2c-) persists in the Gateway's in-memory state machine for 11.5+ hours while absent from the session store. The diagnostic system repeatedly detects it as stuck, causing sustained 48% CPU, event loop P99 delays up to 12.8s, and CLI commands receiving SIGKILL.
Environment
- OpenClaw version: 2026.4.27 (cbc2ba0)
- Gateway: macOS (Darwin 21.6.0, x64), node v22.22.0, port 18789
- Model: deepseek/deepseek-v4-pro (main), minimax-portal/MiniMax-M2.7 (subagents)
- Setup: MCP servers (playwright, qwen, kimi, glm, doubao x5 all stdio), Ubuntu node (192.168.1.18)
Symptoms
1. Stuck Session Ghost
[diagnostic] stuck session: sessionId=41befe2c-8c01-4f75-a662-2d47eb34d7bc
sessionKey=agent:main:main state=processing age=41514s queueDepth=0
- First detected: ~10:00 AM 2026-05-04
- Age at detection: 11,514 seconds (~3.2 hours into a 11.5+ hour run)
- state=processing, queueDepth oscillates 0→1→0
- NOT in session store —
~/.openclaw/agents/main/sessions/ has no files matching *41befe2c*
- NOT in sessions_list —
sessions_list({search:"41befe2c"}) returns count=0
- NOT in sessions.json — no entry in store metadata
2. Event Loop Degradation
[diagnostic] liveness warning: eventLoopDelayP99Ms=12876.5 eventLoopUtilization=0.998
cpuCoreRatio=1.289 active=1 waiting=0 queued=5
- P99 event loop delay spiked to 12,876ms (12.8 seconds)
- Event loop utilization hitting 0.998 (99.8%)
- CPU core ratio at 1.289 with 5 queued requests
- Gateway process (PID 23979): 48.7% sustained CPU, 1.5GB RSS
3. CLI Universal Timeout
openclaw cron list → SIGKILL
openclaw cron rm → SIGKILL
openclaw gateway status → SIGKILL
openclaw sessions list → hanging
- All CLI commands became non-functional
4. Exec Timeout Cascades
[tools] exec failed: TIMEOUT: node invoke timed out
raw_params={"command":"sleep 300 && cat /tmp/...progress.json","host":"node","timeout":310}
Subagents calling exec({host:"node"}) with sleep-based polling hit repeated timeouts, further blocking the event loop.
Root Cause Analysis
The stuck session likely originates from repeated openclaw cron CLI invocations that timed out during cron list/rm operations. Each CLI invocation creates a Gateway operator session. When multiple CLI calls timeout simultaneously, the session lifecycle cleanup may fail to remove the session from the Gateway's internal state tracking, even though the session store is properly cleaned.
Key observations:
- Session exists only in memory (diagnostic system) but not on disk (store)
- The diagnostic stuck session checker runs every 30 seconds and detects it, likely doing a tight-loop scan each time
- The Gateway restart (clearing in-memory state) resolved the issue immediately
Reproduction Steps
- Run multiple
openclaw cron list/rm CLI commands rapidly during high Gateway load
- Wait for CLI timeout (SIGKILL)
- Observe
diagnostic stuck session entries persisting in gateway.err.log
- Observe declining event loop health and rising CPU
Expected Behavior
- Stale session references should be cleaned from the in-memory state tracker within a reasonable timeout (e.g., 5 minutes)
- Session lifecycle cleanup should handle SIGKILL/timeout scenarios gracefully
- The diagnostic stuck session detector should not tight-loop on the same ghost session indefinitely
Actual Behavior
- Ghost session persisted 11.5+ hours until Gateway restart
- Sustained 48% CPU for 11+ hours
- CLI became non-functional
- No automatic recovery
Fix
Gateway restart cleared all in-memory state and restored normal operation (CPU dropped to normal, CLI responsive).
Suggested code fix:
- Add a max-age threshold for stuck session tracking (e.g., if a session has been "stuck" in processing state for > 10 minutes and doesn't exist in the session store, evict it from the state tracker)
- Add exponential backoff to the stuck-session checker to avoid tight-looping on the same stuck reference
- Ensure session cleanup on CLI timeout/SIGKILL scenarios covers both store AND in-memory state
Impact
- Severity: High (causes Gateway degradation, blocks agent operation)
- Frequency: Triggered 1x so far (correlated with CLI timeout during cron operations)
- Recovery: Only Gateway restart resolves
Summary
A zombie/stuck session (
sessionId=41befe2c-) persists in the Gateway's in-memory state machine for 11.5+ hours while absent from the session store. The diagnostic system repeatedly detects it as stuck, causing sustained 48% CPU, event loop P99 delays up to 12.8s, and CLI commands receiving SIGKILL.Environment
Symptoms
1. Stuck Session Ghost
~/.openclaw/agents/main/sessions/has no files matching*41befe2c*sessions_list({search:"41befe2c"})returns count=02. Event Loop Degradation
3. CLI Universal Timeout
openclaw cron list→ SIGKILLopenclaw cron rm→ SIGKILLopenclaw gateway status→ SIGKILLopenclaw sessions list→ hanging4. Exec Timeout Cascades
Subagents calling
exec({host:"node"})with sleep-based polling hit repeated timeouts, further blocking the event loop.Root Cause Analysis
The stuck session likely originates from repeated
openclaw cronCLI invocations that timed out during cron list/rm operations. Each CLI invocation creates a Gateway operator session. When multiple CLI calls timeout simultaneously, the session lifecycle cleanup may fail to remove the session from the Gateway's internal state tracking, even though the session store is properly cleaned.Key observations:
Reproduction Steps
openclaw cron list/rmCLI commands rapidly during high Gateway loaddiagnostic stuck sessionentries persisting in gateway.err.logExpected Behavior
Actual Behavior
Fix
Gateway restart cleared all in-memory state and restored normal operation (CPU dropped to normal, CLI responsive).
Suggested code fix:
Impact