Stuck session ghost blocks event loop, causes CLI timeout and sustained high CPU (11.5hr+) #77115

## Summary

A zombie/stuck session (`sessionId=41befe2c-`) persisted in the Gateway's in-memory state machine for 11.5+ hours while absent from the session store. The diagnostic system repeatedly flagged it as stuck, causing sustained 48% CPU, event loop P99 delays of up to 12.8s, and CLI commands being killed with SIGKILL.

## Environment

- **OpenClaw version**: 2026.4.27 (cbc2ba0)
- **Gateway**: macOS (Darwin 21.6.0, x64), node v22.22.0, port 18789
- **Model**: deepseek/deepseek-v4-pro (main), minimax-portal/MiniMax-M2.7 (subagents)
- **Setup**: MCP servers (playwright, qwen, kimi, glm, doubao x5 all stdio), Ubuntu node (192.168.1.18)

## Symptoms

### 1. Stuck Session Ghost
```
[diagnostic] stuck session: sessionId=41befe2c-8c01-4f75-a662-2d47eb34d7bc
  sessionKey=agent:main:main state=processing age=41514s queueDepth=0
```
- First detected: ~10:00 AM 2026-05-04
- Age at first detection: 11,514 seconds (~3.2 hours); the log line above was captured later, at age=41514s (~11.5 hours)
- state=processing, queueDepth oscillates 0→1→0
- **NOT in session store** — `~/.openclaw/agents/main/sessions/` has no files matching `*41befe2c*`
- **NOT in sessions_list** — `sessions_list({search:"41befe2c"})` returns count=0
- **NOT in sessions.json** — no entry in store metadata
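
The three checks above amount to comparing the IDs the Gateway still tracks against what the store contains. A minimal sketch of that comparison follows; `findGhostSessions` and the `.jsonl` file name in the usage note are hypothetical, and it assumes store file names embed the session ID, as the `*41befe2c*` glob implies.

```typescript
// Returns the tracked session IDs that have no matching file name in the store.
// Assumption: a store file name contains the session ID as a substring.
function findGhostSessions(trackedIds: string[], storeFileNames: string[]): string[] {
  return trackedIds.filter(
    (id) => !storeFileNames.some((name) => name.includes(id)),
  );
}
```

In practice `storeFileNames` could come from `fs.readdirSync` on `~/.openclaw/agents/main/sessions/`. For example, `findGhostSessions(["41befe2c-8c01-4f75-a662-2d47eb34d7bc"], ["some-other-session.jsonl"])` returns the ghost ID.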

### 2. Event Loop Degradation
```
[diagnostic] liveness warning: eventLoopDelayP99Ms=12876.5 eventLoopUtilization=0.998
  cpuCoreRatio=1.289 active=1 waiting=0 queued=5
```
- P99 event loop delay spiked to 12,876ms (12.8 seconds)
- Event loop utilization hitting 0.998 (99.8%)
- CPU core ratio at 1.289 with 5 queued requests
- Gateway process (PID 23979): 48.7% sustained CPU, 1.5GB RSS
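
For context on how figures like `eventLoopDelayP99Ms` and `eventLoopUtilization` can be obtained, the sketch below uses Node's built-in `perf_hooks` module. It is not the Gateway's actual diagnostic code: the `LivenessSample` type, the function names, and both thresholds are assumptions chosen for illustration.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

interface LivenessSample {
  eventLoopDelayP99Ms: number;
  eventLoopUtilization: number; // 0..1
}

// Hypothetical thresholds; the Gateway's real limits are not stated in this report.
const MAX_P99_DELAY_MS = 1000;
const MAX_UTILIZATION = 0.95;

// True when either metric exceeds its threshold.
function isDegraded(sample: LivenessSample): boolean {
  return (
    sample.eventLoopDelayP99Ms > MAX_P99_DELAY_MS ||
    sample.eventLoopUtilization > MAX_UTILIZATION
  );
}

// Starts sampling and returns a function that reads the metrics accumulated
// since the previous read.
function startLivenessProbe(resolutionMs = 20): () => LivenessSample {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  let previous = performance.eventLoopUtilization();
  return () => {
    const current = performance.eventLoopUtilization();
    // Passing two readings returns the utilization between them.
    const delta = performance.eventLoopUtilization(current, previous);
    previous = current;
    const sample: LivenessSample = {
      eventLoopDelayP99Ms: histogram.percentile(99) / 1e6, // histogram values are nanoseconds
      eventLoopUtilization: delta.utilization,
    };
    histogram.reset();
    return sample;
  };
}
```

With these thresholds, the values from the log above (`eventLoopDelayP99Ms=12876.5`, `eventLoopUtilization=0.998`) would be classified as degraded on both counts.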

### 3. CLI Universal Timeout
- `openclaw cron list` → SIGKILL
- `openclaw cron rm` → SIGKILL
- `openclaw gateway status` → SIGKILL
- `openclaw sessions list` → hangs
- All CLI commands became non-functional

### 4. Exec Timeout Cascades
```
[tools] exec failed: TIMEOUT: node invoke timed out
  raw_params={"command":"sleep 300 && cat /tmp/...progress.json","host":"node","timeout":310}
```
Subagents calling `exec({host:"node"})` with sleep-based polling hit repeated timeouts, further blocking the event loop.

## Root Cause Analysis

The stuck session likely originates from repeated `openclaw cron` CLI invocations (`cron list` and `cron rm`) that timed out. Each CLI invocation creates a Gateway operator session. When multiple CLI calls time out simultaneously, session lifecycle cleanup may fail to remove the session from the Gateway's internal state tracking, even though the session store is cleaned correctly.

**Key observations:**
- Session exists only in memory (diagnostic system) but not on disk (store)
- The diagnostic stuck session checker runs every 30 seconds and detects it, likely doing a tight-loop scan each time
- A Gateway restart (which clears in-memory state) resolved the issue immediately

## Reproduction Steps

1. Run multiple `openclaw cron list` and `openclaw cron rm` CLI commands in rapid succession while the Gateway is under high load
2. Wait for CLI timeout (SIGKILL)
3. Observe `diagnostic stuck session` entries persisting in gateway.err.log
4. Observe declining event loop health and rising CPU

## Expected Behavior

- Stale session references should be cleaned from the in-memory state tracker within a reasonable timeout (e.g., 5 minutes)
- Session lifecycle cleanup should handle SIGKILL/timeout scenarios gracefully
- The diagnostic stuck session detector should not tight-loop on the same ghost session indefinitely
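
One way to get the second behavior is to release both records in a `finally` block, so every exit path runs the same cleanup. The sketch below assumes the Gateway wraps each operator session in a function of this shape; `Tracker`, `Store`, `releaseSession`, and `runOperatorSession` are hypothetical names, and plain `Map`s stand in for the real tracker and store.

```typescript
interface TrackedSession {
  state: "processing";
  startedAtMs: number;
}

// Hypothetical stand-ins for the Gateway's in-memory tracker and session store.
type Tracker = Map<string, TrackedSession>;
type Store = Map<string, unknown>;

// Idempotent: safe to call again if another cleanup path already ran.
function releaseSession(id: string, tracker: Tracker, store: Store): void {
  store.delete(id);
  tracker.delete(id);
}

// Runs `work` for one operator session and releases both records when `work`
// settles, whether it succeeds or throws. On timeout the signal is aborted;
// cleanup then runs once `work` reacts to the abort and settles.
async function runOperatorSession<T>(
  id: string,
  tracker: Tracker,
  store: Store,
  work: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  tracker.set(id, { state: "processing", startedAtMs: Date.now() });
  store.set(id, {});
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await work(controller.signal);
  } finally {
    clearTimeout(timer);
    releaseSession(id, tracker, store);
  }
}
```

A SIGKILL delivered to the CLI process cannot be observed by the CLI itself, so the Gateway would need to treat the dropped client connection as the abort trigger. Work that ignores the signal would still leak, which is why a max-age eviction remains useful as a backstop.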

## Actual Behavior

- Ghost session persisted 11.5+ hours until Gateway restart
- Sustained 48% CPU for 11+ hours
- CLI became non-functional
- No automatic recovery

## Fix

Gateway restart cleared all in-memory state and restored normal operation (CPU dropped to normal, CLI responsive).

**Suggested code fix:**
1. Add a max-age threshold for stuck session tracking (e.g., if a session has been "stuck" in processing state for > 10 minutes and doesn't exist in the session store, evict it from the state tracker)
2. Add exponential backoff to the stuck-session checker to avoid tight-looping on the same stuck reference
3. Ensure session cleanup on CLI timeout/SIGKILL scenarios covers both store AND in-memory state
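
Suggestions 1 and 2 can be sketched together as a single checker pass. This is an illustration under stated assumptions, not the Gateway's code: the `TrackedSession` shape, `checkStuckSessions`, the `existsInStore` callback, and every constant except the 30-second cadence and the 10-minute threshold mentioned in this report are hypothetical.

```typescript
interface TrackedSession {
  state: "idle" | "processing";
  startedAtMs: number;
  nextCheckAtMs: number;   // earliest time the checker may report this session again
  checkIntervalMs: number; // current backoff interval
}

const STUCK_AFTER_MS = 2 * 60_000;    // hypothetical: when "processing" counts as stuck
const MAX_GHOST_AGE_MS = 10 * 60_000; // the 10-minute threshold from suggestion 1
const MAX_INTERVAL_MS = 10 * 60_000;  // hypothetical cap on the backoff interval

function checkStuckSessions(
  tracker: Map<string, TrackedSession>,
  existsInStore: (id: string) => boolean,
  nowMs: number,
): { evicted: string[]; reported: string[] } {
  const evicted: string[] = [];
  const reported: string[] = [];
  tracker.forEach((session, id) => {
    if (session.state !== "processing") return;
    if (nowMs < session.nextCheckAtMs) return; // still backing off
    const ageMs = nowMs - session.startedAtMs;
    if (ageMs < STUCK_AFTER_MS) return;
    // Suggestion 1: old and absent from the store means ghost, so evict it.
    if (ageMs > MAX_GHOST_AGE_MS && !existsInStore(id)) {
      tracker.delete(id);
      evicted.push(id);
      return;
    }
    // Suggestion 2: report, then double the wait before the next report.
    reported.push(id);
    session.checkIntervalMs = Math.min(session.checkIntervalMs * 2, MAX_INTERVAL_MS);
    session.nextCheckAtMs = nowMs + session.checkIntervalMs;
  });
  return { evicted, reported };
}
```

A session created with `checkIntervalMs: 30_000` (the cadence reported above) would be re-reported after 60 s, then 120 s, and so on up to the cap, while a ghost older than 10 minutes is dropped on the first pass that finds it missing from the store.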

## Impact

- **Severity**: High (causes Gateway degradation, blocks agent operation)
- **Frequency**: Triggered 1x so far (correlated with CLI timeout during cron operations)
- **Recovery**: Only a Gateway restart resolves it

