[Bug]: Agent processing lane can stall for minutes without timeout recovery, plus memory-core dreaming cron race condition on Gateway restart #73581

## Environment

- **OpenClaw Version**: v2026.4.26 (be8c246)
- **Node**: v22.22.2
- **OS**: Linux 6.17.0-22-generic (Ubuntu/x64)
- **Channel**: Feishu (WebSocket mode)
- **Model**: bailian/qwen3.6-plus
- **Config**: systemd user service with `Restart=always`

---

## Issue 1: Agent Processing Lane Stalls Without Timeout Recovery

### Symptom

The main agent session periodically enters a `stuck session` state: `state=processing` persists for 2 to 4+ minutes with `queueDepth=1`, and the session cannot process new messages during that time. The Gateway itself stays alive and other sessions continue to work.

### Reproduction

The stall is intermittent, but it occurred **6 times on 2026-04-28** (times below are UTC+08:00) under various conditions:

| Time | Session Key | Duration | Trigger |
|------|-------------|----------|---------|
| 18:46 | `agent:main:main` | 241s | WS timeout + send data failed |
| 19:28 | `agent:main:main` | 256s | write EPIPE (dashboard WS disconnect) |
| 20:01 | feishu session | 136s | Heavy web_fetch (multiple concurrent page loads) |
| 20:08 | feishu session | 143→173s | Multiple concurrent `gh` CLI exec commands |
| 20:32 | feishu session | 163s | Post-restart cold start (first message after Gateway restart) |
| 20:40 | feishu session | 161s | Still in recovery from previous restart |

### Diagnostic Log Evidence

```json
{"subsystem":"diagnostic","message":"stuck session: sessionId=unknown sessionKey=agent:main:main state=processing age=241s queueDepth=1","time":"2026-04-28T18:46:20.982+08:00"}
{"subsystem":"diagnostic","message":"stuck session: sessionId=unknown sessionKey=agent:main:main state=processing age=256s queueDepth=1","time":"2026-04-28T19:28:22.921+08:00"}
```

Errors logged just before the stall in both cases:

```
[error]: [ '[ws]', 'write EPIPE' ]
[info]: [ 'ws', 'unable to connect to the server after trying 1 times")' ]
```
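To track stalls over time, the `stuck session` diagnostic lines can be parsed into structured records. The following is a minimal sketch that assumes the message format seen in the two log lines above; the `StuckSession` type and `parseStuckSession` function are hypothetical helpers, not part of OpenClaw.

```typescript
// Fields carried in a "stuck session" diagnostic message (format assumed from the logs above).
interface StuckSession {
  sessionId: string;
  sessionKey: string;
  state: string;
  ageSeconds: number;
  queueDepth: number;
}

// Parse one log line. Returns null for anything that is not a stuck-session diagnostic.
function parseStuckSession(line: string): StuckSession | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null; // not JSON, e.g. the plain-text ws error lines
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const entry = parsed as { subsystem?: unknown; message?: unknown };
  if (entry.subsystem !== "diagnostic" || typeof entry.message !== "string") return null;

  const pattern =
    /^stuck session: sessionId=(\S+) sessionKey=(\S+) state=(\S+) age=(\d+)s queueDepth=(\d+)$/;
  const m = pattern.exec(entry.message);
  if (!m) return null;

  return {
    sessionId: m[1],
    sessionKey: m[2],
    state: m[3],
    ageSeconds: Number(m[4]),
    queueDepth: Number(m[5]),
  };
}
```

Feeding each Gateway log line through `parseStuckSession` and keeping the non-null results gives a list that can be grouped by `sessionKey` to see which lanes stall most often.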

### Analysis

- **Not a model issue**: Occurs with both `infini/minimax-m2.7` and `bailian/qwen3.6-plus`
- **Not a network issue**: Feishu WebSocket remains connected; messages are received but never dispatched
- **Triggered by**: Heavy concurrent tool use, WebSocket instability, post-restart cold start
- **Root cause hypothesis**: The agent processing lane lacks a timeout/failover mechanism. When a tool call or LLM request hangs (or the WebSocket connection to the internal control UI drops mid-stream), the lane remains in `processing` state indefinitely. The diagnostic watchdog detects it but **does not automatically recover** the stuck session.

### Current Mitigation

The user must restart the Gateway via systemd (`systemctl --user restart openclaw-gateway.service`) to clear stuck sessions. This is disruptive because it drops all active connections.

### Requested Fix

1. **Lane-level timeout**: Add a configurable timeout for agent processing lanes (e.g., 60-120s). If a lane exceeds this, forcibly release it and mark the session as error/ready.
2. **Automatic recovery**: When the diagnostic watchdog detects a stuck session, attempt automatic lane reset instead of just logging a warning.
3. **Graceful error injection**: If the underlying cause is a WebSocket disconnect, inject a synthetic error into the agent turn so it can complete rather than hanging.
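
Requests 1 and 2 could be combined in a small watchdog that tracks when each lane started processing and releases any lane that exceeds a configurable timeout. This is only a sketch of the idea: `LaneWatchdog`, `begin`, `finish`, `sweep`, and `stateOf` are hypothetical names and do not reflect OpenClaw's internal lane API. Timestamps are passed in explicitly so the logic can be exercised without real timers.

```typescript
type LaneState = "idle" | "processing" | "error";

interface Lane {
  state: LaneState;
  startedAtMs: number; // when the current turn began; only meaningful while processing
}

// Hypothetical watchdog: not OpenClaw's actual implementation.
class LaneWatchdog {
  private readonly timeoutMs: number;
  private readonly lanes = new Map<string, Lane>();

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Record that a lane has started an agent turn.
  begin(sessionKey: string, nowMs: number): void {
    this.lanes.set(sessionKey, { state: "processing", startedAtMs: nowMs });
  }

  // Record that the turn finished normally.
  finish(sessionKey: string): void {
    const lane = this.lanes.get(sessionKey);
    if (lane) lane.state = "idle";
  }

  // Mark every lane that has been processing longer than timeoutMs as errored
  // and return their keys, so the caller can abort the hung work and inject
  // a synthetic error into the turn.
  sweep(nowMs: number): string[] {
    const released: string[] = [];
    this.lanes.forEach((lane, key) => {
      if (lane.state === "processing" && nowMs - lane.startedAtMs > this.timeoutMs) {
        lane.state = "error";
        released.push(key);
      }
    });
    return released;
  }

  stateOf(sessionKey: string): LaneState | undefined {
    return this.lanes.get(sessionKey)?.state;
  }
}
```

With a 90-second timeout, a lane that began at `t=0` would be left alone by a sweep at 60s and released by a sweep at 241s, which is the age reported in the first diagnostic line. A real implementation would also need to cancel the underlying tool call or LLM request; that part is out of scope for this sketch.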

### Note on Related Issues

I found related but distinct issues:
- #53008: Compaction blocking main lane — different trigger (compaction), same symptom
- #68649: PDF tool hanging — tool-specific, not general lane stall
- #53889: Session deadlock with dangling toolCall — specific to toolCall/result mismatch
- #72810: Discord session routable after timeout — channel-specific

This issue is broader: **any long-running lane operation can stall the session**, and there is no automatic recovery path.

---

## Issue 2: Memory-Core Dreaming Cron Fails to Register on Gateway Restart

### Symptom

```
memory-core: managed dreaming cron could not be reconciled (cron service unavailable).
```

This occurs during Gateway startup/restart. The `memory-core` plugin attempts to register its managed dreaming cron jobs, but the OpenClaw cron service is not yet ready.

### Reproduction

1. Restart Gateway: `systemctl --user restart openclaw-gateway.service`
2. Check the logs about 7 minutes later for the warning shown under Symptom

### Analysis

This is a **race condition** between plugin initialization and cron service startup:

- `memory-core` plugin initializes during Gateway boot and immediately attempts to reconcile managed cron jobs
- The cron service takes longer to become fully available
- The plugin's initial registration attempt fails
- Even though cron becomes available later, the plugin does not appear to retry

### Impact

The dreaming system (automatic memory consolidation at 3:00 AM) does not run after a Gateway restart. Manual cron jobs created by the user still work fine.

### Requested Fix

The `memory-core` plugin should either:
1. Delay cron registration until the cron service reports ready, or
2. Implement a retry/backoff mechanism for cron job reconciliation
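
Option 2 could look roughly like the sketch below: retry the reconciliation with capped exponential backoff until the cron service is available. All names here (`backoffDelaysMs`, `reconcileWithRetry`, and the `reconcile` callback) are hypothetical; the actual `memory-core` reconciliation function and its error behaviour are not known from the logs.

```typescript
// Delay before each retry: baseMs * factor^i, capped at maxMs.
function backoffDelaysMs(retries: number, baseMs: number, factor: number, maxMs: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) {
    delays.push(Math.min(maxMs, baseMs * Math.pow(factor, i)));
  }
  return delays;
}

// Calls `reconcile` once, then once more after each delay, until it succeeds.
// Resolves with the number of attempts used; rejects with the last error if
// every attempt fails. `reconcile` stands in for the plugin's cron
// reconciliation step and is assumed to reject while cron is unavailable.
async function reconcileWithRetry(
  reconcile: () => Promise<void>,
  delaysMs: number[],
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<number> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= delaysMs.length; attempt++) {
    try {
      await reconcile();
      return attempt + 1;
    } catch (err) {
      lastError = err;
      if (attempt < delaysMs.length) {
        await sleep(delaysMs[attempt]);
      }
    }
  }
  throw lastError;
}
```

For example, `backoffDelaysMs(5, 1000, 2, 10000)` yields delays of 1s, 2s, 4s, 8s, and 10s. The `sleep` parameter is injectable so the retry loop can be tested without waiting on real timers.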

---

## Workaround Applied

Reduced `agents.defaults.compaction.timeoutSeconds` from `900` to `300` to limit the maximum stall duration, but this does not address the root cause.
