Environment
- OpenClaw Version: v2026.4.26 (be8c246)
- Node: v22.22.2
- OS: Linux 6.17.0-22-generic (Ubuntu/x64)
- Channel: Feishu (WebSocket mode)
- Model: bailian/qwen3.6-plus
- Config: systemd user service with Restart=always
Issue 1: Agent Processing Lane Stalls Without Timeout Recovery
Symptom
The main agent session periodically enters a stuck session state where state=processing persists for 2-4+ minutes with queueDepth=1. During this time, the session cannot process new messages. The Gateway itself remains alive and other sessions work fine.
Reproduction
This is intermittent but has occurred 6 times today under various conditions:
| Time | Session Key | Duration | Trigger |
|---|---|---|---|
| 18:46 | agent:main:main | 241s | WS timeout + send data failed |
| 19:28 | agent:main:main | 256s | write EPIPE (dashboard WS disconnect) |
| 20:01 | feishu session | 136s | Heavy web_fetch (multiple concurrent page loads) |
| 20:08 | feishu session | 143→173s | Multiple concurrent gh CLI exec commands |
| 20:32 | feishu session | 163s | Post-restart cold start (first message after Gateway restart) |
| 20:40 | feishu session | 161s | Still in recovery from previous restart |
Diagnostic Log Evidence
{"subsystem":"diagnostic","message":"stuck session: sessionId=unknown sessionKey=agent:main:main state=processing age=241s queueDepth=1","time":"2026-04-28T18:46:20.982+08:00"}
{"subsystem":"diagnostic","message":"stuck session: sessionId=unknown sessionKey=agent:main:main state=processing age=256s queueDepth=1","time":"2026-04-28T19:28:22.921+08:00"}
Preceding errors (both cases):
[error]: [ '[ws]', 'write EPIPE' ]
[info]: [ 'ws', 'unable to connect to the server after trying 1 times")' ]
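For anyone correlating these entries against other logs, the stuck-session lines can be turned into structured records. The following is only a minimal sketch: it assumes the message keeps the `key=value` layout shown in the two log lines above, and `StuckSessionReport` and `parseStuckSession` are hypothetical names, not part of OpenClaw.

```typescript
// Fields carried in a "stuck session" diagnostic message (layout assumed
// from the two log lines quoted in this report).
interface StuckSessionReport {
  sessionId: string;
  sessionKey: string;
  state: string;
  ageSeconds: number;
  queueDepth: number;
}

// Parse one JSON log line. Returns null when the line is not JSON or is not
// a stuck-session report from the diagnostic subsystem.
function parseStuckSession(line: string): StuckSessionReport | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch (err) {
    return null; // e.g. the plain-text "[ws] write EPIPE" console lines
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const entry = parsed as { subsystem?: unknown; message?: unknown };
  const message = entry.message;
  const prefix = "stuck session: ";
  if (entry.subsystem !== "diagnostic" || typeof message !== "string") return null;
  if (message.indexOf(prefix) !== 0) return null;

  // Split the space-separated "key=value" pairs at the first "=" only,
  // so values such as "agent:main:main" stay intact.
  const fields: Record<string, string> = {};
  for (const pair of message.slice(prefix.length).split(/\s+/)) {
    const eq = pair.indexOf("=");
    if (eq > 0) fields[pair.slice(0, eq)] = pair.slice(eq + 1);
  }

  const ageMatch = /^(\d+)s$/.exec(fields["age"] || "");
  const queueDepth = parseInt(fields["queueDepth"] || "", 10);
  if (ageMatch === null || isNaN(queueDepth)) return null;

  return {
    sessionId: fields["sessionId"] || "unknown",
    sessionKey: fields["sessionKey"] || "",
    state: fields["state"] || "",
    ageSeconds: parseInt(ageMatch[1], 10),
    queueDepth: queueDepth,
  };
}
```

Feeding each line of the Gateway log through `parseStuckSession` and keeping the non-null results yields one record per watchdog warning, which makes it easier to line up stall ages with the WebSocket errors that precede them.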
Analysis
- Not a model issue: Occurs with both infini/minimax-m2.7 and bailian/qwen3.6-plus
- Not a network issue: Feishu WebSocket remains connected; messages are received but never dispatched
- Triggered by: Heavy concurrent tool use, WebSocket instability, post-restart cold start
- Root cause hypothesis: The agent processing lane lacks a timeout/failover mechanism. When a tool call or LLM request hangs (or the WebSocket connection to the internal control UI drops mid-stream), the lane remains in processing state indefinitely. The diagnostic watchdog detects it but does not automatically recover the stuck session.
Current Mitigation
The user must restart the Gateway via systemd (systemctl --user restart openclaw-gateway.service) to clear stuck sessions. This is disruptive because it drops all active connections.
Requested Fix
- Lane-level timeout: Add a configurable timeout for agent processing lanes (e.g., 60-120s). If a lane exceeds this, forcibly release it and mark the session as error/ready.
- Automatic recovery: When the diagnostic watchdog detects a stuck session, attempt automatic lane reset instead of just logging a warning.
- Graceful error injection: If the underlying cause is a WebSocket disconnect, inject a synthetic error into the agent turn so it can complete rather than hanging.
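To make the first two requests concrete, here is a rough sketch of what a watchdog sweep with a lane-level timeout could look like. It is not based on OpenClaw's internals: `Lane`, `LaneState`, and `releaseStuckLanes` are hypothetical names, and it assumes each lane records when its current turn started.

```typescript
type LaneState = "idle" | "processing" | "error";

// Hypothetical, simplified model of one agent processing lane.
interface Lane {
  sessionKey: string;
  state: LaneState;
  startedAtMs: number | null; // when the current turn began, if any
  lastError: string | null;
}

// Release every lane that has been in "processing" longer than timeoutMs.
// Returns the session keys that were reset so the caller can log them.
// The clock is passed in (nowMs) to keep the sweep deterministic.
function releaseStuckLanes(lanes: Lane[], nowMs: number, timeoutMs: number): string[] {
  const released: string[] = [];
  for (const lane of lanes) {
    if (lane.state !== "processing" || lane.startedAtMs === null) continue;
    const ageMs = nowMs - lane.startedAtMs;
    if (ageMs <= timeoutMs) continue;

    // Record a synthetic error so the pending turn can end instead of hanging.
    lane.state = "error";
    lane.lastError = "lane timeout after " + Math.round(ageMs / 1000) + "s";
    lane.startedAtMs = null;
    released.push(lane.sessionKey);
  }
  return released;
}
```

With a 120s timeout, a sweep run at the moment of the 18:46 warning (age=241s) would have released agent:main:main roughly two minutes earlier, without a Gateway restart. Whether a released lane should go to an error state or straight back to ready is a design choice left open here.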
Note on Related Issues
I found related but distinct issues:
This issue is broader: any long-running lane operation can stall the session, and there is no automatic recovery path.
Issue 2: Memory-Core Dreaming Cron Fails to Register on Gateway Restart
Symptom
memory-core: managed dreaming cron could not be reconciled (cron service unavailable).
This occurs during Gateway startup/restart. The memory-core plugin attempts to register its managed dreaming cron jobs, but the OpenClaw cron service is not yet ready.
Reproduction
- Restart Gateway: systemctl --user restart openclaw-gateway.service
- Check logs ~7 minutes later for the warning
Analysis
This is a race condition between plugin initialization and cron service startup:
- memory-core plugin initializes during Gateway boot and immediately attempts to reconcile managed cron jobs
- The cron service takes longer to become fully available
- The plugin's initial registration attempt fails
- Even though cron becomes available later, the plugin does not appear to retry
Impact
The dreaming system (automatic memory consolidation at 3:00 AM) does not run after a Gateway restart. Manual cron jobs created by the user still work fine.
Requested Fix
The memory-core plugin should either:
- Delay cron registration until the cron service reports ready, or
- Implement a retry/backoff mechanism for cron job reconciliation
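The retry/backoff option could be as small as the sketch below. It makes several assumptions that may not match the plugin: reconciliation is modelled as a synchronous call returning true on success, the scheduler is injected (in a real Gateway it would presumably wrap `setTimeout`), and `backoffDelays` and `reconcileWithRetry` are hypothetical names.

```typescript
// Exponential backoff schedule: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelays(baseMs: number, maxMs: number, attempts: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(maxMs, baseMs * Math.pow(2, i)));
  }
  return delays;
}

// Try `reconcile` now; if it fails, schedule another attempt after the next
// delay in the list. Stops on the first success or when delays run out.
function reconcileWithRetry(
  reconcile: () => boolean, // true once the cron service accepted the jobs
  schedule: (fn: () => void, delayMs: number) => void, // injected timer
  delays: number[],
  attempt: number = 0,
): void {
  if (reconcile()) return;
  if (attempt >= delays.length) return; // schedule exhausted; give up
  schedule(function () {
    reconcileWithRetry(reconcile, schedule, delays, attempt + 1);
  }, delays[attempt]);
}
```

For example, `backoffDelays(1000, 30000, 6)` yields waits of 1s, 2s, 4s, 8s, 16s, and 30s, about a minute of retries in total. The delay values are placeholders, since the report does not establish how long the cron service takes to become ready.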
Workaround Applied
Reduced agents.defaults.compaction.timeoutSeconds from 900 to 300 to limit the maximum stall duration (Issue 1), but this does not address the root cause.