[Bug]: Agent processing lane can stall for minutes without timeout recovery, plus memory-core dreaming cron race condition on Gateway restart #73581

@WS-Q0758

Environment

  • OpenClaw Version: v2026.4.26 (be8c246)
  • Node: v22.22.2
  • OS: Linux 6.17.0-22-generic (Ubuntu/x64)
  • Channel: Feishu (WebSocket mode)
  • Model: bailian/qwen3.6-plus
  • Config: systemd user service with Restart=always

Issue 1: Agent Processing Lane Stalls Without Timeout Recovery

Symptom

The main agent session periodically gets stuck: state=processing persists for roughly 2 to 4+ minutes with queueDepth=1. While stuck, the session cannot process new messages. The Gateway itself remains alive and other sessions work fine.

Reproduction

This is intermittent, but it occurred 6 times on 2026-04-28 under various conditions:

| Time | Session Key | Duration | Trigger |
| --- | --- | --- | --- |
| 18:46 | agent:main:main | 241s | WS timeout + send data failed |
| 19:28 | agent:main:main | 256s | write EPIPE (dashboard WS disconnect) |
| 20:01 | feishu session | 136s | Heavy web_fetch (multiple concurrent page loads) |
| 20:08 | feishu session | 143→173s | Multiple concurrent gh CLI exec commands |
| 20:32 | feishu session | 163s | Post-restart cold start (first message after Gateway restart) |
| 20:40 | feishu session | 161s | Still in recovery from previous restart |

Diagnostic Log Evidence

{"subsystem":"diagnostic","message":"stuck session: sessionId=unknown sessionKey=agent:main:main state=processing age=241s queueDepth=1","time":"2026-04-28T18:46:20.982+08:00"}
{"subsystem":"diagnostic","message":"stuck session: sessionId=unknown sessionKey=agent:main:main state=processing age=256s queueDepth=1","time":"2026-04-28T19:28:22.921+08:00"}

Preceding errors (both cases):

[error]: [ '[ws]', 'write EPIPE' ]
[info]: [ 'ws', 'unable to connect to the server after trying 1 times")' ]
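
For triage, the stuck session entries above can be pulled out of the Gateway log mechanically. The following is a minimal sketch that assumes each log entry is a single JSON line shaped like the two samples; `parseStuckSession` and `StuckSession` are hypothetical names, not part of OpenClaw.

```typescript
// Hypothetical helper: extracts the fields of a "stuck session" diagnostic entry.
// Assumes one JSON object per log line, formatted like the samples in this report.
interface StuckSession {
  sessionId: string;
  sessionKey: string;
  state: string;
  ageSeconds: number;
  queueDepth: number;
  time: string;
}

const STUCK_PATTERN =
  /^stuck session: sessionId=(\S+) sessionKey=(\S+) state=(\S+) age=(\d+)s queueDepth=(\d+)$/;

function parseStuckSession(line: string): StuckSession | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null; // not a JSON log line
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const entry = parsed as { subsystem?: unknown; message?: unknown; time?: unknown };
  if (entry.subsystem !== "diagnostic" || typeof entry.message !== "string") return null;

  const match = STUCK_PATTERN.exec(entry.message);
  if (!match) return null;

  return {
    sessionId: match[1],
    sessionKey: match[2],
    state: match[3],
    ageSeconds: Number(match[4]),
    queueDepth: Number(match[5]),
    time: typeof entry.time === "string" ? entry.time : "",
  };
}
```

Filtering a log file through this helper gives the per-session stall ages without reading the raw entries by hand.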

Analysis

  • Not a model issue: Occurs with both infini/minimax-m2.7 and bailian/qwen3.6-plus
  • Not a network issue: Feishu WebSocket remains connected; messages are received but never dispatched
  • Triggered by: Heavy concurrent tool use, WebSocket instability, post-restart cold start
  • Root cause hypothesis: The agent processing lane lacks a timeout/failover mechanism. When a tool call or LLM request hangs (or the WebSocket connection to the internal control UI drops mid-stream), the lane remains in processing state indefinitely. The diagnostic watchdog detects it but does not automatically recover the stuck session.

Current Mitigation

The user must restart the Gateway via systemd (systemctl --user restart openclaw-gateway.service) to clear stuck sessions. This is disruptive because it drops all active connections.

Requested Fix

  1. Lane-level timeout: Add a configurable timeout for agent processing lanes (e.g., 60-120s). If a lane exceeds this, forcibly release it and mark the session as error/ready.
  2. Automatic recovery: When the diagnostic watchdog detects a stuck session, attempt automatic lane reset instead of just logging a warning.
  3. Graceful error injection: If the underlying cause is a WebSocket disconnect, inject a synthetic error into the agent turn so it can complete rather than hanging.
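
To make the lane-level timeout request concrete, here is a minimal sketch of what such a guard could look like. It is an illustration only and is not based on OpenClaw's internals: `runWithLaneTimeout`, `LaneTimeoutError`, and the idea that lane work accepts an `AbortSignal` are all assumptions.

```typescript
// Hypothetical sketch: none of these names come from the OpenClaw codebase.
class LaneTimeoutError extends Error {
  sessionKey: string;
  timeoutMs: number;

  constructor(sessionKey: string, timeoutMs: number) {
    super(`lane for ${sessionKey} exceeded ${timeoutMs}ms`);
    this.name = "LaneTimeoutError";
    this.sessionKey = sessionKey;
    this.timeoutMs = timeoutMs;
  }
}

// Runs one unit of lane work, rejecting with LaneTimeoutError if it takes
// longer than timeoutMs. The AbortSignal lets the work cancel its own I/O.
async function runWithLaneTimeout<T>(
  sessionKey: string,
  timeoutMs: number,
  work: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // ask the hung tool call or LLM request to stop
      reject(new LaneTimeoutError(sessionKey, timeoutMs));
    }, timeoutMs);
  });

  try {
    return await Promise.race([work(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // do not leave a pending timer after fast work
  }
}
```

A caller could catch `LaneTimeoutError`, mark the session as error/ready, and surface a synthetic error to the agent turn. That covers items 1 and 3 together, and the watchdog in item 2 could call the same release path.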

Note on Related Issues

I found related but distinct issues:

This issue is broader: any long-running lane operation can stall the session, and there is no automatic recovery path.


Issue 2: Memory-Core Dreaming Cron Fails to Register on Gateway Restart

Symptom

memory-core: managed dreaming cron could not be reconciled (cron service unavailable).

This occurs during Gateway startup/restart. The memory-core plugin attempts to register its managed dreaming cron jobs, but the OpenClaw cron service is not yet ready.

Reproduction

  1. Restart Gateway: systemctl --user restart openclaw-gateway.service
  2. Check logs ~7 minutes later for the warning

Analysis

This is a race condition between plugin initialization and cron service startup:

  • memory-core plugin initializes during Gateway boot and immediately attempts to reconcile managed cron jobs
  • The cron service takes longer to become fully available
  • The plugin's initial registration attempt fails
  • Even though cron becomes available later, the plugin does not appear to retry

Impact

The dreaming system (automatic memory consolidation at 3:00 AM) does not run after a Gateway restart. Manual cron jobs created by the user still work fine.

Requested Fix

The memory-core plugin should either:

  1. Delay cron registration until the cron service reports ready, or
  2. Implement a retry/backoff mechanism for cron job reconciliation
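
As an illustration of option 2, the sketch below retries reconciliation with capped exponential backoff. It assumes nothing about the real plugin API: `reconcileWithBackoff` is a hypothetical name, and `reconcile` stands in for whatever memory-core calls to register its managed cron jobs.

```typescript
// Hypothetical sketch: `reconcile` is a placeholder for the plugin's real
// cron reconciliation call, which is not shown in this report.
interface BackoffOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

// Retries `reconcile` until it succeeds or maxAttempts is reached.
// Returns the attempt number that succeeded; rethrows the last error otherwise.
async function reconcileWithBackoff(
  reconcile: () => Promise<void>,
  opts: BackoffOptions,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<number> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      await reconcile();
      return attempt;
    } catch (err) {
      lastError = err;
      if (attempt === opts.maxAttempts) break;
      // Delay doubles each attempt: base, 2x base, 4x base, ... capped at maxDelayMs.
      const delay = Math.min(opts.baseDelayMs * 2 ** (attempt - 1), opts.maxDelayMs);
      await sleep(delay);
    }
  }
  throw lastError;
}
```

The injectable `sleep` parameter is there so the retry schedule can be tested without real waiting. In production the default timer-based sleep would be used.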

Workaround Applied

Reduced agents.defaults.compaction.timeoutSeconds from 900 to 300 to limit the maximum stall duration described in Issue 1, but this does not address the root cause.
