Message runs interrupted by network errors are not retried, causing silent message loss #9208

@MuhsinunC

Summary

When a message processing run is interrupted by a network error (e.g., TLS connection failure), the run is silently dropped and never retried. The user's message goes unanswered with no notification or recovery.

Steps to Reproduce

  1. Send a message via Telegram (or any channel)
  2. While the agent is processing (making API calls to the model provider), a network error occurs
  3. The in-progress run is interrupted and lost
  4. The Telegram channel restarts, but the message is never retried

Observed Behavior

From gateway logs on 2026-02-05:

00:32:27 - embedded run start: runId=15f58535... sessionId=cdd2ca67... messageChannel=telegram
00:32:40 - embedded run start: runId=6db4c19d... sessionId=ba705afb... messageChannel=telegram
00:33:02 - [openclaw] Uncaught exception: TypeError: Cannot read properties of null (reading 'setSession')
           at TLSSocket.setSession (node:_tls_wrap:1132:16)
           at Object.connect (node:_tls_wrap:1826:13)
           at Client.connect (.../undici@7.20.0/node_modules/undici/lib/core/connect.js:70:20)
00:33:09 - [default] starting provider  (Telegram channel restarted)

Key observations:

  • Two message runs were in progress when the TLS error occurred
  • Neither run has a corresponding run_completed log entry
  • The Telegram channel restarted automatically
  • The two interrupted runs were never retried
  • User messages went unanswered for 10+ minutes, until the gateway was restarted manually

Expected Behavior

  1. When a run fails due to a transient network error, it should be retried (with exponential backoff)
  2. If the retries still fail after N attempts, the message should be moved to a dead-letter queue
  3. The user should receive a notification that their message couldn't be processed
  4. At minimum, a warning should be logged whenever a run ends without completing

Proposed Solutions

Option 1: Automatic Retry

  • Track in-progress runs with their original message payload
  • On network error, re-enqueue the message for retry
  • Use exponential backoff (e.g., 1s, 2s, 4s, max 3 retries)
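
A minimal sketch of Option 1, assuming the failure surfaces as a rejected promise from the run. The log above shows an uncaught exception instead, so the real fix may also need the payload tracked outside the run so it survives the crash. `runWithRetry`, `backoffDelayMs`, `isTransient`, and `sleep` are hypothetical names for illustration, not existing OpenClaw APIs.

```typescript
// Hypothetical sketch: none of these names come from the OpenClaw codebase.
interface RetryOptions {
  maxRetries: number; // retries allowed after the first attempt
  baseDelayMs: number; // delay before the first retry
  isTransient: (err: unknown) => boolean; // caller decides what is retryable
  sleep?: (ms: number) => Promise<void>; // injectable so tests need no real timers
}

// Delay before retry number `attempt` (0-based): base, 2*base, 4*base, ...
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * Math.pow(2, attempt);
}

async function runWithRetry<T>(
  run: () => Promise<T>,
  opts: RetryOptions
): Promise<T> {
  const sleep =
    opts.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  for (let attempt = 0; ; attempt++) {
    try {
      return await run();
    } catch (err) {
      // Give up on non-transient errors or once the retry budget is spent;
      // the caller can then move the message to a dead-letter queue.
      if (!opts.isTransient(err) || attempt >= opts.maxRetries) {
        throw err;
      }
      await sleep(backoffDelayMs(attempt, opts.baseDelayMs));
    }
  }
}
```

With `baseDelayMs: 1000` and `maxRetries: 3`, the waits are 1s, 2s, and 4s, matching the example schedule above. What counts as transient is left to the caller because the issue does not specify which errors should be retried.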

Option 2: Dead-Letter Queue

  • Failed messages are stored in a persistent queue
  • The agent can be configured to notify the user of failed messages
  • An admin can manually inspect or retry failed messages
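
One possible shape for Option 2, sketched in memory only. A real implementation would need to persist entries (to disk or a database) so they survive a gateway restart. `FailedMessage`, `DeadLetterQueue`, and `failureNotice` are hypothetical, and the field names are illustrative rather than taken from OpenClaw.

```typescript
// Hypothetical types; field names are illustrative, not OpenClaw's.
interface FailedMessage {
  runId: string;
  sessionId: string;
  channel: string;
  payload: string; // original user message, kept so it can be retried
  error: string; // last error seen for this run
  attempts: number;
  failedAt: number; // epoch milliseconds
}

// In-memory stand-in for the persistent queue described above.
class DeadLetterQueue {
  private entries: FailedMessage[] = [];

  add(msg: FailedMessage): void {
    this.entries.push(msg);
  }

  // Return a copy so callers can inspect entries without mutating the queue.
  list(): FailedMessage[] {
    return this.entries.slice();
  }

  // Remove and return an entry so an admin can re-enqueue it for a manual retry.
  take(runId: string): FailedMessage | undefined {
    const found = this.entries.filter((e) => e.runId === runId)[0];
    if (found !== undefined) {
      this.entries = this.entries.filter((e) => e !== found);
    }
    return found;
  }
}

// Text the agent could send back on the original channel.
function failureNotice(msg: FailedMessage): string {
  return (
    "Your message could not be processed after " +
    msg.attempts +
    " attempts. Please resend it."
  );
}
```

Storing the original payload alongside the error is what makes both the user notification and the manual retry possible from a single record.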

Option 3: Health-Check Recovery

  • Periodically check for runs that started but never completed
  • If a run is "stuck" for > N minutes, attempt recovery
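
Option 3 could be sketched as a small tracker that records when runs start and complete, which a periodic timer then polls. `RunTracker` and its methods are hypothetical names, and the threshold is left to the caller because the issue does not fix a value for N.

```typescript
// Hypothetical sketch: RunTracker is not an existing OpenClaw class.
interface TrackedRun {
  runId: string;
  startedAt: number; // epoch milliseconds
  completedAt?: number; // unset while the run is in progress
}

class RunTracker {
  private runs: TrackedRun[] = [];

  start(runId: string, nowMs: number): void {
    this.runs.push({ runId: runId, startedAt: nowMs });
  }

  complete(runId: string, nowMs: number): void {
    this.runs.forEach((r) => {
      if (r.runId === runId) {
        r.completedAt = nowMs;
      }
    });
  }

  // Runs that started more than maxAgeMs ago and never completed.
  stuck(nowMs: number, maxAgeMs: number): TrackedRun[] {
    return this.runs.filter(
      (r) => r.completedAt === undefined && nowMs - r.startedAt > maxAgeMs
    );
  }
}
```

A periodic check (for example via `setInterval`) could call `stuck(Date.now(), maxAgeMs)` and, at minimum, log a warning for each result before attempting recovery. Passing the clock in as a parameter keeps the check deterministic and easy to test.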

Environment

  • OpenClaw version: 2026.2.2
  • Platform: macOS (Darwin 25.2.0)
  • Node.js: 22.22.0
  • Channel: Telegram (polling mode)
  • Model provider: Custom Anthropic proxy

Workaround

Currently, the only workaround is to manually restart the gateway when messages go unanswered:

launchctl kickstart -k gui/$UID/ai.openclaw.gateway

Impact

  • Severity: High - users lose messages with no indication of failure
  • Frequency: Rare (TLS errors are uncommon), but impactful when they occur
  • User experience: Very poor - messages silently disappear

Labels: bug, stale
