Closed as not planned
Labels
- bug (Something isn't working)
- stale (Marked as stale due to inactivity)
Description
Summary
When a message processing run is interrupted by a network error (e.g., TLS connection failure), the run is silently dropped and never retried. The user's message goes unanswered with no notification or recovery.
Steps to Reproduce
- Send a message via Telegram (or any channel)
- While the agent is processing (making API calls to the model provider), a network error occurs
- The in-progress run is interrupted and lost
- The Telegram channel restarts, but the message is never retried
Observed Behavior
From gateway logs on 2026-02-05:
00:32:27 - embedded run start: runId=15f58535... sessionId=cdd2ca67... messageChannel=telegram
00:32:40 - embedded run start: runId=6db4c19d... sessionId=ba705afb... messageChannel=telegram
00:33:02 - [openclaw] Uncaught exception: TypeError: Cannot read properties of null (reading 'setSession')
at TLSSocket.setSession (node:_tls_wrap:1132:16)
at Object.connect (node:_tls_wrap:1826:13)
at Client.connect (.../undici@7.20.0/node_modules/undici/lib/core/connect.js:70:20)
00:33:09 - [default] starting provider (Telegram channel restarted)
Key observations:
- Two message runs were in progress when the TLS error occurred
- Neither run has a corresponding run_completed log entry
- The Telegram channel restarted automatically
- But the two interrupted runs were never retried
- User messages went unanswered for 10+ minutes until manual gateway restart
Expected Behavior
- When a run fails due to a transient network error, it should be retried (with exponential backoff)
- If retry fails after N attempts, the message should be moved to a dead-letter queue
- User should receive a notification that their message couldn't be processed
- At minimum, a warning should be logged when runs fail without completion
Proposed Solutions
Option 1: Automatic Retry
- Track in-progress runs with their original message payload
- On network error, re-enqueue the message for retry
- Use exponential backoff (e.g., 1s, 2s, 4s, max 3 retries)
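The retry schedule above (1s, 2s, 4s, at most 3 retries) could be sketched roughly as follows. This is a minimal illustration, not OpenClaw's implementation: `PendingRun`, `backoffDelayMs`, and `runWithRetry` are hypothetical names, and the sketch assumes the network failure surfaces as a rejected promise from the handler. The crash in the logs was an uncaught exception, so a real fix would also need to route that error back to the owning run. For brevity the sketch retries on every error; a real version should retry only errors it classifies as transient.

```typescript
// Hypothetical shape of a tracked run; not taken from the OpenClaw codebase.
interface PendingRun {
  runId: string;
  payload: string; // original message payload, kept so the run can be repeated
}

// Delay before retry number `attempt` (1-based): 1000, 2000, 4000, ...
function backoffDelayMs(attempt: number, baseMs: number = 1000): number {
  return baseMs * Math.pow(2, attempt - 1);
}

type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `handler` once, then retries up to `maxRetries` times with backoff.
// Rethrows the last error when every attempt has failed.
async function runWithRetry<T>(
  run: PendingRun,
  handler: (run: PendingRun) => Promise<T>,
  maxRetries: number = 3,
  sleep: Sleep = realSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (attempt > 0) {
      await sleep(backoffDelayMs(attempt));
    }
    try {
      return await handler(run);
    } catch (err) {
      lastError = err;
      console.warn(`run ${run.runId} failed on attempt ${attempt + 1}`, err);
    }
  }
  throw lastError;
}
```

The `sleep` parameter is injectable so the schedule can be exercised without real delays. When `runWithRetry` finally throws, the caller is the natural place to hand the message to a dead-letter queue.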
Option 2: Dead-Letter Queue
- Failed messages are stored in a persistent queue
- Agent can be configured to notify user of failed messages
- Admin can manually retry or inspect failed messages
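A dead-letter queue along these lines might look like the sketch below. The `DeadLetter` fields and the `DeadLetterQueue` class are hypothetical and only illustrate the idea. To keep the example free of I/O, persistence is reduced to `serialize` and `restore` on a JSON string; a real queue would write that string to disk or a database so entries survive a gateway restart.

```typescript
// Hypothetical record for a message that exhausted its retries.
interface DeadLetter {
  runId: string;
  channel: string;
  payload: string;
  error: string;
  attempts: number;
  failedAt: string; // ISO 8601 timestamp
}

class DeadLetterQueue {
  private readonly entries = new Map<string, DeadLetter>();

  // Store a failed message, keyed by run ID.
  add(entry: DeadLetter): void {
    this.entries.set(entry.runId, entry);
  }

  // List all failed messages, e.g. for admin inspection or user notification.
  list(): DeadLetter[] {
    return Array.from(this.entries.values());
  }

  // Remove and return one entry so it can be retried manually.
  take(runId: string): DeadLetter | undefined {
    const entry = this.entries.get(runId);
    this.entries.delete(runId);
    return entry;
  }

  // Snapshot the queue as JSON; the caller decides where to persist it.
  serialize(): string {
    return JSON.stringify(this.list());
  }

  // Rebuild a queue from a snapshot produced by serialize().
  static restore(json: string): DeadLetterQueue {
    const queue = new DeadLetterQueue();
    for (const entry of JSON.parse(json) as DeadLetter[]) {
      queue.add(entry);
    }
    return queue;
  }
}
```

Keying entries by run ID means a run that fails twice overwrites its earlier record instead of appearing twice in the admin view.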
Option 3: Health-Check Recovery
- Periodically check for runs that started but never completed
- If a run is "stuck" for > N minutes, attempt recovery
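The stuck-run check could be sketched as below. `RunTracker` and `reportStuckRuns` are hypothetical names, and the sketch assumes the gateway can call `start` and `complete` at the points where it currently logs "embedded run start" and run_completed. It keeps state in memory only, so a real version would need to persist the start records if the process itself can crash.

```typescript
// Tracks runs that have started but not yet completed.
class RunTracker {
  private readonly startedAt = new Map<string, number>();

  // Record the start time of a run (milliseconds since the epoch).
  start(runId: string, nowMs: number = Date.now()): void {
    this.startedAt.set(runId, nowMs);
  }

  // Forget a run once it has completed.
  complete(runId: string): void {
    this.startedAt.delete(runId);
  }

  // Return the IDs of runs that have been open for longer than maxAgeMs.
  findStuck(maxAgeMs: number, nowMs: number = Date.now()): string[] {
    const stuck: string[] = [];
    this.startedAt.forEach((started, runId) => {
      if (nowMs - started > maxAgeMs) {
        stuck.push(runId);
      }
    });
    return stuck;
  }
}

// One health-check pass: log a warning per stuck run and return the IDs
// so the caller can attempt recovery (retry, dead-letter, or notify).
function reportStuckRuns(
  tracker: RunTracker,
  maxAgeMs: number,
  nowMs: number = Date.now(),
): string[] {
  const stuck = tracker.findStuck(maxAgeMs, nowMs);
  for (const runId of stuck) {
    console.warn(`run ${runId} started but never completed`);
  }
  return stuck;
}
```

A periodic timer, for example `setInterval(() => reportStuckRuns(tracker, 5 * 60_000), 60_000)`, would run this pass once a minute with a five-minute threshold. Both values are illustrative stand-ins for the "N minutes" above.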
Environment
- OpenClaw version: 2026.2.2
- Platform: macOS (Darwin 25.2.0)
- Node.js: 22.22.0
- Channel: Telegram (polling mode)
- Model provider: Custom Anthropic proxy
Workaround
Currently, the only workaround is to manually restart the gateway when messages go unanswered:
launchctl kickstart -k gui/$UID/ai.openclaw.gateway
Impact
- Severity: High - users lose messages with no indication of failure
- Frequency: Rare (TLS errors are uncommon) but impactful when they occur
- User experience: Very poor - messages silently disappear