fix(base): retry transient send failures and notify user on exhaustion #3108

Closed
Mibayy wants to merge 1 commit into NousResearch:main from Mibayy:fix/send-retry-on-network-error

Mibayy commented Mar 26, 2026

Contributor

Fixes #2910

Problem

When send() fails due to a network error (ConnectError, ReadTimeout, etc.), the failure was silently logged and the user received no feedback — appearing as a hang or crash. In the reported case, a user waited 1+ hour for a response that had already been generated but failed to deliver.

Fix

Two additions to BasePlatformAdapter:

_is_retryable_error(error) — detects transient network failures by matching substrings (connecterror, timeout, connectionreset, broken pipe, etc.)

_send_with_retry() — wraps send() with three code paths:

| Error type | Behaviour |
| --- | --- |
| Success | Returns immediately, no overhead |
| Transient (network) | Retries up to 2x with exponential backoff + jitter. On exhaustion, sends user a delivery-failure notice. |
| Permanent (formatting, permission) | Falls back to plain-text version once, no retry loop. |

handle_message() now calls _send_with_retry() instead of send() directly. All existing adapters benefit automatically — no per-adapter changes needed.
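The three paths in the table can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `SendResult` fields other than `retryable`, the standalone `send_with_retry` function, the `is_transient` predicate parameter, the `plain_text` argument, and the 25% jitter are all assumptions made for the example.

```python
import asyncio
import random
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class SendResult:
    # Hypothetical shape; only the `retryable` flag is named in the PR.
    success: bool
    error: Optional[str] = None
    retryable: bool = False


# Abbreviated form of the notice quoted in the PR description.
FAILURE_NOTICE = "⚠️ Message delivery failed after multiple attempts."

Send = Callable[[str, str], Awaitable[SendResult]]


async def send_with_retry(
    send: Send,
    is_transient: Callable[[SendResult], bool],
    chat_id: str,
    content: str,
    plain_text: str,
    max_retries: int = 2,
    base_delay: float = 2.0,
) -> SendResult:
    result = await send(chat_id, content)
    retries = 0
    # Keep retrying only while the latest failure still looks transient.
    while not result.success and is_transient(result):
        if retries == max_retries:
            # Retries exhausted: tell the user instead of failing silently.
            await send(chat_id, FAILURE_NOTICE)
            return result
        # Exponential backoff (base_delay, 2 * base_delay, ...) plus jitter.
        # The jitter range is an assumption; the PR does not state it.
        delay = base_delay * (2 ** retries)
        await asyncio.sleep(delay + random.uniform(0, 0.25 * delay))
        retries += 1
        result = await send(chat_id, content)
    if result.success:
        return result
    # Permanent failure, or one that stopped looking transient mid-retry:
    # a single plain-text attempt, with no retry loop.
    return await send(chat_id, plain_text)
```

Re-checking the latest result on every iteration means a failure that changes from transient to permanent partway through falls to the plain-text path rather than producing the delivery-failure notice.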

Platform adapters can also set SendResult.retryable=True for platform-specific transient errors that don't match string patterns.
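As a rough sketch of how substring matching and the `retryable` flag might combine: the marker tuple below contains only the four substrings named above (the full list in the PR is not shown here), and `should_retry` plus the `success` and `error` fields are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Only the substrings named in the PR description; the real list is longer.
RETRYABLE_MARKERS = ("connecterror", "timeout", "connectionreset", "broken pipe")


@dataclass
class SendResult:
    # Hypothetical shape; only `retryable` is named in the PR.
    success: bool
    error: Optional[str] = None
    retryable: bool = False


def is_retryable_error(error: Optional[str]) -> bool:
    """Return True if the error text contains a known transient marker."""
    if not error:
        return False
    lowered = error.lower()  # case-insensitive match
    return any(marker in lowered for marker in RETRYABLE_MARKERS)


def should_retry(result: SendResult) -> bool:
    """A failure is retryable if the adapter flagged it or its text matches."""
    if result.success:
        return False
    return result.retryable or is_retryable_error(result.error)
```

Under this sketch, an adapter that sees a platform-specific transient failure whose message matches none of the markers would return `SendResult(success=False, error=..., retryable=True)` to opt in to the retry path.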

User experience

Before: network blip during send → silent failure, user waits indefinitely

After: network blip → 2 automatic retries (2s, 4s backoff) → if still failing:

⚠️ Message delivery failed after multiple attempts. Please try again — your request was processed but the response could not be sent.
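The 2s and 4s waits follow from doubling a 2-second base delay on each retry. A small sketch of that schedule, assuming the jitter is a random fraction added on top of each delay (the PR mentions jitter but not its range, so `jitter_ratio` and the function name are hypothetical):

```python
import random
from typing import List


def backoff_delays(
    max_retries: int = 2,
    base_delay: float = 2.0,
    jitter_ratio: float = 0.25,  # assumed; the PR does not state the range
) -> List[float]:
    """Return the wait before each retry: base_delay * 2**n plus jitter."""
    delays = []
    for attempt in range(max_retries):
        delay = base_delay * (2 ** attempt)
        # Jitter spreads out retries from clients that failed simultaneously.
        delays.append(delay + random.uniform(0, jitter_ratio * delay))
    return delays


# With jitter disabled, the schedule is exactly the one quoted above.
print(backoff_delays(jitter_ratio=0.0))  # [2.0, 4.0]
```

With the defaults, the worst-case added latency before the failure notice is roughly the sum of these delays plus the time spent in each failed `send()` call.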

Tests

26 new tests in tests/gateway/test_send_retry.py covering all paths. 1431 existing tests pass.

Fixes silent delivery failures where a network error during send()
left the user with no feedback, appearing as a hang or crash.

## Changes

BasePlatformAdapter now has two new helpers:

_is_retryable_error(error: str) -> bool
  Detects transient network errors by matching known substrings
  (ConnectError, timeout, ConnectionReset, BrokenPipe, etc.)

_send_with_retry(chat_id, content, ..., max_retries=2, base_delay=2.0)
  - Success on first attempt: returns immediately (no overhead)
  - Transient error (network): retries up to max_retries times with
    exponential backoff + jitter. On exhaustion, sends the user a
    delivery-failure notice so they know to retry rather than waiting.
  - Permanent error (formatting/permission): falls back to plain-text
    version immediately, without entering the retry loop.
  - SendResult.retryable=True respected for platform-specific retryable
    errors that don't match string patterns.

handle_message() now calls _send_with_retry() instead of send() directly.

## User experience

Before: network blip during send → silent failure, user waits 1+ hour
After:  network blip → 2 automatic retries → if still failing, user
        receives '⚠️ Message delivery failed after multiple attempts.
        Please try again — your request was processed but the response
        could not be sent.'

Closes NousResearch#2910
teknium1 pushed a commit that referenced this pull request Mar 26, 2026
…tion

When send() fails due to a network error (ConnectError, ReadTimeout, etc.),
the failure was silently logged and the user received no feedback — appearing
as a hang. In one reported case, a user waited 1+ hour for a response that
had already been generated but failed to deliver (#2910).

Adds _send_with_retry() to BasePlatformAdapter:
- Transient errors: retry up to 2x with exponential backoff + jitter
- On exhaustion: send delivery-failure notice so user knows to retry
- Permanent errors: fall back to plain-text version (preserves existing behavior)
- SendResult.retryable flag for platform-specific transient errors

All adapters benefit automatically via BasePlatformAdapter inheritance.

Cherry-picked from PR #3108 by Mibayy.
teknium1 added a commit that referenced this pull request Mar 27, 2026
…tion (#3288)

When send() fails due to a network error (ConnectError, ReadTimeout, etc.),
the failure was silently logged and the user received no feedback — appearing
as a hang. In one reported case, a user waited 1+ hour for a response that
had already been generated but failed to deliver (#2910).

Adds _send_with_retry() to BasePlatformAdapter:
- Transient errors: retry up to 2x with exponential backoff + jitter
- On exhaustion: send delivery-failure notice so user knows to retry
- Permanent errors: fall back to plain-text version (preserves existing behavior)
- SendResult.retryable flag for platform-specific transient errors

All adapters benefit automatically via BasePlatformAdapter inheritance.

Cherry-picked from PR #3108 by Mibayy.

Co-authored-by: Mibayy <mibayy@users.noreply.github.com>
@teknium1

Contributor

Merged via PR #3288. Your commit was cherry-picked onto current main with authorship preserved. A few improvements on top: removed the unused event param, hoisted import random to module level, and fixed a subtle for/else logic bug where an error that changed from network to non-network mid-retry would send a misleading delivery-failure notice instead of the plain-text fallback. Also added a test for that path. 27 tests total. Thanks for the contribution!
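For readers unfamiliar with the construct, here is a hypothetical illustration of how a `for`/`else` loop can separate "retries exhausted" from "error stopped being retryable". It is not the merged code; the `deliver` function and its `(success, transient)` outcome tuples are invented for the example.

```python
from typing import List, Tuple


def deliver(outcomes: List[Tuple[bool, bool]], max_retries: int = 2) -> str:
    """Simulate delivery; `outcomes` holds (success, transient) per attempt."""
    attempts = iter(outcomes)
    for _ in range(1 + max_retries):
        success, transient = next(attempts)
        if success:
            return "delivered"
        if not transient:
            # Permanent error: leave the loop, which skips the `else` clause.
            break
    else:
        # Runs only when the loop finished without `break`, i.e. every
        # attempt failed with a transient error.
        return "failure-notice"
    return "plain-text-fallback"


print(deliver([(False, True)] * 3))                # failure-notice
print(deliver([(False, True), (False, False)]))    # plain-text-fallback
```

The second call is the case described in the comment: the first failure is a network error, the retry fails for a non-network reason, and the user should get the plain-text fallback rather than the failure notice.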

teknium1 closed this Mar 27, 2026
angelburgosrosado pushed a commit to angelburgosrosado/hermes-agent that referenced this pull request Apr 27, 2026
…tion (NousResearch#3288)

02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…tion (NousResearch#3288)

olympus-terminal pushed a commit to olympus-terminal/hermes-agent that referenced this pull request May 16, 2026
…tion (NousResearch#3288)

gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…tion (NousResearch#3288)

Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…tion (NousResearch#3288)

Development

Successfully merging this pull request may close these issues.

Telegram message delivery failure not surfaced to user - appears as 'hang/crash'
