
fix(compression): retry transient transport errors before fallback marker (#16670) #16737

Open
Tranquil-Flow wants to merge 2 commits into NousResearch:main from Tranquil-Flow:fix/16670-compression-incomplete-chunked-retry

Conversation

@Tranquil-Flow

Contributor

What does this PR do?

Auxiliary compression's call_llm request can hit RemoteProtocolError ("peer closed connection without sending complete message body (incomplete chunked read)") mid-stream when the auxiliary endpoint hiccups. _generate_summary's generic except block in agent/context_compressor.py treated this like any other failure: 60-second cooldown, drop the selected turns, insert a fallback context marker, repeat on the next compaction. Long Telegram sessions surface this often enough that real context is permanently lost.

This PR adds a bounded in-call retry only for fast-fail mid-stream disconnect/protocol classes. Timeout-class errors are deliberately excluded (see Changes Made for why) so a genuinely slow endpoint can't stall a compaction for multiple timeout windows in a row.
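A bounded in-call retry of this kind can be sketched as follows. This is a minimal illustration, not the PR's code: `call_with_transient_retry`, its parameters, and the injectable `sleep` are hypothetical names chosen for the example; only the 1 s / 3 s back-offs and the "retry transient errors only" rule come from the description.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_transient_retry(
    call: Callable[[], T],
    is_transient: Callable[[BaseException], bool],
    backoffs: Sequence[float] = (1.0, 3.0),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); retry only the errors that is_transient() accepts.

    Hypothetical helper: one initial attempt plus len(backoffs) retries.
    """
    for delay in backoffs:
        try:
            return call()
        except Exception as exc:
            if not is_transient(exc):
                # Non-transient (e.g. timeout, HTTP status): re-raise at once
                # so the caller's existing cooldown path handles it.
                raise
            sleep(delay)
    # Final attempt: any exception now propagates to the caller.
    return call()
```

Passing `sleep` as a parameter is a design choice for the sketch only; it lets a test record the back-off delays instead of actually waiting.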

Related Issue

Fixes #16670

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • agent/error_classifier.py: adds an is_transient_transport_error(error) predicate and a narrow _DISCONNECT_TRANSPORT_TYPES registry. The predicate returns True only for fast-fail mid-stream disconnect/protocol classes (RemoteProtocolError, ConnectError, ServerDisconnectedError, ConnectionResetError, ConnectionAbortedError, BrokenPipeError, ReadError, APIConnectionError, the SSL transport types). Status-coded errors (4xx/5xx) and all timeout classes (TimeoutError, ReadTimeout, ConnectTimeout, PoolTimeout, APITimeoutError) are explicitly NOT transient: each retry of a timeout pays the full timeout window again, so with the 120 s compression default and three attempts, one missed compaction would become a ~6 minute stalled response. Timeouts continue to use the existing cooldown path.
  • agent/context_compressor.py: wraps the single call_llm invocation in _generate_summary in a retry loop of one initial attempt plus up to two retries (1 s / 3 s back-offs), gated on is_transient_transport_error. Non-transient errors fall straight through to the existing cooldown / model-fallback logic, so 401/404/timeout handling is unchanged.
  • tests/agent/test_error_classifier.py: adds TestIsTransientTransportError, covering disconnect strings (incomplete chunked read), RemoteProtocolError, ConnectionError, rejection of HTTP-status errors, and explicit non-retry coverage for TimeoutError, ReadTimeout, APITimeoutError, ConnectTimeout, PoolTimeout.
  • tests/agent/test_context_compressor.py: adds TestSummaryTransientRetry, covering retry-then-succeed for both incomplete chunked read strings and RemoteProtocolError-named classes, exhausted retries falling through to cooldown, non-transient HTTP errors not retrying, and TimeoutError not entering the retry loop (single call, cooldown still set).
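The classification rule described in the first bullet can be approximated with a name-based check. The sketch below is an assumption-laden illustration rather than the contents of agent/error_classifier.py: the function name `is_transient_transport_error_sketch`, the constants, and the choice to match class names across the MRO (to avoid importing any HTTP client) are all hypothetical, and the SSL transport types are omitted because the description does not name them.

```python
# Class names taken from the PR description; matching by name is an assumption.
DISCONNECT_NAMES = frozenset({
    "RemoteProtocolError", "ConnectError", "ServerDisconnectedError",
    "ConnectionResetError", "ConnectionAbortedError", "BrokenPipeError",
    "ReadError", "APIConnectionError",
})
TIMEOUT_NAMES = frozenset({
    "TimeoutError", "ReadTimeout", "ConnectTimeout", "PoolTimeout",
    "APITimeoutError",
})
# Substrings of the disconnect message quoted in the PR description.
DISCONNECT_PHRASES = ("incomplete chunked read", "peer closed connection")


def is_transient_transport_error_sketch(error: BaseException) -> bool:
    """Return True only for fast-fail disconnect/protocol style errors."""
    names = {cls.__name__ for cls in type(error).__mro__}
    # Timeouts are checked first: a timeout class may inherit from a
    # connection-error class and must still be rejected.
    if names & TIMEOUT_NAMES:
        return False
    # Anything carrying an HTTP status is handled by the status-code path.
    if getattr(error, "status_code", None) is not None:
        return False
    if names & DISCONNECT_NAMES:
        return True
    message = str(error).lower()
    return any(phrase in message for phrase in DISCONNECT_PHRASES)
```

Checking timeouts before disconnects is the part worth noting: with the figures quoted above, three attempts at the 120 s default would cost 3 × 120 s = 360 s, which is the ~6 minute stall the exclusion is meant to prevent.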

How to Test

Reproducing the original bug requires a flaky auxiliary endpoint, but the failure mode is identical to the issue's repro: the agent's compaction logs "Failed to generate context summary: peer closed connection without sending complete message body (incomplete chunked read). Further summary attempts paused for 60 seconds." followed by user-visible "⚠ Compression summary failed: ..." warnings. After this PR, transient mid-stream disconnects retry up to two times before that warning fires.
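Without a real flaky endpoint, the disconnect can be simulated with a stub that fails a fixed number of times before succeeding. The sketch below uses `unittest.mock.Mock` with a `side_effect` list; the `RemoteProtocolError` class here is a local stand-in (not the HTTP client's class), and `make_flaky_call` and `attempt` are hypothetical helpers, not the PR's test code.

```python
from unittest.mock import Mock


class RemoteProtocolError(Exception):
    """Local stand-in for the transport library's exception class."""


MESSAGE = (
    "peer closed connection without sending complete message body "
    "(incomplete chunked read)"
)


def make_flaky_call(failures: int, result: str = "summary") -> Mock:
    """Return a mock that raises `failures` times, then returns `result`."""
    # Exception instances in a side_effect list are raised, other items returned.
    return Mock(side_effect=[RemoteProtocolError(MESSAGE)] * failures + [result])


def attempt(call, max_attempts: int):
    """Call up to max_attempts times; return (result, last_error)."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return call(), None
        except RemoteProtocolError as exc:
            last_error = exc
    return None, last_error
```

With two simulated failures and three attempts the stub yields the summary on the third call; with three failures the attempts are exhausted, which corresponds to the case where the existing cooldown path would take over.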

Automated:

pytest tests/agent/test_context_compressor.py tests/agent/test_error_classifier.py -q

Result on macOS 15.6.1 / Python 3.14.2: 190 passed. Reviewer also ran the same command in a separate checkout and reported 190 passed.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 15.6.1 (Python 3.14.2)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A (docstring on the new helper documents the timeout-exclusion rationale)
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A (N/A — no config keys touched; retry budget tuned via existing module-level constants)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A (N/A)
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A (no platform-specific syscalls; relies only on time.sleep and existing transport-error registries)
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A (N/A — internal compressor, not a tool)

Screenshots / Logs

$ pytest tests/agent/test_context_compressor.py tests/agent/test_error_classifier.py -q
........................................................................ [ 37%]
........................................................................ [ 75%]
...............................................                          [100%]
190 passed, 190 warnings in 15.26s

@alt-glitch added the type/bug (Something isn't working), P1 (High: major feature broken, no workaround), and comp/agent (Core agent loop, run_agent.py, prompt builder) labels on Apr 27, 2026
@alt-glitch

Collaborator

Related to #16587 — both address transient transport error retry in context compression. This PR appears to be a more refined version with explicit error classifier and narrower retry scope.

@Tranquil-Flow force-pushed the fix/16670-compression-incomplete-chunked-retry branch from 59f9e75 to 5b198e8 on May 25, 2026 13:59

Labels

comp/agent (Core agent loop, run_agent.py, prompt builder), P1 (High: major feature broken, no workaround), type/bug (Something isn't working)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Compression fallback marker after incomplete chunked read loses useful context in long sessions

2 participants