fix(agent): retry transient compression summary transport errors by ARegalado1 · Pull Request #16587 · NousResearch/hermes-agent

ARegalado1 · 2026-04-27T14:10:07Z

Summary

Retry context compression summary generation once for transient auxiliary transport, stream, timeout, and 5xx failures before falling back.
Preserve existing no-retry behavior for auth and other non-408 client 4xx errors.
Add regression coverage for incomplete chunked reads, generic APIConnectionError("Connection error."), transient HTTP status-code paths, auth/client no-retry, and second transient failure cooldown.

Context

I hit a transient compression-summary failure where the auxiliary call reported peer closed connection without sending complete message body (incomplete chunked read). Today that path goes straight to fallback/cooldown, even though a single retry is often enough for stream/transport disconnects.

This is intentionally narrower than the open Responses API role=tool replay work around #5709. It does not change message conversion or tool replay behavior.

Refresh note: I rebased this branch onto current main, checked for duplicate/superseding PRs or issues for this exact compressor retry path, and added direct coverage for the status-code branches.

Test Plan

python -m py_compile agent/context_compressor.py
python -m pytest tests/agent/test_context_compressor.py -q -o 'addopts=' — 73 passed
git diff --check upstream/main...HEAD

ARegalado1 · 2026-05-03T14:36:40Z

Refresh update:

Rebased this PR branch onto current main.
Added direct regression coverage for transient HTTP status-code paths (408, 5xx, including response.status_code).
Rechecked for duplicate/superseding PRs or issues for this exact compressor retry path and did not find any.
Verified locally:
- python -m py_compile agent/context_compressor.py
- python -m pytest tests/agent/test_context_compressor.py -q -o 'addopts=' — 73 passed
- git diff --check upstream/main...HEAD

No CI checks are currently attached to this fork PR from GitHub.

teknium1 · 2026-06-08T05:19:19Z

Fixed on main via #41714 (commit 2789bf4).

The auxiliary Codex Responses path now routes through the same shared converter the main agent transport uses (_chat_messages_to_responses_input), so tool history is encoded as function_call / function_call_output items with a valid call_id instead of leaking raw role=tool into Responses input[]. Single conversion path means the auxiliary and main paths can't drift apart again — which was the explicit scope of #5709.

Thanks for the fix — closing as resolved by #41714.

teknium1 · 2026-06-08T05:19:46Z

Reopening — I closed this in error. This PR is independent of the #5709 role=tool Responses conversion fix (#41714); it adds single-retry handling for transient compression-summary transport failures (incomplete chunked reads, stream disconnects, 5xx) in agent/context_compressor.py, which my change does not touch or supersede. Apologies for the mistaken close. This stands on its own and will be triaged separately on its merits.

…16587) A one-off transient transport failure (streaming-close / incomplete chunked read / 5xx / 408) on an auxiliary LLM call escalated straight to provider/model fallback (or, for context compression, dropped the summary and entered cooldown), even when an immediate retry on the same provider would have succeeded. Add a single same-target retry at the top of call_llm() and async_call_llm() — before the existing except-chain — gated on a new _is_transient_transport_error() that reuses the canonical _is_connection_error() detector plus a 5xx/408 status check. A second failure (or any non-transient error: auth, other 4xx, malformed payload) falls through to first_err and the existing fallback handling unchanged. This lives in call_llm so every auxiliary task (compression, memory flush, title generation, session search, vision) shares one transient-retry surface, rather than each caller re-implementing it. The context compressor needs no change — it calls call_llm and inherits the retry; its existing fallback-to-main path (#18458) now composes naturally (retry the aux model once, then fall back to main only if the retry also fails). Co-authored-by: ARegalado1 <alberto.regalado@ymail.com>

teknium1 · 2026-06-08T08:06:37Z

Merged via #41885 (commit 02a4d66).

Your fix idea — retry a transient compression-summary transport error once before giving up — shipped, but generalized one layer down. Instead of retrying inside the context-compression caller with a private error classifier, the retry now lives in call_llm() / async_call_llm() (a single same-target retry before the existing fallback chain), gated on the canonical _is_connection_error() detector + a 5xx/408 check. That way every auxiliary task — compression, memory flush, title generation, session search, vision — gets the same transient resilience, and there's no duplicate error list to drift.

The context compressor needed no change after that: it calls call_llm and inherits the retry, and its existing fallback-to-main path composes naturally (retry the aux model once, then fall back to main only if the retry also fails).

Your authorship is preserved as a co-author on the merged commit, and you're in the release AUTHOR_MAP. Thanks for the report and the fix.

…ousResearch#16587) A one-off transient transport failure (streaming-close / incomplete chunked read / 5xx / 408) on an auxiliary LLM call escalated straight to provider/model fallback (or, for context compression, dropped the summary and entered cooldown), even when an immediate retry on the same provider would have succeeded. Add a single same-target retry at the top of call_llm() and async_call_llm() — before the existing except-chain — gated on a new _is_transient_transport_error() that reuses the canonical _is_connection_error() detector plus a 5xx/408 status check. A second failure (or any non-transient error: auth, other 4xx, malformed payload) falls through to first_err and the existing fallback handling unchanged. This lives in call_llm so every auxiliary task (compression, memory flush, title generation, session search, vision) shares one transient-retry surface, rather than each caller re-implementing it. The context compressor needs no change — it calls call_llm and inherits the retry; its existing fallback-to-main path (NousResearch#18458) now composes naturally (retry the aux model once, then fall back to main only if the retry also fails). Co-authored-by: ARegalado1 <alberto.regalado@ymail.com>

alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder labels Apr 27, 2026

fix(agent): retry transient compression summary transport errors

0312f72

ARegalado1 force-pushed the fix/aux-alerts-2026-04-27 branch from 7061178 to 0312f72 Compare May 3, 2026 14:35

alt-glitch mentioned this pull request May 14, 2026

fix(context): retry summary transport failures with reduced prompt #25370

Closed

teknium1 closed this Jun 8, 2026

teknium1 reopened this Jun 8, 2026

teknium1 mentioned this pull request Jun 8, 2026

fix(auxiliary): retry transient transport error once before fallback (#16587) #41885

Merged

teknium1 closed this Jun 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agent): retry transient compression summary transport errors#16587

fix(agent): retry transient compression summary transport errors#16587
ARegalado1 wants to merge 1 commit into
NousResearch:mainfrom
ARegalado1:fix/aux-alerts-2026-04-27

ARegalado1 commented Apr 27, 2026 •

edited

Loading

Uh oh!

ARegalado1 commented May 3, 2026

Uh oh!

teknium1 commented Jun 8, 2026

Uh oh!

teknium1 commented Jun 8, 2026

Uh oh!

teknium1 commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ARegalado1 commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Context

Test Plan

Uh oh!

ARegalado1 commented May 3, 2026

Uh oh!

teknium1 commented Jun 8, 2026

Uh oh!

teknium1 commented Jun 8, 2026

Uh oh!

teknium1 commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ARegalado1 commented Apr 27, 2026 •

edited

Loading