[Bug] Codex APIConnectionError retry rate ~8x higher post-v0.13.0; persists with #12953 applied; suspected commit 5533ad764 strict stream-timeout enforcement #22986

@QuarkAssistant

Summary

After upgrading to v0.13.0 (v2026.5.7), the rate of APIConnectionError: Connection error. retries against chatgpt.com/backend-api/codex increased ~8x on the same workload. The retries persist even with PR #12953 (custom keepalive transport bypass) cherry-picked locally. Commit 5533ad764 (fix(auxiliary): enforce Codex Responses stream timeout) is suspected as a contributing cause via a too-tight 120s default.

Environment

Empirical data

Logs counted across ~/.hermes/logs/agent.log, ~/.hermes/profiles/*/logs/agent.log, and ~/.hermes/kanban/logs/t_*.log:

  • 2026-05-08 (pre-upgrade, full day): 21 retries.
  • 2026-05-09 12:00–22:00 CDT (post-upgrade): 171+ retries.
  • Same user, same workload class (Telegram-driven agent interactions plus an internal multi-agent workload), same network/IP.

Post-upgrade hourly peaks: 33–48 retries/hour. Comparable pre-upgrade hours: 0–2 retries/hour.
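
The counts above can be reproduced with a script along the following lines. This is a minimal sketch, not the exact tooling used for this report. It assumes each log line starts with a YYYY-MM-DD HH:MM:SS timestamp (the real Hermes log format may differ) and that a retry line contains both "API call failed" and "APIConnectionError". The helper name count_retries_by_hour is hypothetical.

```python
import glob
import os
import re
from collections import Counter
from typing import Iterable

# Assumption: log lines begin with "YYYY-MM-DD HH:MM:SS"; adjust to the real format.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2}")


def count_retries_by_hour(lines: Iterable[str]) -> Counter:
    """Count APIConnectionError retry lines, keyed by 'YYYY-MM-DD HH'."""
    counts: Counter = Counter()
    for line in lines:
        if "API call failed" not in line or "APIConnectionError" not in line:
            continue
        match = TS_RE.match(line)
        if match:  # lines without a recognizable timestamp are skipped
            counts[f"{match.group(1)} {match.group(2)}"] += 1
    return counts


if __name__ == "__main__":
    # The three log locations listed in this report.
    patterns = [
        "~/.hermes/logs/agent.log",
        "~/.hermes/profiles/*/logs/agent.log",
        "~/.hermes/kanban/logs/t_*.log",
    ]
    total: Counter = Counter()
    for pattern in patterns:
        for path in glob.glob(os.path.expanduser(pattern)):
            with open(path, encoding="utf-8", errors="replace") as fh:
                total.update(count_retries_by_hour(fh))
    for hour, n in sorted(total.items()):
        print(f"{hour}:00  {n}")
```

Bucketing by hour makes the post-upgrade peaks directly comparable with the pre-upgrade baseline.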

Symptom

Every retry in the post-upgrade window has the same signature:

⚠️  API call failed (attempt 1/3): APIConnectionError
   🔌 Provider: openai-codex  Model: gpt-5.5
   🌐 Endpoint: https://chatgpt.com/backend-api/codex
   📝 Error: Connection error.
   ⏱️  Elapsed: 0.13–0.5s  Context: varies
⏳ Retrying in 2-5s (attempt 1/3)...

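Sub-second elapsed times suggest the connection is reset (RST) around TLS-handshake time, before any response is streamed. About 95% or more of retried calls succeed within 1–3 attempts, and no max-retries-exhausted events were logged.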

Hypothesis

Commit 5533ad764 enforces a hard total-elapsed timeout on the Codex Responses auxiliary stream. The default auxiliary.compression.timeout of 120s is too tight for compression workloads, where 200K+ token sessions on gpt-5.4-mini routinely take 60–180s. Before the commit, slow streams ran to completion. After it, they time out at 120s and raise TimeoutError, which is classified as retryable, so the outer agent loop fires a retry cycle. That retry often surfaces as a transport-level APIConnectionError because _close_client_on_timeout forcibly calls client.close().
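
The hypothesized mechanism can be illustrated as follows. This is a simplified sketch of a hard total-elapsed deadline on a stream, not the code from 5533ad764. The names iter_with_deadline, on_timeout, and clock are hypothetical, and on_timeout only stands in for the forced client close described above.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_with_deadline(
    stream: Iterable[T],
    timeout_s: float,
    on_timeout: Callable[[], None] = lambda: None,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[T]:
    """Yield items from `stream` until the total elapsed time exceeds `timeout_s`.

    The deadline covers the whole stream, not the gap between chunks, so a
    healthy but slow stream is cut off once the budget is spent.
    Note: the check runs only when a new item arrives; a read that blocks
    forever would need a socket-level timeout as well.
    """
    deadline = clock() + timeout_s
    for item in stream:
        if clock() > deadline:
            on_timeout()  # stand-in for closing the HTTP client
            raise TimeoutError(f"stream exceeded {timeout_s}s total")
        yield item
```

Under this model, a compression stream that would have finished at 130s raises TimeoutError with a 120s budget but completes with 300s, which is the behavior change the hypothesis describes.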

Workaround

In ~/.hermes/config.yaml:

auxiliary:
  compression:
    timeout: 300   # was 120

load_config() is mtime-cached, so no restart is needed. This reduces the compression-driven retry class. Connection-class retries (covered by #12952) are a separate problem.

Related

This issue is for the post-v0.13.0 amplification specifically, not the underlying transport instability that pre-existed.

Asks

  1. Confirm whether 5533ad764's 120s default is intended for production Codex-OAuth-on-ChatGPT-account workloads or should be raised.
  2. Consider exposing a per-call override or auto-scaling the deadline based on observed stream throughput.
  3. Either way, document the workaround for users hitting the regression.
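
As one possible shape for ask 2, the deadline could be derived from observed throughput instead of being fixed. This is only a sketch to anchor discussion. The function scaled_timeout and all of its parameters are hypothetical, and the floor, ceiling, and safety factor are illustrative values, not measured ones.

```python
def scaled_timeout(
    expected_output_tokens: int,
    observed_tokens_per_s: float,
    floor_s: float = 120.0,      # never go below the current default
    ceiling_s: float = 600.0,    # illustrative upper bound
    safety_factor: float = 2.0,  # headroom over the raw estimate
) -> float:
    """Estimate a total-stream deadline from recent throughput, clamped to a range."""
    if observed_tokens_per_s <= 0:
        # No usable throughput sample yet: fall back to the most lenient bound.
        return ceiling_s
    estimate = expected_output_tokens / observed_tokens_per_s * safety_factor
    return max(floor_s, min(ceiling_s, estimate))


# Example: a 9000-token summary at 50 tokens/s gets 360s instead of 120s.
print(scaled_timeout(9000, 50.0))  # 360.0
```

Clamping to a floor and a ceiling keeps short calls on the existing default while bounding how long a stalled stream can hold the agent loop.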

Metadata

Labels

  • P2: Medium, degraded but workaround exists
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • type/bug: Something isn't working