Summary
After upgrading to v0.13.0 (v2026.5.7), the retry rate for APIConnectionError: Connection error. against chatgpt.com/backend-api/codex increased ~8x under the same workload. The retries persist even with PR #12953 (custom keepalive transport bypass) cherry-picked locally. The suspected contributing cause is commit 5533ad764 (fix(auxiliary): enforce Codex Responses stream timeout), whose 120s default appears too tight.
Environment
Empirical data
Logs counted across ~/.hermes/logs/agent.log, ~/.hermes/profiles/*/logs/agent.log, and ~/.hermes/kanban/logs/t_*.log:
- 2026-05-08 (pre-upgrade, full day): 21 retries.
- 2026-05-09 12:00–22:00 CDT (post-upgrade): 171+ retries.
- Same user, same workload class (Telegram-driven agent interactions plus an internal multi-agent workload), same network/IP.
Hourly post-upgrade peaks: 33–48 retries/hour. Pre-upgrade comparable: 0–2 retries/hour.
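For anyone who wants to reproduce the counts, here is a minimal sketch of an hourly tally. It assumes each retry line begins with a "YYYY-MM-DD HH:MM:SS" timestamp, which may not match the real Hermes log prefix. The helper names count_retries_per_hour and iter_log_lines are hypothetical and not part of Hermes.

```python
import glob
import os
import re
from collections import Counter
from typing import Iterable, Iterator

# Assumption: retry lines start with "YYYY-MM-DD HH:MM:SS". Adjust the regex
# if the actual log prefix differs.
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2}")
RETRY_MARKER = "API call failed"
ERROR_NAME = "APIConnectionError"


def iter_log_lines(patterns: Iterable[str]) -> Iterator[str]:
    """Yield every line from all files matching the given glob patterns."""
    for pattern in patterns:
        for path in glob.glob(os.path.expanduser(pattern)):
            with open(path, "r", encoding="utf-8", errors="replace") as handle:
                yield from handle


def count_retries_per_hour(lines: Iterable[str]) -> Counter:
    """Count APIConnectionError retry lines, keyed by 'YYYY-MM-DD HH'."""
    per_hour: Counter = Counter()
    for line in lines:
        if RETRY_MARKER not in line or ERROR_NAME not in line:
            continue
        match = TIMESTAMP_RE.match(line)
        if match:  # lines without a recognizable timestamp are skipped
            per_hour[f"{match.group(1)} {match.group(2)}"] += 1
    return per_hour
```

Passing the three log globs listed above to iter_log_lines and feeding the result into count_retries_per_hour gives a per-hour table that can be compared across the two days.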
Symptom
Every retry in the post-upgrade window has the same signature:
⚠️ API call failed (attempt 1/3): APIConnectionError
🔌 Provider: openai-codex Model: gpt-5.5
🌐 Endpoint: https://chatgpt.com/backend-api/codex
📝 Error: Connection error.
⏱️ Elapsed: 0.13–0.5s Context: varies
⏳ Retrying in 2-5s (attempt 1/3)...
Sub-second elapsed time points to a connection reset (RST) during the TLS handshake. At least ~95% of retries succeed within 1–3 attempts, and no max-retries-exhausted events were logged.
Hypothesis
Commit 5533ad764 enforces a hard total-elapsed timeout on the Codex Responses auxiliary stream. The default auxiliary.compression.timeout of 120s is too tight for compression workloads, where 200K+ token sessions on gpt-5.4-mini routinely take 60–180s. Before the commit, slow streams completed. After it, they time out at 120s, raise TimeoutError, and are classified as retryable; the outer agent loop then fires a retry cycle that often surfaces as a transport-level APIConnectionError because of the forcible client.close() in _close_client_on_timeout.
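To illustrate the failure mode, here is a minimal sketch of a total-elapsed deadline on a stream. It is not the code from 5533ad764. The names enforce_total_deadline, FakeClock, and slow_stream are hypothetical, and the fake clock exists only so the example runs instantly. The point is that a stream delivering chunks steadily is still cut off once total elapsed time passes the deadline.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def enforce_total_deadline(
    chunks: Iterable[T],
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[T]:
    """Yield chunks until total elapsed time exceeds timeout_s.

    Hypothetical helper: the deadline is measured from the start of the
    stream, so steady progress does not extend it.
    """
    deadline = clock() + timeout_s
    for chunk in chunks:
        if clock() > deadline:
            raise TimeoutError(f"stream exceeded {timeout_s}s total deadline")
        yield chunk


class FakeClock:
    """Manually advanced clock so the demo needs no real waiting."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


def slow_stream(clock: FakeClock, count: int, step_s: float) -> Iterator[int]:
    """Simulate a stream that takes step_s seconds to produce each chunk."""
    for index in range(count):
        clock.now += step_s
        yield index
```

With a 120s deadline and one chunk every 30s, a six-chunk stream (180s in total) delivers its first four chunks and then raises TimeoutError, even though it never stalled.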
Workaround
In ~/.hermes/config.yaml:
auxiliary:
  compression:
    timeout: 300 # was 120
load_config() is mtime-cached, so no restart is needed. This reduces the compression-driven retry class. Connection-class retries (covered by #12952) are separate.
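For readers unfamiliar with mtime caching, here is a rough sketch of the general technique. It is not the actual load_config() from Hermes. The class MtimeCachedLoader is hypothetical, and JSON is used as the default parser only to keep the example stdlib-only (the real config is YAML).

```python
import json
import os
from typing import Any, Callable, Optional, Tuple


class MtimeCachedLoader:
    """Re-parse a file only when its modification time changes.

    Hypothetical sketch of mtime-based caching; the parser is pluggable so a
    YAML loader could be passed in instead of json.loads.
    """

    def __init__(self, path: str, parse: Callable[[str], Any] = json.loads) -> None:
        self.path = path
        self.parse = parse
        self._cached: Optional[Tuple[int, Any]] = None
        self.parse_count = 0  # exposed so cache hits are observable

    def load(self) -> Any:
        mtime_ns = os.stat(self.path).st_mtime_ns
        if self._cached is None or self._cached[0] != mtime_ns:
            # File is new or was modified since the last parse: reload it.
            with open(self.path, "r", encoding="utf-8") as handle:
                self._cached = (mtime_ns, self.parse(handle.read()))
            self.parse_count += 1
        return self._cached[1]
```

Under this scheme an edited timeout value is picked up on the next load() call after the file is saved, which is consistent with the no-restart behavior described above.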
Related
This issue is for the post-v0.13.0 amplification specifically, not the underlying transport instability that pre-existed.
Asks
- Confirm whether 5533ad764's 120s default is intended for production Codex-OAuth-on-ChatGPT-account workloads or should be tuned higher.
- Consider exposing a per-call override or auto-scaling the deadline based on observed stream throughput.
- Either way, document the workaround for users hitting the regression.
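As one possible shape for the auto-scaling ask, here is a minimal sketch that derives the deadline from observed throughput. The function scaled_deadline_s and its parameters are hypothetical. The 120s floor mirrors the current default, while the 600s ceiling and 1.5x safety factor are placeholder assumptions, not proposed values.

```python
def scaled_deadline_s(
    expected_tokens: int,
    observed_tokens_per_s: float,
    floor_s: float = 120.0,      # current default, used as the lower bound
    ceiling_s: float = 600.0,    # assumed upper bound, for illustration only
    safety_factor: float = 1.5,  # assumed headroom over the raw estimate
) -> float:
    """Estimate a stream deadline from expected size and observed throughput.

    Hypothetical helper: returns the ceiling when no throughput has been
    observed yet, otherwise the padded estimate clamped to [floor, ceiling].
    """
    if observed_tokens_per_s <= 0:
        return ceiling_s
    estimate_s = expected_tokens / observed_tokens_per_s * safety_factor
    return max(floor_s, min(ceiling_s, estimate_s))
```

For example, 6000 expected output tokens at an observed 30 tokens/s gives a 200s estimate, padded to a 300s deadline, whereas short outputs stay at the 120s floor.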
Environment
- eeef486 baseline + two local cherry-picks (aaa700c65 = PR #12953 "fix(codex): avoid custom keepalive transport on chatgpt backend", keepalive bypass; 4ce6c96e2 = PR #19485 "fix(auxiliary): resolve provider/model from live runtime, not stale config", runtime TLS).
- Provider: openai-codex (ChatGPT Codex OAuth).
- Models: gpt-5.5 main, gpt-5.4-mini auxiliary compression.
- Endpoint: chatgpt.com/backend-api/codex.