[Bug] Codex APIConnectionError retry rate ~8x higher post-v0.13.0; persists with #12953 applied; suspected commit 5533ad764 strict stream-timeout enforcement

## Summary

After upgrading to v0.13.0 (v2026.5.7), the `APIConnectionError: Connection error.` retry rate against `chatgpt.com/backend-api/codex` rose roughly 8x on the same workload. The retries persist even with PR #12953 (custom keepalive transport bypass) cherry-picked locally. I suspect commit `5533ad764 fix(auxiliary): enforce Codex Responses stream timeout` is a contributing cause, because its 120s default is too tight.

## Environment

- Hermes Agent: v0.13.0 (v2026.5.7), commit `eeef486` baseline + two local cherry-picks (`aaa700c65` = PR #12953 keepalive bypass, `4ce6c96e2` = PR #19485 runtime TLS).
- Provider: `openai-codex` (ChatGPT Codex OAuth).
- Models: `gpt-5.5` main, `gpt-5.4-mini` auxiliary compression.
- Backend: `chatgpt.com/backend-api/codex`.
- Platform: macOS 15.x arm64.

## Empirical data

Logs counted across `~/.hermes/logs/agent.log`, `~/.hermes/profiles/*/logs/agent.log`, and `~/.hermes/kanban/logs/t_*.log`:

- 2026-05-08 (pre-upgrade, full day): 21 retries.
- 2026-05-09 12:00–22:00 CDT (post-upgrade): 171+ retries.
- Same user, same workload class (Telegram-driven agent interactions plus an internal multi-agent workload), same network/IP.

Post-upgrade hourly peaks: 33–48 retries/hour. Comparable pre-upgrade hours: 0–2 retries/hour.
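For anyone who wants to reproduce the counts, here is a minimal sketch of how the retry banners can be bucketed per hour. The banner pattern is taken from the Symptom section; the `YYYY-MM-DD HH:MM:SS` line prefix is an assumption about the log format, and `count_retries_per_hour` and `scan_logs` are hypothetical helper names, not part of Hermes.

```python
import glob
import os
import re
from collections import Counter
from typing import Iterable, List

# Signature taken from the retry banner quoted in this report.
RETRY_RE = re.compile(r"API call failed \(attempt \d+/\d+\): APIConnectionError")
# ASSUMPTION: log lines start with "YYYY-MM-DD HH:MM:SS"; adjust if the
# real timestamp format differs.
HOUR_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):")


def count_retries_per_hour(lines: Iterable[str]) -> Counter:
    """Map 'YYYY-MM-DD HH' to the number of APIConnectionError retry banners."""
    counts: Counter = Counter()
    current_hour = "unknown"  # used until the first timestamped line is seen
    for line in lines:
        stamp = HOUR_RE.match(line)
        if stamp:
            current_hour = stamp.group(1)
        if RETRY_RE.search(line):
            counts[current_hour] += 1
    return counts


def scan_logs(patterns: List[str]) -> Counter:
    """Aggregate hourly retry counts over every file matching the glob patterns."""
    total: Counter = Counter()
    for pattern in patterns:
        for path in glob.glob(os.path.expanduser(pattern)):
            with open(path, encoding="utf-8", errors="replace") as fh:
                total.update(count_retries_per_hour(fh))
    return total
```

Calling `scan_logs` with the three log globs listed above and printing the sorted items gives a per-hour table that can be compared across the two days.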

## Symptom

Every retry in the post-upgrade window has the same signature:

```
⚠️  API call failed (attempt 1/3): APIConnectionError
   🔌 Provider: openai-codex  Model: gpt-5.5
   🌐 Endpoint: https://chatgpt.com/backend-api/codex
   📝 Error: Connection error.
   ⏱️  Elapsed: 0.13–0.5s  Context: varies
⏳ Retrying in 2-5s (attempt 1/3)...
```

The sub-second elapsed time suggests the connection is being reset (RST) at around TLS-handshake time. Roughly 95% or more of retries succeed within 1–3 attempts, and no max-retries-exhausted events were observed.

## Hypothesis

Commit `5533ad764` enforces a hard total-elapsed timeout on the Codex Responses auxiliary stream. The default `auxiliary.compression.timeout: 120s` is too tight for compression workloads, where 200K+ token sessions on `gpt-5.4-mini` routinely take 60–180s. Before the commit, slow streams completed. After it, they time out at 120s and raise `TimeoutError`, which is classified as retryable. The outer agent loop then fires a retry cycle that often surfaces as a transport-level `APIConnectionError`, due to the forcible `client.close()` in `_close_client_on_timeout`.
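To make the suspected mechanism concrete, here is a minimal sketch of a total-elapsed deadline wrapped around a chunk stream. It is an illustration only, not the code from `5533ad764`: `enforce_total_timeout`, the `on_timeout` hook, and the injectable `clock` are hypothetical names, and this version only checks the deadline when a chunk arrives.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def enforce_total_timeout(
    chunks: Iterable[T],
    timeout_s: float,
    on_timeout: Callable[[], None] = lambda: None,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[T]:
    """Yield chunks until the stream's total elapsed time exceeds timeout_s.

    The deadline covers the whole stream, not the gap between chunks, so a
    healthy but slow stream is still cut off once the budget is spent.
    """
    deadline = clock() + timeout_s
    for chunk in chunks:
        if clock() > deadline:
            # Hypothetical hook standing in for a forcible client close.
            on_timeout()
            raise TimeoutError(f"stream exceeded {timeout_s}s total")
        yield chunk
```

Under this model a stream that would have finished at 150s is aborted at the first chunk after 120s, and any in-flight or pooled connection torn down by the hook would plausibly fail on the next attempt with a transport error rather than a timeout.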

## Workaround

In `~/.hermes/config.yaml`:

```yaml
auxiliary:
  compression:
    timeout: 300   # was 120
```

`load_config()` is mtime-cached, so no restart is needed. This reduces the compression-driven class of retries. Connection-class retries (covered by #12952) are a separate problem.
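For readers unfamiliar with the term, "mtime-cached" means the file is re-parsed only when its modification time changes, which is why an edit is picked up without a restart. A minimal sketch of that pattern follows; `load_cached` and its `parse` argument are hypothetical and this is not the Hermes `load_config()` implementation.

```python
import os
from typing import Any, Callable, Dict, Tuple

# path -> (mtime in nanoseconds at last parse, parsed value)
_cache: Dict[str, Tuple[int, Any]] = {}


def load_cached(path: str, parse: Callable[[str], Any]) -> Any:
    """Return the parsed file, re-reading it only when its mtime has changed."""
    mtime_ns = os.stat(path).st_mtime_ns
    hit = _cache.get(path)
    if hit is not None and hit[0] == mtime_ns:
        return hit[1]  # file unchanged since last parse
    with open(path, encoding="utf-8") as fh:
        value = parse(fh.read())
    _cache[path] = (mtime_ns, value)
    return value
```

With a loader of this shape, saving `config.yaml` updates its mtime, so the next call after the edit returns the new `timeout` value.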

## Related

- #12952 / PR #12953 — custom keepalive transport breaks chatgpt codex backend (cherry-picked locally; reduces but does not eliminate).
- #16670 / PR #16737 — compression fallback marker after incomplete chunked read.
- PR #21761 — recover Codex stream drops (auxiliary path).

This issue is for the post-v0.13.0 amplification specifically, not the underlying transport instability that pre-existed.

## Asks

1. Confirm whether `5533ad764`'s 120s default is intended for production Codex-OAuth-on-ChatGPT-account workloads, or whether it should be raised.
2. Consider exposing a per-call override or auto-scaling the deadline based on observed stream throughput.
3. Either way, document the workaround for users hitting the regression.
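As a starting point for ask 2, here is a minimal sketch of a deadline scaled by observed stream throughput. The function name, parameters, and all default values are hypothetical choices for illustration; only the 120s floor mirrors the current default.

```python
def scaled_deadline_s(
    expected_output_tokens: int,
    observed_tokens_per_s: float,
    floor_s: float = 120.0,      # current default, used as the minimum
    ceiling_s: float = 600.0,    # hypothetical upper bound
    safety_factor: float = 1.5,  # hypothetical headroom over the estimate
) -> float:
    """Estimate a stream deadline from throughput, clamped to [floor_s, ceiling_s]."""
    if observed_tokens_per_s <= 0:
        # No usable throughput sample yet: fall back to the most lenient bound.
        return ceiling_s
    estimate = expected_output_tokens / observed_tokens_per_s * safety_factor
    return max(floor_s, min(ceiling_s, estimate))
```

For example, an expected 6000 output tokens at an observed 50 tokens/s gives 6000 / 50 × 1.5 = 180s, while small jobs stay at the 120s floor and very large ones are capped at the ceiling.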

