fix(codex): add time-to-first-byte watchdog for stalled Codex Responses streams by adam91holt · Pull Request #31984 · NousResearch/hermes-agent

adam91holt · 2026-05-25T09:30:30Z

What does this PR do?

Adds a time-to-first-byte (TTFB) watchdog for the Codex Responses streaming
path so a connection that is accepted but never produces a byte is recovered
in seconds instead of being held for the full wall-clock stale timeout.

The problem. Against the chatgpt.com/backend-api/codex backend we hit an
intermittent failure mode where the endpoint accepts the TCP/TLS connection but
never emits a single stream event; the socket just hangs. It eventually
surfaces as a stale-call kill or as "httpx.RemoteProtocolError: peer closed
connection without sending complete message body". It reproduces when probing
the backend directly and sequentially (no concurrency): a request hangs with
zero events and no HTTP status, while an immediate fresh reconnect succeeds
within a couple of seconds. So it appears to be backend-side, and a fast
reconnect is the effective recovery.

Why the existing guard isn't enough. interruptible_api_call supervises the
Codex stream with a wall-clock stale detector. That timeout has to stay high
(it also covers legitimate long reasoning generations), so a dead-on-arrival
connection is held for the whole window (90-900s depending on context and
config) before the retry loop can reconnect, which wastes minutes of wall time
per stall.

The fix. While no stream event has arrived yet, apply a much shorter TTFB
cutoff and kill the connection so the retry loop reconnects promptly. Once any
event arrives the stream is healthy and only the existing wall-clock stale
timeout applies, so long generations are never interrupted. The "bytes flowing"
signal is set on any event (not just output-text deltas), so reasoning-only
and tool-call-only turns are not mistaken for a stall. Gated to the
codex_responses path; the chat_completions non-stream, Anthropic and Bedrock
branches have no first-event signal and are untouched. The raised TimeoutError
flows through the existing retry path unchanged.
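The two-phase timeout described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: StreamState, supervise and the polling interval are hypothetical names standing in for the agent marker and interruptible_api_call, and the real supervisor also closes the connection before raising.

```python
import threading
import time


class StreamState:
    """Hypothetical stand-in for the per-agent 'last stream event' marker."""

    def __init__(self):
        self.last_event_ts = None  # None means no stream event has arrived yet


def supervise(worker, state, ttfb_timeout, stale_timeout, poll=0.01):
    """Run worker(state) in a thread under a TTFB cutoff and a wall-clock cutoff.

    Illustrative only: a real supervisor would also close the underlying
    connection before raising, and would propagate worker exceptions.
    """
    state.last_event_ts = None  # reset before the worker starts
    result = {}

    def run():
        result["value"] = worker(state)

    thread = threading.Thread(target=run, daemon=True)
    start = time.monotonic()
    thread.start()

    while True:
        thread.join(timeout=poll)
        if not thread.is_alive():
            return result.get("value")
        elapsed = time.monotonic() - start
        # Before the first event, the short TTFB cutoff applies (0 disables it).
        if state.last_event_ts is None and ttfb_timeout > 0 and elapsed > ttfb_timeout:
            raise TimeoutError(f"TTFB: no stream event after {elapsed:.2f}s")
        # The long wall-clock limit always applies, with or without events.
        if elapsed > stale_timeout:
            raise TimeoutError(f"stale: call exceeded {stale_timeout:.2f}s")
```

With ttfb_timeout=45 and stale_timeout=900, a connection that never yields an event is abandoned after about 45 seconds, while a stream that has produced any event can run up to the 900-second limit.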

Related Issue

No tracked issue. Related prior art (different mechanism): #22277 / #22278 target
eager fallback on stream-stall timeouts; this PR instead adds a first-byte
kill on the codex_responses supervisor so the same provider reconnects fast
on the no-first-byte case.

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

- agent/codex_runtime.py: stamp agent._codex_stream_last_event_ts on every
  Responses stream event so the supervisor can tell whether any byte has arrived.
- agent/chat_completion_helpers.py: TTFB watchdog in interruptible_api_call.
  Before the first event, kill and retry once elapsed > HERMES_CODEX_TTFB_TIMEOUT_SECONDS
  (default 45, set 0 to disable); after the first event, behaviour is unchanged.
- tests/agent/test_codex_ttfb_watchdog.py: new regression tests.
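The first two changes can be sketched roughly as below. The helper names (read_ttfb_timeout, consume_stream, FakeAgent) and the simplified event dictionaries are assumptions made for illustration; only the environment variable name, the default of 45, the "0 disables" rule and the marker attribute come from the PR. Falling back to the default on an unparsable value is also an assumption.

```python
import os
import time

TTFB_ENV_VAR = "HERMES_CODEX_TTFB_TIMEOUT_SECONDS"
DEFAULT_TTFB_SECONDS = 45.0


def read_ttfb_timeout(environ=None):
    """Return the TTFB cutoff in seconds, or None when the watchdog is disabled."""
    environ = os.environ if environ is None else environ
    raw = environ.get(TTFB_ENV_VAR, "").strip()
    if not raw:
        return DEFAULT_TTFB_SECONDS
    try:
        value = float(raw)
    except ValueError:
        # Assumption: an unparsable value falls back to the default.
        return DEFAULT_TTFB_SECONDS
    return value if value > 0 else None  # 0 (or a negative value) disables


class FakeAgent:
    """Hypothetical agent object carrying the marker attribute."""

    def __init__(self):
        self._codex_stream_last_event_ts = None


def consume_stream(agent, events):
    """Collect output text, stamping the marker on every event regardless of type."""
    chunks = []
    for event in events:
        # Any event counts as "bytes flowing", not just output-text deltas.
        agent._codex_stream_last_event_ts = time.time()
        if event.get("type") == "response.output_text.delta":
            chunks.append(event.get("delta", ""))
    return "".join(chunks)
```

Stamping inside the loop before any type check is what keeps a reasoning-only or tool-call-only turn from looking like a stall: the marker advances even when no output text is collected.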

How to Test

- pytest tests/agent/test_codex_ttfb_watchdog.py -v covers three cases:
  a no-event stall is killed quickly with a retryable TimeoutError and a
  codex_ttfb_kill close reason; a stream that emits an event and then runs past
  the cutoff is not killed; HERMES_CODEX_TTFB_TIMEOUT_SECONDS=0 disables the watchdog.
- Regression: pytest tests/agent/test_non_stream_stale_timeout.py
  tests/run_agent/test_run_agent_codex_responses.py
  tests/run_agent/test_streaming.py
  tests/run_agent/test_interrupt_propagation.py
  tests/run_agent/test_openai_client_lifecycle.py
  tests/run_agent/test_codex_xai_oauth_recovery.py -v (all pass).

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(codex): ...)
I searched existing PRs for duplicates
My PR contains only changes related to this fix
I've run the affected test suites and they pass (see How to Test) — not the full suite
I've added tests for my changes
I've tested on my platform: Linux

Documentation & Housekeeping

No new config keys (env-var-only knob, documented in code) — N/A for cli-config.yaml.example
No architecture/workflow change — N/A for CONTRIBUTING.md/AGENTS.md
Cross-platform: pure-Python timing/threading, no platform-specific calls
No tool description/schema changes — N/A

The chatgpt.com/backend-api/codex endpoint has an intermittent failure mode where it accepts the connection but never emits a single stream event: the socket just hangs. Direct sequential probing reproduces it (0 events, no HTTP status), and a fresh reconnect then succeeds in ~2s.

Today the only guard is the wall-clock stale timeout in interruptible_api_call, so a dead-on-arrival connection is held for the full stale window (90-900s depending on context / config) before the retry loop can reconnect. That is minutes of wasted wall time per stall, at a rate of ~20% of calls during affected windows.

Add a TTFB watchdog scoped to the codex_responses path:

- codex_runtime.run_codex_stream stamps agent._codex_stream_last_event_ts on *every* stream event (not just output-text deltas), so reasoning-only and tool-call-only turns are not mistaken for a stall.
- interruptible_api_call resets that marker before the worker starts and, while it is still None, kills the connection once elapsed exceeds the TTFB cutoff (default 45s, tunable via HERMES_CODEX_TTFB_TIMEOUT_SECONDS, 0 disables). The raised TimeoutError flows through the existing retry path unchanged.

Once any event has arrived the stream is healthy and only the existing wall-clock stale timeout applies, so legitimate long generations are never interrupted.

Gated to codex_responses; the chat_completions non-stream, anthropic and bedrock branches have no first-event signal and are untouched.

Adds tests/agent/test_codex_ttfb_watchdog.py covering the stall kill, the events-flowing pass-through, and the env-disable path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
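To put the figures in the commit message side by side, here is a rough back-of-the-envelope estimate. It assumes that each attempt stalls independently with the quoted ~20% probability and that the retry loop keeps reconnecting without a cap; neither assumption is stated in the PR, and expected_wasted_seconds is a hypothetical helper.

```python
def expected_wasted_seconds(stall_probability, cutoff_seconds):
    """Expected time lost to stalls before the first healthy attempt.

    Assumes independent attempts (geometric distribution): the expected
    number of stalled attempts before a success is p / (1 - p), and each
    stalled attempt costs one full cutoff.
    """
    if not 0 <= stall_probability < 1:
        raise ValueError("stall_probability must be in [0, 1)")
    return cutoff_seconds * stall_probability / (1 - stall_probability)


scenarios = [
    ("stale timeout, low end", 90),
    ("stale timeout, high end", 900),
    ("TTFB cutoff (default)", 45),
]
for label, cutoff in scenarios:
    wasted = expected_wasted_seconds(0.2, cutoff)
    print(f"{label}: {wasted:.2f}s expected per call")
```

Under these assumptions the expected loss per call drops from somewhere between roughly 22 and 225 seconds to roughly 11 seconds with the default cutoff.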

hclsys · 2026-05-25T10:02:12Z

Traced the watchdog end-to-end — it's correct and the wiring holds:

- Marker is set: codex_runtime.py sets agent._codex_stream_last_event_ts = time.time() on each stream event, so the is None check in the detector genuinely means 'zero events so far'. The comment's claim that it advances on any event (so reasoning-only / tool-only turns aren't misread as a stall) checks out.
- Reset ordering is safe: agent._codex_stream_last_event_ts = None runs before _call_start and before the worker thread starts, so a marker left from a previous call on this agent can't be misread as this call's first byte, which is exactly the failure the reset comment calls out.
- Gating is right: api_mode == "codex_responses" only; the non-stream chat_completions / anthropic / bedrock branches have no incremental first-event signal, so applying TTFB there would be wrong, and it's correctly excluded.
- Env override + <= 0 disables + the t.join(timeout=2.0) drain before synthesizing the TimeoutError are all reasonable.
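The kill-then-drain step mentioned in the last point might look roughly like this. FakeConnection and kill_and_drain are hypothetical; only the codex_ttfb_kill close reason, the 2.0-second join and the synthesized TimeoutError are taken from the PR and the review.

```python
import threading


class FakeConnection:
    """Hypothetical connection whose blocked read is released by close()."""

    def __init__(self):
        self._closed = threading.Event()
        self.close_reason = None

    def read_forever(self):
        self._closed.wait()  # simulates a socket that never yields a byte

    def close(self, reason):
        self.close_reason = reason
        self._closed.set()


def kill_and_drain(connection, worker, reason="codex_ttfb_kill", grace=2.0):
    """Close the connection, give the worker a bounded time to exit, build the error."""
    connection.close(reason)
    worker.join(timeout=grace)  # bounded drain; never blocks longer than `grace`
    return TimeoutError(f"{reason}: no first byte (worker_alive={worker.is_alive()})")
```

Closing first and joining second matters: the close is what unblocks the worker's read, and the bounded join keeps the supervisor from hanging if the worker still does not exit.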

One minor robustness note, non-blocking: the None-sentinel reset is correct for sequential calls on an agent. The only way it could misfire is if a previous call's worker were still draining and set the marker between this call's reset and this call's worker start — but that requires two overlapping _call()s on the same agent, which isn't how the turn loop drives it, so it's theoretical. Worth a one-line comment that the reset assumes single-flight per agent, but not a blocker. Nicely scoped fix with a dedicated regression test — LGTM.
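The review only asks for a comment documenting the single-flight assumption. As a purely hypothetical alternative that is not part of the PR, a small guard could make that assumption fail loudly if two calls ever overlapped on one agent:

```python
import threading


class SingleFlightGuard:
    """Hypothetical guard: at most one in-flight call per agent at a time."""

    def __init__(self):
        self._lock = threading.Lock()

    def __enter__(self):
        # Non-blocking acquire: an overlapping call is a bug, so fail fast
        # instead of waiting for the previous call to finish.
        if not self._lock.acquire(blocking=False):
            raise RuntimeError("overlapping call on the same agent")
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False  # never swallow exceptions from the guarded call
```

Wrapping the marker reset and the worker start in such a guard would turn the theoretical race into an immediate error, at the cost of one extra lock per agent.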


alt-glitch added type/bug Something isn't working comp/agent Core agent loop, run_agent.py, prompt builder codex P2 Medium — degraded but workaround exists labels May 25, 2026

This was referenced May 25, 2026

fix(codex): add TTFB watchdog for stalled Codex Responses streams (#31984) #32042

Merged

fix(codex): combine Responses timeout fixes #31938

Closed

teknium1 added a commit that referenced this pull request May 25, 2026

chore(release): map adam91holt for PR #31984 salvage

20e45ab

teknium1 added a commit that referenced this pull request May 25, 2026

chore(release): map adam91holt for PR #31984 salvage

2b16de0

teknium1 closed this in #32042 May 25, 2026


adam91holt wants to merge 1 commit into NousResearch:main from adam91holt:fix/codex-ttfb-watchdog