fix(codex): add time-to-first-byte watchdog for stalled Codex Responses streams #31984

Closed
adam91holt wants to merge 1 commit into NousResearch:main from adam91holt:fix/codex-ttfb-watchdog

Conversation

@adam91holt

Contributor

What does this PR do?

Adds a time-to-first-byte (TTFB) watchdog for the Codex Responses streaming
path so a connection that is accepted but never produces a byte is recovered
in seconds instead of being held for the full wall-clock stale timeout.

The problem. Against the chatgpt.com/backend-api/codex backend we hit an
intermittent failure mode where the endpoint accepts the TCP/TLS connection but
never emits a single stream event — the socket just hangs. It eventually
surfaces as a stale-call kill or as the error "httpx.RemoteProtocolError: peer closed connection without sending complete message body". It reproduces when probing
the backend directly and sequentially (no concurrency): a request hangs with
zero events and no HTTP status, while an immediate fresh reconnect succeeds
within a couple of seconds. So it appears to be backend-side, and a fast
reconnect is the effective recovery.

Why the existing guard isn't enough. interruptible_api_call supervises the
Codex stream with a wall-clock stale detector. That timeout has to stay high
(it also covers legitimate long reasoning generations), so a dead-on-arrival
connection is held for the whole window (90-900s depending on context / config)
before the retry loop can reconnect, wasting minutes of wall time per stall.

The fix. While no stream event has arrived yet, apply a much shorter TTFB
cutoff and kill the connection so the retry loop reconnects promptly. Once any
event arrives the stream is healthy and only the existing wall-clock stale
timeout applies, so long generations are never interrupted. The "bytes flowing"
signal is set on any event (not just output-text deltas), so reasoning-only
and tool-call-only turns are not mistaken for a stall. Gated to the
codex_responses path; the chat_completions non-stream, Anthropic and Bedrock
branches have no first-event signal and are untouched. The raised TimeoutError
flows through the existing retry path unchanged.
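
A minimal sketch of the two-phase supervision described above, assuming a worker thread that stamps a shared marker on each stream event. The names `StreamState`, `supervise`, `ttfb_timeout`, and `stale_timeout` are illustrative and do not come from this PR; the real logic lives in interruptible_api_call, and it also closes the underlying connection, which this sketch omits.

```python
import threading
import time


class StreamState:
    """Shared marker: None until the first stream event arrives."""

    def __init__(self):
        self.last_event_ts = None


def supervise(worker_fn, state, ttfb_timeout, stale_timeout, poll=0.01):
    """Run worker_fn in a thread and raise TimeoutError on a stall.

    Before the first event only ttfb_timeout applies (0 disables it).
    After the first event only the wall-clock stale_timeout applies.
    """
    state.last_event_ts = None  # reset before the worker starts
    start = time.monotonic()
    result = {}

    def run():
        result["value"] = worker_fn(state)

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    while worker.is_alive():
        worker.join(timeout=poll)
        elapsed = time.monotonic() - start
        # Phase 1: no event yet, so the short first-byte cutoff applies.
        if state.last_event_ts is None and 0 < ttfb_timeout < elapsed:
            raise TimeoutError("no first event within TTFB cutoff")
        # Phase 2 (and fallback): the existing wall-clock guard.
        if elapsed > stale_timeout:
            raise TimeoutError("wall-clock stale timeout")
    return result.get("value")
```

Because the caller only sees a TimeoutError in either case, a retry loop that already handles the stale timeout would not need to distinguish the two.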

Related Issue

No tracked issue. Related prior art (different mechanism): #22277 / #22278 target
eager fallback on stream-stall timeouts; this PR instead adds a first-byte
kill on the codex_responses supervisor so the same provider reconnects fast
on the no-first-byte case.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • agent/codex_runtime.py — stamp agent._codex_stream_last_event_ts on every
    Responses stream event so the supervisor can tell whether any byte has arrived.
  • agent/chat_completion_helpers.py — TTFB watchdog in interruptible_api_call:
    before the first event, kill + retry once elapsed > HERMES_CODEX_TTFB_TIMEOUT_SECONDS
    (default 45, set 0 to disable); after the first event, behaviour is unchanged.
  • tests/agent/test_codex_ttfb_watchdog.py — new regression tests.
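
For illustration, the env-var knob could be read roughly as follows. This is a sketch under assumptions, not the code in agent/chat_completion_helpers.py: the helper name `read_ttfb_timeout` is hypothetical, and falling back to the default on unparsable input is a guess at reasonable behaviour rather than something the PR states.

```python
import os

DEFAULT_TTFB_SECONDS = 45.0


def read_ttfb_timeout(env=None):
    """Return the TTFB cutoff in seconds; 0.0 means the watchdog is off."""
    env = os.environ if env is None else env
    raw = env.get("HERMES_CODEX_TTFB_TIMEOUT_SECONDS")
    if raw is None or not raw.strip():
        return DEFAULT_TTFB_SECONDS
    try:
        value = float(raw)
    except ValueError:
        # Assumption: unparsable input falls back to the default.
        return DEFAULT_TTFB_SECONDS
    # Zero or negative values disable the watchdog.
    return value if value > 0 else 0.0
```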

How to Test

  1. pytest tests/agent/test_codex_ttfb_watchdog.py -v — covers three cases:
    no-event stall is killed quickly with a retryable TimeoutError and a
    codex_ttfb_kill close reason; a stream that emits an event then runs past
    the cutoff is not killed; HERMES_CODEX_TTFB_TIMEOUT_SECONDS=0 disables it.
  2. Regression: pytest tests/agent/test_non_stream_stale_timeout.py tests/run_agent/test_run_agent_codex_responses.py tests/run_agent/test_streaming.py tests/run_agent/test_interrupt_propagation.py tests/run_agent/test_openai_client_lifecycle.py tests/run_agent/test_codex_xai_oauth_recovery.py -v — all pass.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(codex): ...)
  • I searched existing PRs for duplicates
  • My PR contains only changes related to this fix
  • I've run the affected test suites and they pass (see How to Test) — not the full suite
  • I've added tests for my changes
  • I've tested on my platform: Linux

Documentation & Housekeeping

  • No new config keys (env-var-only knob, documented in code) — N/A for cli-config.yaml.example
  • No architecture/workflow change — N/A for CONTRIBUTING.md/AGENTS.md
  • Cross-platform: pure-Python timing/threading, no platform-specific calls
  • No tool description/schema changes — N/A

The chatgpt.com/backend-api/codex endpoint has an intermittent failure mode
where it accepts the connection but never emits a single stream event — the
socket just hangs. Direct sequential probing reproduces it (0 events, no HTTP
status), and a fresh reconnect then succeeds in ~2s. Today the only guard is
the wall-clock stale timeout in interruptible_api_call, so a dead-on-arrival
connection is held for the full stale window (90-900s depending on context /
config) before the retry loop can reconnect — minutes of wasted wall time per
stall, at a rate of ~20% of calls during affected windows.

Add a TTFB watchdog scoped to the codex_responses path:

- codex_runtime.run_codex_stream stamps agent._codex_stream_last_event_ts on
  *every* stream event (not just output-text deltas), so reasoning-only and
  tool-call-only turns are not mistaken for a stall.
- interruptible_api_call resets that marker before the worker starts and, while
  it is still None, kills the connection once elapsed exceeds the TTFB cutoff
  (default 45s, tunable via HERMES_CODEX_TTFB_TIMEOUT_SECONDS, 0 disables). The
  raised TimeoutError flows through the existing retry path unchanged.

Once any event has arrived the stream is healthy and only the existing
wall-clock stale timeout applies, so legitimate long generations are never
interrupted. Gated to codex_responses; the chat_completions non-stream,
anthropic and bedrock branches have no first-event signal and are untouched.

Adds tests/agent/test_codex_ttfb_watchdog.py covering the stall kill, the
events-flowing pass-through, and the env-disable path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alt-glitch added the labels type/bug (Something isn't working), comp/agent (Core agent loop, run_agent.py, prompt builder), codex, and P2 (Medium: degraded but workaround exists) on May 25, 2026
@hclsys

hclsys commented May 25, 2026


Traced the watchdog end-to-end — it's correct and the wiring holds:

  • Marker is set: codex_runtime.py sets agent._codex_stream_last_event_ts = time.time() on each stream event, so the is None check in the detector genuinely means 'zero events so far'. The comment's claim that it advances on any event (so reasoning-only / tool-only turns aren't misread as a stall) checks out.
  • Reset ordering is safe: agent._codex_stream_last_event_ts = None runs before _call_start and before the worker thread starts, so a marker left from a previous call on this agent can't be misread as this call's first byte — which is exactly the failure the reset comment calls out.
  • Gating is right: api_mode == "codex_responses" only; the non-stream chat_completions / anthropic / bedrock branches have no incremental first-event signal, so applying TTFB there would be wrong, and it's correctly excluded.
  • Env override + <= 0 disables + the t.join(timeout=2.0) drain before synthesizing the TimeoutError are all reasonable.

One minor robustness note, non-blocking: the None-sentinel reset is correct for sequential calls on an agent. The only way it could misfire is if a previous call's worker were still draining and set the marker between this call's reset and this call's worker start — but that requires two overlapping _call()s on the same agent, which isn't how the turn loop drives it, so it's theoretical. Worth a one-line comment that the reset assumes single-flight per agent, but not a blocker. Nicely scoped fix with a dedicated regression test — LGTM.
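
The reset ordering and the suggested single-flight comment can be sketched as below. `Agent`, `on_stream_event`, `start_call`, and `no_first_event_yet` are hypothetical stand-ins, not the PR's functions; only the attribute name `_codex_stream_last_event_ts` comes from the change itself.

```python
import threading
import time


class Agent:
    """Stand-in for the real agent object (hypothetical)."""

    def __init__(self):
        self._codex_stream_last_event_ts = None


def on_stream_event(agent):
    # Stamp on every event, not only output-text deltas.
    agent._codex_stream_last_event_ts = time.time()


def start_call(agent, worker):
    # Assumes single-flight per agent: no earlier worker is still draining,
    # so nothing can stamp the marker between this reset and the start below.
    agent._codex_stream_last_event_ts = None
    call_start = time.time()
    thread = threading.Thread(target=worker, args=(agent,), daemon=True)
    thread.start()
    return call_start, thread


def no_first_event_yet(agent):
    # True means zero events have arrived for the current call.
    return agent._codex_stream_last_event_ts is None
```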

daletkc pushed a commit to daletkc/hermes-agent that referenced this pull request May 25, 2026
mathias3 pushed a commit to mathias3/hermes-agent that referenced this pull request May 28, 2026
Bryce-huang pushed a commit to wbkunlun/hermes-agent that referenced this pull request May 29, 2026
mosaiq-systems pushed a commit to mosaiq-systems/hermes-agent that referenced this pull request May 29, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
