fix(codex): add time-to-first-byte watchdog for stalled Codex Responses streams #31984
Closed
adam91holt wants to merge 1 commit into
The chatgpt.com/backend-api/codex endpoint has an intermittent failure mode where it accepts the connection but never emits a single stream event: the socket just hangs. Direct sequential probing reproduces it (0 events, no HTTP status), and a fresh reconnect then succeeds in ~2s.

Today the only guard is the wall-clock stale timeout in interruptible_api_call, so a dead-on-arrival connection is held for the full stale window (90-900s depending on context / config) before the retry loop can reconnect. That is minutes of wasted wall time per stall, at a rate of ~20% of calls during affected windows.

Add a TTFB watchdog scoped to the codex_responses path:

- codex_runtime.run_codex_stream stamps agent._codex_stream_last_event_ts on *every* stream event (not just output-text deltas), so reasoning-only and tool-call-only turns are not mistaken for a stall.
- interruptible_api_call resets that marker before the worker starts and, while it is still None, kills the connection once elapsed exceeds the TTFB cutoff (default 45s, tunable via HERMES_CODEX_TTFB_TIMEOUT_SECONDS, 0 disables). The raised TimeoutError flows through the existing retry path unchanged.

Once any event has arrived the stream is healthy and only the existing wall-clock stale timeout applies, so legitimate long generations are never interrupted.

Gated to codex_responses; the chat_completions non-stream, anthropic and bedrock branches have no first-event signal and are untouched.

Adds tests/agent/test_codex_ttfb_watchdog.py covering the stall kill, the events-flowing pass-through, and the env-disable path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
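The environment-variable behaviour described in the commit message (default 45s, tunable, 0 disables) could be read roughly as follows. This is a minimal sketch, not the PR's code: the helper name `read_ttfb_timeout`, the fallback to the default on unparsable input, and the treatment of negative values as "disabled" are all assumptions made for illustration.

```python
import os

# Default stated in the commit message.
DEFAULT_TTFB_SECONDS = 45.0
ENV_VAR = "HERMES_CODEX_TTFB_TIMEOUT_SECONDS"


def read_ttfb_timeout(env=None):
    """Return the TTFB cutoff in seconds, or None when the watchdog is off.

    Hypothetical helper: only the variable name, the 45s default and the
    "0 disables" rule come from the PR text.
    """
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR)
    if raw is None or not raw.strip():
        return DEFAULT_TTFB_SECONDS
    try:
        value = float(raw)
    except ValueError:
        # Assumption: a malformed value falls back to the default.
        return DEFAULT_TTFB_SECONDS
    if not value > 0:
        # 0 disables the watchdog; negatives and NaN are treated the same
        # way here (assumption).
        return None
    return value
```

Returning `None` for "disabled" lets the caller skip the first-byte check with a single truthiness test instead of comparing against a sentinel number.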
Traced the watchdog end-to-end — it's correct and the wiring holds:
One minor robustness note, non-blocking: the |
This was referenced May 25, 2026
teknium1 added a commit that referenced this pull request May 25, 2026
teknium1 added a commit that referenced this pull request May 25, 2026
daletkc pushed a commit to daletkc/hermes-agent that referenced this pull request May 25, 2026
mathias3 pushed a commit to mathias3/hermes-agent that referenced this pull request May 28, 2026
Bryce-huang pushed a commit to wbkunlun/hermes-agent that referenced this pull request May 29, 2026
mosaiq-systems pushed a commit to mosaiq-systems/hermes-agent that referenced this pull request May 29, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
What does this PR do?
Adds a time-to-first-byte (TTFB) watchdog for the Codex Responses streaming
path so a connection that is accepted but never produces a byte is recovered
in seconds instead of being held for the full wall-clock stale timeout.
The problem. Against the `chatgpt.com/backend-api/codex` backend we hit an
intermittent failure mode where the endpoint accepts the TCP/TLS connection but
never emits a single stream event: the socket just hangs. It eventually
surfaces as a stale-call kill or
`httpx.RemoteProtocolError: peer closed connection without sending complete message body`.
It reproduces when probing the backend directly and sequentially (no
concurrency): a request hangs with zero events and no HTTP status, while an
immediate fresh reconnect succeeds within a couple of seconds. So it appears to
be backend-side, and a fast reconnect is the effective recovery.
Why the existing guard isn't enough. `interruptible_api_call` supervises the
Codex stream with a wall-clock stale detector. That timeout has to stay high
(it also covers legitimate long reasoning generations), so a dead-on-arrival
connection is held for the whole window before the retry loop can reconnect,
which wastes many seconds to minutes of wall time per stall.
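As a rough worked example of that cost, the snippet below multiplies the stall rate by the time a dead connection is held. The ~20% stall rate, the 90-900s stale window and the 45s default cutoff are the figures quoted in the commit message; the model itself (every stall is held for exactly the timeout and the first retry succeeds) is a simplifying assumption, and `expected_stall_cost` is a hypothetical name.

```python
def expected_stall_cost(stall_rate, hold_seconds):
    """Expected wall time lost per call to dead-on-arrival connections.

    Simplified model (assumption): a stalled call is held for exactly
    hold_seconds, then one reconnect succeeds.
    """
    return stall_rate * hold_seconds


if __name__ == "__main__":
    stall_rate = 0.20  # ~20% of calls during affected windows
    for label, hold in (("stale window, low end", 90),
                        ("stale window, high end", 900),
                        ("TTFB cutoff, default", 45)):
        cost = expected_stall_cost(stall_rate, hold)
        print(f"{label}: about {cost:.0f}s lost per call on average")
```

Under this model the stale timeout alone costs roughly 18s to 180s per call on average during an affected window, against roughly 9s with a 45s first-byte cutoff.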
The fix. While no stream event has arrived yet, apply a much shorter TTFB
cutoff and kill the connection so the retry loop reconnects promptly. Once any
event arrives the stream is healthy and only the existing wall-clock stale
timeout applies, so long generations are never interrupted. The "bytes flowing"
signal is set on any event (not just output-text deltas), so reasoning-only
and tool-call-only turns are not mistaken for a stall. Gated to the
`codex_responses` path; the chat_completions non-stream, Anthropic and Bedrock
branches have no first-event signal and are untouched. The raised
`TimeoutError` flows through the existing retry path unchanged.
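The two-deadline logic described above can be sketched as a small supervisor loop. This is an illustration under stated assumptions, not the code in `interruptible_api_call`: `StreamState`, `supervise`, `stalled` and `healthy` are hypothetical names, the stale check is modelled as plain elapsed time since start, cancellation is a `threading.Event` rather than a real connection close, and worker exceptions are not propagated.

```python
import threading
import time


class StreamState:
    """Stand-in for the agent; last_event_ts plays the role the PR gives
    to agent._codex_stream_last_event_ts."""

    def __init__(self):
        self.last_event_ts = None


def supervise(worker, state, ttfb_timeout, stale_timeout, poll=0.01):
    """Run worker(state, cancel) in a thread and enforce two deadlines."""
    state.last_event_ts = None  # reset the marker before the worker starts
    cancel = threading.Event()
    result = {}

    def run():
        result["value"] = worker(state, cancel)

    thread = threading.Thread(target=run, daemon=True)
    start = time.monotonic()
    thread.start()
    while True:
        thread.join(poll)
        if not thread.is_alive():
            break
        elapsed = time.monotonic() - start
        # Short cutoff applies only while no event has been stamped.
        if ttfb_timeout and state.last_event_ts is None and elapsed > ttfb_timeout:
            cancel.set()
            raise TimeoutError("codex_ttfb_kill: no stream event received")
        # Long wall-clock guard applies in every case.
        if elapsed > stale_timeout:
            cancel.set()
            raise TimeoutError("stale call: wall-clock timeout exceeded")
    return result.get("value")


def stalled(state, cancel):
    """Simulated dead-on-arrival stream: never stamps an event."""
    cancel.wait(5)
    return None


def healthy(state, cancel):
    """Simulated live stream: stamps the marker on every event."""
    for _ in range(3):
        state.last_event_ts = time.monotonic()
        time.sleep(0.05)
    return "done"
```

With `ttfb_timeout=0.1`, `supervise(stalled, StreamState(), 0.1, 5)` raises `TimeoutError` shortly after the cutoff, while `supervise(healthy, StreamState(), 0.1, 5)` runs past the cutoff and returns normally because an event was stamped first. Passing a falsy `ttfb_timeout` leaves only the wall-clock guard in place.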
Related Issue
No tracked issue. Related prior art (different mechanism): #22277 / #22278 target
eager fallback on stream-stall timeouts; this PR instead adds a first-byte
kill on the `codex_responses` supervisor so the same provider reconnects fast
on the no-first-byte case.
Type of Change
Changes Made
- `agent/codex_runtime.py`: stamp `agent._codex_stream_last_event_ts` on every Responses stream event so the supervisor can tell whether any byte has arrived.
- `agent/chat_completion_helpers.py`: TTFB watchdog in `interruptible_api_call`. Before the first event, kill + retry once `elapsed > HERMES_CODEX_TTFB_TIMEOUT_SECONDS` (default `45`, set `0` to disable); after the first event, behaviour is unchanged.
- `tests/agent/test_codex_ttfb_watchdog.py`: new regression tests.
How to Test
- `pytest tests/agent/test_codex_ttfb_watchdog.py -v` covers three cases: a no-event stall is killed quickly with a retryable `TimeoutError` and a `codex_ttfb_kill` close reason; a stream that emits an event then runs past the cutoff is not killed; `HERMES_CODEX_TTFB_TIMEOUT_SECONDS=0` disables it.
- `pytest tests/agent/test_non_stream_stale_timeout.py tests/run_agent/test_run_agent_codex_responses.py tests/run_agent/test_streaming.py tests/run_agent/test_interrupt_propagation.py tests/run_agent/test_openai_client_lifecycle.py tests/run_agent/test_codex_xai_oauth_recovery.py -v`: all pass.
Checklist
Code
fix(codex): ...)
Documentation & Housekeeping
cli-config.yaml.example
CONTRIBUTING.md/AGENTS.md