fix(codex): add TTFB watchdog for stalled Codex Responses streams (#31984) by teknium1 · Pull Request #32042 · NousResearch/hermes-agent

teknium1 · 2026-05-25T11:51:11Z

Salvage of #31984 by @adam91holt onto current main. Cherry-pick was clean on top of today's #29507 + Kasun's Codex-timeout merge.

What this PR does

Adds a time-to-first-byte (TTFB) watchdog on the codex_responses streaming path. While zero stream events have arrived, a much shorter cutoff (HERMES_CODEX_TTFB_TIMEOUT_SECONDS, default 45s, 0 disables) kills the connection so the existing retry loop can reconnect. Once any event arrives the stream is healthy and only the wall-clock stale timeout applies — long reasoning generations are never interrupted.

Why

chatgpt.com/backend-api/codex sometimes accepts the TCP/TLS connection but never emits a single stream event. The wall-clock stale timeout has to stay high (180–900s) to cover legitimate long generations, so a dead-on-arrival connection burned that whole window before retry. A fresh reconnect typically succeeds in ~2s.

Reported in community by @adam91holt with direct backend probing — sequential probes hit the no-first-byte hang, immediate reconnects succeed.

How

agent/codex_runtime.py — stamp agent._codex_stream_last_event_ts on every Responses event (any event, so reasoning-only and tool-call-only turns are not mistaken for a stall).
agent/chat_completion_helpers.py — TTFB watchdog in interruptible_api_call, gated to api_mode == 'codex_responses'. Before the first event, kill + retry once elapsed exceeds the cutoff; after the first event, behavior is unchanged.
tests/agent/test_codex_ttfb_watchdog.py — 3 regression tests (kill on no-event, no-kill when events flow, env-disable).

Scope rationale: this is codex-only by design. The chat_completions streaming path already has HERMES_STREAM_READ_TIMEOUT (default 120s) wired into httpx.Timeout(read=...) plus an in-loop stale-chunk detector. Non-stream paths (chat_completions non-stream, anthropic_messages, bedrock_converse) have no intermediate event signal — TTFB is structurally equivalent to the full stale timeout there. So the gap is just the codex stream path, and that's what this fills.

Validation

tests/agent/test_codex_ttfb_watchdog.py                        3 passed
tests/agent/test_non_stream_stale_timeout.py                  12 passed
tests/run_agent/test_run_agent_codex_responses.py             65 passed
tests/run_agent/test_streaming.py                             40 passed
tests/run_agent/test_interrupt_propagation.py                  7 passed
tests/run_agent/test_openai_client_lifecycle.py                5 passed
tests/run_agent/test_codex_xai_oauth_recovery.py              37 passed
------------------------------------------------------------------------
                                                             169 passed

Closes #31984. Authorship preserved via cherry-pick.

Infographic

hclsys · 2026-05-25T12:09:51Z

I reviewed the original #31984 in detail, so confirming this salvage is faithful: the cherry-pick preserves the load-bearing parts — codex_runtime still sets agent._codex_stream_last_event_ts = time.time() on each event (so is None genuinely means 'zero events'), the reset runs before the worker starts (no stale-marker false-positive), gating stays api_mode == "codex_responses" only, and tests/agent/test_codex_ttfb_watchdog.py came across intact. Good etiquette crediting @adam91holt + linking #31984, and the AUTHOR_MAP entry in scripts/release.py is the right housekeeping for a salvage.

Carrying forward the one open note from my #31984 review so it isn't lost in the rebase (still non-blocking): the None-sentinel reset (agent._codex_stream_last_event_ts = None before the worker starts) assumes single-flight per agent — if two _call()s ever overlapped on the same agent, a previous call's worker could set the marker between this call's reset and worker start, masking a real stall. That's not how the turn loop drives it today, so it's theoretical, but a one-line comment noting the single-flight assumption would help a future reader. Logic is sound and unchanged from the version I verified on #31984 — LGTM for the salvage.

github-actions · 2026-05-25T12:14:25Z

🔎 Lint report: `hermes/hermes-f0f87402` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9325 on HEAD, 9320 on base (🆕 +5)

🆕 New issues (3):

Rule	Count
`unresolved-attribute`	1
`unresolved-import`	1
`no-matching-overload`	1

First entries

tests/agent/test_codex_ttfb_watchdog.py:50: [unresolved-attribute] unresolved-attribute: Unresolved attribute `api_mode` on type `AIAgent`
tests/agent/test_codex_ttfb_watchdog.py:24: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/agent/test_codex_ttfb_watchdog.py:29: [no-matching-overload] no-matching-overload: No overload of bound method `MutableMapping.setdefault` matches arguments

✅ Fixed issues: none

Unchanged: 4931 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

The chatgpt.com/backend-api/codex endpoint has an intermittent failure mode where it accepts the connection but never emits a single stream event — the socket just hangs. Direct sequential probing reproduces it (0 events, no HTTP status), and a fresh reconnect then succeeds in ~2s. Today the only guard is the wall-clock stale timeout in interruptible_api_call, so a dead-on-arrival connection is held for the full stale window (90-900s depending on context / config) before the retry loop can reconnect — minutes of wasted wall time per stall, at a rate of ~20% of calls during affected windows. Add a TTFB watchdog scoped to the codex_responses path: - codex_runtime.run_codex_stream stamps agent._codex_stream_last_event_ts on *every* stream event (not just output-text deltas), so reasoning-only and tool-call-only turns are not mistaken for a stall. - interruptible_api_call resets that marker before the worker starts and, while it is still None, kills the connection once elapsed exceeds the TTFB cutoff (default 45s, tunable via HERMES_CODEX_TTFB_TIMEOUT_SECONDS, 0 disables). The raised TimeoutError flows through the existing retry path unchanged. Once any event has arrived the stream is healthy and only the existing wall-clock stale timeout applies, so legitimate long generations are never interrupted. Gated to codex_responses; the chat_completions non-stream, anthropic and bedrock branches have no first-event signal and are untouched. Adds tests/agent/test_codex_ttfb_watchdog.py covering the stall kill, the events-flowing pass-through, and the env-disable path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…32072) All four failures were broken by the security cluster (#10082 / #10133 / #4609 / symlink-reject batch) merging on May 25. They were red on origin/main HEAD when #32042 and #32061 ran, gating PRs that touched unrelated code. 1) tests/hermes_cli/test_update_zip_symlink_reject.py test_update_via_zip_accepts_normal_member called the real _update_via_zip without sandboxing PROJECT_ROOT — so the function's shutil.copytree() actually copied the fake README from the test ZIP over the real repo's README.md, which then made test_readme_mentions_powershell_installer fail in any test run that happened to pick this test up earlier. Mock PROJECT_ROOT to an isolated tmp_path / install_dir, stub subprocess so pip/uv reinstall doesn't actually run, and assert the fake README lands in the sandbox (not the real tree). 2) tests/tools/test_windows_native_support.py test_readme_mentions_powershell_installer was the victim of (1) — nothing wrong with the test itself, the fix in (1) clears it. 3) tests/tools/test_file_read_guards.py test_proc_fd_other_not_blocked called _is_blocked_device('/proc/self/fd/3') expecting False. But _is_blocked_device runs realpath() and on pytest xdist workers fd 3 happens to be dup'd to /dev/urandom (because the worker subprocess inherits open fds from pytest's collection pipe machinery). Switch to the lower-level _is_blocked_device_path which is the path-pattern check the test actually means to exercise; realpath-resolution coverage already lives in test_symlink_to_blocked_device_is_blocked. 4) tests/tools/test_transcription_tools.py Module installed a faster_whisper stub via sys.modules without setting __spec__, then later @pytest.mark.skipif called importlib.util.find_spec('faster_whisper') which raises 'ValueError: __spec__ is None' for modules with a None spec attr. Set __spec__ on the stub to a real ModuleSpec. Validation: 195/195 green across the 4 affected files.

…ousResearch#32072) All four failures were broken by the security cluster (NousResearch#10082 / NousResearch#10133 / NousResearch#4609 / symlink-reject batch) merging on May 25. They were red on origin/main HEAD when NousResearch#32042 and NousResearch#32061 ran, gating PRs that touched unrelated code. 1) tests/hermes_cli/test_update_zip_symlink_reject.py test_update_via_zip_accepts_normal_member called the real _update_via_zip without sandboxing PROJECT_ROOT — so the function's shutil.copytree() actually copied the fake README from the test ZIP over the real repo's README.md, which then made test_readme_mentions_powershell_installer fail in any test run that happened to pick this test up earlier. Mock PROJECT_ROOT to an isolated tmp_path / install_dir, stub subprocess so pip/uv reinstall doesn't actually run, and assert the fake README lands in the sandbox (not the real tree). 2) tests/tools/test_windows_native_support.py test_readme_mentions_powershell_installer was the victim of (1) — nothing wrong with the test itself, the fix in (1) clears it. 3) tests/tools/test_file_read_guards.py test_proc_fd_other_not_blocked called _is_blocked_device('/proc/self/fd/3') expecting False. But _is_blocked_device runs realpath() and on pytest xdist workers fd 3 happens to be dup'd to /dev/urandom (because the worker subprocess inherits open fds from pytest's collection pipe machinery). Switch to the lower-level _is_blocked_device_path which is the path-pattern check the test actually means to exercise; realpath-resolution coverage already lives in test_symlink_to_blocked_device_is_blocked. 4) tests/tools/test_transcription_tools.py Module installed a faster_whisper stub via sys.modules without setting __spec__, then later @pytest.mark.skipif called importlib.util.find_spec('faster_whisper') which raises 'ValueError: __spec__ is None' for modules with a None spec attr. Set __spec__ on the stub to a real ModuleSpec. Validation: 195/195 green across the 4 affected files.

…ousResearch#32072) All four failures were broken by the security cluster (NousResearch#10082 / NousResearch#10133 / NousResearch#4609 / symlink-reject batch) merging on May 25. They were red on origin/main HEAD when NousResearch#32042 and NousResearch#32061 ran, gating PRs that touched unrelated code. 1) tests/hermes_cli/test_update_zip_symlink_reject.py test_update_via_zip_accepts_normal_member called the real _update_via_zip without sandboxing PROJECT_ROOT — so the function's shutil.copytree() actually copied the fake README from the test ZIP over the real repo's README.md, which then made test_readme_mentions_powershell_installer fail in any test run that happened to pick this test up earlier. Mock PROJECT_ROOT to an isolated tmp_path / install_dir, stub subprocess so pip/uv reinstall doesn't actually run, and assert the fake README lands in the sandbox (not the real tree). 2) tests/tools/test_windows_native_support.py test_readme_mentions_powershell_installer was the victim of (1) — nothing wrong with the test itself, the fix in (1) clears it. 3) tests/tools/test_file_read_guards.py test_proc_fd_other_not_blocked called _is_blocked_device('/proc/self/fd/3') expecting False. But _is_blocked_device runs realpath() and on pytest xdist workers fd 3 happens to be dup'd to /dev/urandom (because the worker subprocess inherits open fds from pytest's collection pipe machinery). Switch to the lower-level _is_blocked_device_path which is the path-pattern check the test actually means to exercise; realpath-resolution coverage already lives in test_symlink_to_blocked_device_is_blocked. 4) tests/tools/test_transcription_tools.py Module installed a faster_whisper stub via sys.modules without setting __spec__, then later @pytest.mark.skipif called importlib.util.find_spec('faster_whisper') which raises 'ValueError: __spec__ is None' for modules with a None spec attr. Set __spec__ on the stub to a real ModuleSpec. Validation: 195/195 green across the 4 affected files. #AI commit#

…ousResearch#32072) All four failures were broken by the security cluster (NousResearch#10082 / NousResearch#10133 / NousResearch#4609 / symlink-reject batch) merging on May 25. They were red on origin/main HEAD when NousResearch#32042 and NousResearch#32061 ran, gating PRs that touched unrelated code. 1) tests/hermes_cli/test_update_zip_symlink_reject.py test_update_via_zip_accepts_normal_member called the real _update_via_zip without sandboxing PROJECT_ROOT — so the function's shutil.copytree() actually copied the fake README from the test ZIP over the real repo's README.md, which then made test_readme_mentions_powershell_installer fail in any test run that happened to pick this test up earlier. Mock PROJECT_ROOT to an isolated tmp_path / install_dir, stub subprocess so pip/uv reinstall doesn't actually run, and assert the fake README lands in the sandbox (not the real tree). 2) tests/tools/test_windows_native_support.py test_readme_mentions_powershell_installer was the victim of (1) — nothing wrong with the test itself, the fix in (1) clears it. 3) tests/tools/test_file_read_guards.py test_proc_fd_other_not_blocked called _is_blocked_device('/proc/self/fd/3') expecting False. But _is_blocked_device runs realpath() and on pytest xdist workers fd 3 happens to be dup'd to /dev/urandom (because the worker subprocess inherits open fds from pytest's collection pipe machinery). Switch to the lower-level _is_blocked_device_path which is the path-pattern check the test actually means to exercise; realpath-resolution coverage already lives in test_symlink_to_blocked_device_is_blocked. 4) tests/tools/test_transcription_tools.py Module installed a faster_whisper stub via sys.modules without setting __spec__, then later @pytest.mark.skipif called importlib.util.find_spec('faster_whisper') which raises 'ValueError: __spec__ is None' for modules with a None spec attr. Set __spec__ on the stub to a real ModuleSpec. Validation: 195/195 green across the 4 affected files.

alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder codex labels May 25, 2026

teknium1 mentioned this pull request May 25, 2026

fix(codex): combine Responses timeout fixes #31938

Closed

adam91holt and others added 2 commits May 25, 2026 05:23

chore(release): map adam91holt for PR #31984 salvage

20e45ab

teknium1 force-pushed the hermes/hermes-f0f87402 branch from 41ef21d to 20e45ab Compare May 25, 2026 12:24

teknium1 merged commit 2b16de0 into main May 25, 2026
24 of 26 checks passed

teknium1 deleted the hermes/hermes-f0f87402 branch May 25, 2026 12:34

teknium1 mentioned this pull request May 25, 2026

fix(tests): four pre-existing flakes from the security cluster merge #32072

Merged

alt-glitch mentioned this pull request May 25, 2026

fix(codex): activity watchdog for codex_responses stale detector #32131

Closed

3 tasks

BrewTestBot mentioned this pull request May 28, 2026

hermes-agent 2026.5.28 Homebrew/homebrew-core#285115

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(codex): add TTFB watchdog for stalled Codex Responses streams (#31984)#32042

fix(codex): add TTFB watchdog for stalled Codex Responses streams (#31984)#32042
teknium1 merged 2 commits into
mainfrom
hermes/hermes-f0f87402

teknium1 commented May 25, 2026 •

edited

Loading

Uh oh!

hclsys commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

teknium1 commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does

Why

How

Validation

Infographic

Uh oh!

hclsys commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔎 Lint report: hermes/hermes-f0f87402 vs origin/main

ruff

ty (type checker)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

teknium1 commented May 25, 2026 •

edited

Loading

github-actions Bot commented May 25, 2026 •

edited

Loading

🔎 Lint report: `hermes/hermes-f0f87402` vs `origin/main`