[codex] Guard untrusted context probe shrink #14858

Closed
ztcshen wants to merge 1 commit into NousResearch:main from ztcshen:codex/guard-untrusted-context-probe

ztcshen commented Apr 24, 2026

Summary

Keep the known model context length when Hermes has only guessed the next context probe tier and that guessed tier is already below the estimated size of the prompt being recovered.

This is a narrower alternative to #14499. It follows the same recovery shape as #14743: if the provider error does not contain a trustworthy parseable context limit, do not mutate context_length to an untrusted lower value. Compress with the known window instead, or fail closed if compression cannot reduce the session.
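A minimal sketch of that recovery shape, covering only the guarded path where no guessed tier is applied. It is not the code in run_agent.py: `recover_from_overflow`, `ContextOverflowError`, and the `compress` callable are hypothetical names introduced here for illustration.

```python
from typing import Callable, Optional, Tuple


class ContextOverflowError(RuntimeError):
    """Hypothetical error raised when compression cannot shrink the session."""


def recover_from_overflow(
    prompt_tokens: int,
    known_ctx: int,
    parsed_limit: Optional[int],
    compress: Callable[[int], int],
) -> Tuple[int, int]:
    """Return (context_length, new_prompt_tokens) after one recovery step.

    `compress(budget)` is an assumed callable that returns the prompt size
    after trying to compress the session toward `budget` tokens.
    """
    # Only a limit parsed from the provider error may replace the known window.
    ctx = parsed_limit if parsed_limit is not None else known_ctx
    new_tokens = compress(ctx)
    if new_tokens >= prompt_tokens:
        # No progress: fail closed instead of inventing a smaller window.
        raise ContextOverflowError("compression could not reduce the session")
    return ctx, new_tokens
```

With no parseable limit, the known window is returned unchanged, so a later turn still sees the model's real context length.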

User impact

This is a high-frequency Codex/GPT-5.4 usability issue for my setup. I hit this at least five times per day, and it significantly disrupts normal Hermes usage. Local session history shows the issue was reported repeatedly on 2026-04-23 and 2026-04-24, including after hermes update.

The observed failure mode from the attached evidence in #14499:

Provider: openai-codex Model: gpt-5.4
Endpoint: https://chatgpt.com/backend-api/codex
Context: 492 msgs, ~285,378 tokens
Context length exceeded - stepping down: 1,050,000 -> 128,000 tokens
gpt-5.4 270K/128K

Once Hermes believes the context is 128K, an otherwise recoverable long-session overflow turns into repeated compression/failure against an artificially tiny window.
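To put a number on "artificially tiny", here is a rough calculation of how much of the session compression would have to remove to fit each window. It uses the figures from the log above and ignores any headroom reserved for output tokens; `required_reduction` is a hypothetical helper, not a Hermes function.

```python
def required_reduction(prompt_tokens: int, window: int) -> float:
    """Fraction of the prompt that must be removed to fit inside `window`."""
    if prompt_tokens <= window:
        return 0.0
    return 1.0 - window / prompt_tokens


# Figures taken from the reproduction log.
print(f"{required_reduction(285_378, 128_000):.1%}")    # 55.1%
print(f"{required_reduction(285_378, 1_050_000):.1%}")  # 0.0%
```

Against the guessed 128K tier, more than half of the session has to go before any request can succeed; against the known window, the estimate already fits.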

What changed

  • Added _should_keep_context_length_on_untrusted_probe(...).
  • When parse_context_limit_from_error(...) returns no concrete limit and get_next_probe_tier(...) would shrink below the current prompt estimate, keep old_ctx.
  • Preserve existing behavior when:
    • provider returned a concrete parseable limit;
    • guessed probe tier is still above current prompt estimate;
    • normal non-Codex unknown-provider probe-down applies.
  • Added pure helper tests and a run_conversation regression test modeled after #14743 (fix(agent): preserve MiniMax context length on delta-only overflow, salvage #9170).
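The decision described in the list above can be sketched as a pure predicate. This is an illustration under assumptions, not the helper added in this PR: the function name, parameter names, and the strict `<` boundary are all hypothetical, and the 512,000 tier in the usage lines is an invented value.

```python
from typing import Optional


def should_keep_known_context(
    parsed_limit: Optional[int],
    guessed_tier: int,
    known_ctx: int,
    prompt_estimate: int,
) -> bool:
    """Return True when the known context length should be left untouched."""
    # A concrete limit parsed from the provider error is trusted: use it.
    if parsed_limit is not None:
        return False
    # The guessed tier does not shrink the window: nothing to guard.
    if guessed_tier >= known_ctx:
        return False
    # Keep the known window only if the guess cannot hold the current prompt.
    # (Assumption: a tier exactly equal to the estimate still probes down.)
    return guessed_tier < prompt_estimate


# The case from the screenshot: no parseable limit, guess below the prompt.
print(should_keep_known_context(None, 128_000, 1_050_000, 285_378))     # True
# A guessed tier that still fits the prompt keeps the normal probe-down.
print(should_keep_known_context(None, 512_000, 1_050_000, 285_378))     # False
# A concrete provider limit always wins.
print(should_keep_known_context(272_000, 128_000, 1_050_000, 285_378))  # False
```

Keeping the check free of provider or model names is what makes it easy to cover with pure helper tests.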

Related work

Validation

./venv/bin/python -m pytest tests/test_ctx_halving_fix.py tests/run_agent/test_run_agent.py::TestRunConversation::test_untrusted_probe_below_prompt_keeps_known_context_length tests/run_agent/test_run_agent.py::TestRunConversation::test_minimax_delta_overflow_keeps_known_context_length tests/run_agent/test_run_agent.py::TestRunConversation::test_non_minimax_delta_overflow_still_probes_down -q
30 passed in 7.37s
./venv/bin/python -m py_compile run_agent.py
git diff --check

Providers can report context overflow without a parseable limit. In that case Hermes currently guesses the next probe tier and may shrink a known large model window below the prompt already being recovered, such as gpt-5.4 moving from 1,050,000 to 128,000 for a ~285K prompt.

This keeps the known context length when the guessed tier is below the current prompt estimate, then lets compression either recover or fail without inventing a smaller model window.

Tested: ./venv/bin/python -m pytest tests/test_ctx_halving_fix.py tests/run_agent/test_run_agent.py::TestRunConversation::test_untrusted_probe_below_prompt_keeps_known_context_length tests/run_agent/test_run_agent.py::TestRunConversation::test_minimax_delta_overflow_keeps_known_context_length tests/run_agent/test_run_agent.py::TestRunConversation::test_non_minimax_delta_overflow_still_probes_down -q

Tested: ./venv/bin/python -m py_compile run_agent.py && git diff --check

ztcshen commented Apr 24, 2026

Author

Screenshot evidence from the original reproduction in #14499:

[Screenshot: gpt-5.4 context probe collapse]

The important part is the context estimate and mutation shown there:

Context: 492 msgs, ~285,378 tokens
Context length exceeded - stepping down: 1,050,000 -> 128,000 tokens
gpt-5.4 270K/128K

This is the specific fail-open path this PR guards: when the provider did not give Hermes a parseable context limit, the guessed tier should not overwrite a known 1,050,000-token window with 128,000 while the active prompt estimate is already ~285K. For my local Codex/Hermes workflow this happens at least five times per day and makes long sessions effectively unusable after the first bad probe.

@alt-glitch added the type/bug, P1, comp/agent, and provider/openai labels Apr 24, 2026
@alt-glitch

Collaborator

Related to #14499 (broader probe tier fix) and #9181 (architecture: separate base vs effective context). This is a narrower, safer alternative to #14499.

ztcshen commented Apr 24, 2026

Author

Additional note after switching models today: this is not intended to be GPT-5.4-specific. I have also reproduced the same failure mode after switching to the latest GPT-5.5 path.

The guard in this PR is intentionally model-agnostic. It applies when all of these are true:

  • the provider reports a context overflow;
  • Hermes cannot parse a trustworthy concrete context limit from that provider error;
  • the guessed next probe tier would be below the prompt currently being recovered.

So although the original screenshot evidence shows gpt-5.4 stepping from 1,050,000 to 128,000, the same protection should apply to GPT-5.5 or any future large-context model that hits the same untrusted probe-shrink path.
