Bug description
agent.error_classifier.classify_api_error() can misclassify generic HTTP 400 errors and server disconnects as FailoverReason.context_overflow in explicitly large-context sessions (for example 1M-token Codex/GPT-5.x sessions), even when the prompt is far below the configured context window.
The problematic path is the absolute size/message-count heuristic. On current main, a generic 400 with many messages is classified as context overflow because num_messages > 80, even when approx_tokens is only ~74K against a 1M context window.
Minimal reproduction
from agent.error_classifier import classify_api_error

class FakeHTTP400(Exception):
    status_code = 400
    body = {"error": {"message": "Error"}}

    def __str__(self):
        return "Error"

result = classify_api_error(
    FakeHTTP400(),
    provider="openai-codex",
    model="gpt-5.5",
    approx_tokens=74320,
    context_length=1_000_000,
    num_messages=432,
)
print(result.reason, result.retryable, result.should_compress)
Current result:
FailoverReason.context_overflow True True
Expected result:
FailoverReason.format_error False False
Server disconnects have the same problem when token pressure is low but the message count is high: the absolute num_messages > 200 branch classifies them as context_overflow instead of a transport/timeout condition.
Root cause
Current agent/error_classifier.py has heuristics equivalent to:
# server disconnect path
is_large = approx_tokens > context_length * 0.6 or approx_tokens > 120000 or num_messages > 200
# generic 400 path
is_large = approx_tokens > context_length * 0.4 or approx_tokens > 80000 or num_messages > 80
The absolute fallbacks are reasonable for ~128K/200K context windows, but they are too aggressive for 1M-context sessions. A long session can have hundreds of messages while still being well below the actual context budget.
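To make the failure mode concrete, here is a minimal standalone sketch that evaluates each term of the two heuristics with the numbers from the reproduction. The helper names (generic_400_terms, disconnect_terms) are hypothetical and exist only for illustration; they mirror the expressions quoted above, not the actual code in agent/error_classifier.py.

```python
# Hypothetical helpers that mirror the quoted heuristics term by term.
# They are NOT part of agent/error_classifier.py.

def generic_400_terms(approx_tokens, context_length, num_messages):
    """Return each term of the generic-400 heuristic separately."""
    return {
        "relative (> 40% of window)": approx_tokens > context_length * 0.4,
        "absolute tokens (> 80000)": approx_tokens > 80000,
        "absolute messages (> 80)": num_messages > 80,
    }

def disconnect_terms(approx_tokens, context_length, num_messages):
    """Return each term of the server-disconnect heuristic separately."""
    return {
        "relative (> 60% of window)": approx_tokens > context_length * 0.6,
        "absolute tokens (> 120000)": approx_tokens > 120000,
        "absolute messages (> 200)": num_messages > 200,
    }

# Values taken from the minimal reproduction above.
repro = dict(approx_tokens=74320, context_length=1_000_000, num_messages=432)

generic_terms = generic_400_terms(**repro)
disc_terms = disconnect_terms(**repro)

for label, terms in (("generic 400", generic_terms), ("disconnect", disc_terms)):
    # is_large is the OR of all three terms, as in the quoted expressions.
    print(label, "is_large =", any(terms.values()))
    for name, fired in terms.items():
        print(f"  {name}: {fired}")
```

In both paths only the message-count term fires: the prompt uses roughly 7% of the 1M window, so neither the relative check nor the absolute token check is satisfied.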
User impact
This sends non-context errors into the context-overflow recovery path. In long-context Codex sessions, that can cause unnecessary compression and runtime context probe-down from an explicit 1M window to lower probe tiers (currently 256K/128K depending on branch/version), which can lead to repeated compaction and stale handoff pollution.
Suggested fix
Gate the absolute token/message-count heuristics to smaller context windows, and require relative pressure for large-context models. For example:
# server disconnect path
is_large = approx_tokens > context_length * 0.6 or (
    context_length <= 256000 and (approx_tokens > 120000 or num_messages > 200)
)
# generic 400 path
is_large = approx_tokens > context_length * 0.4 or (
    context_length <= 256000 and (approx_tokens > 80000 or num_messages > 80)
)
This preserves existing behavior for smaller context windows while preventing 1M sessions from being classified as overflow solely because they have many messages.
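A rough way to sanity-check the proposed gating is to run it on a few representative session shapes. The sketch below is self-contained and uses hypothetical function names (is_large_generic_400_gated, is_large_disconnect_gated) plus an assumed constant name for the 256000 cutoff; it only restates the expressions from the suggested fix and is not the real classifier.

```python
# Hypothetical standalone versions of the gated heuristics from the
# suggested fix. Names and the constant are illustrative assumptions.

ABSOLUTE_HEURISTIC_MAX_CONTEXT = 256000  # cutoff proposed in the fix

def is_large_generic_400_gated(approx_tokens, context_length, num_messages):
    # Relative pressure always counts, regardless of window size.
    relative = approx_tokens > context_length * 0.4
    # Absolute fallbacks apply only to smaller context windows.
    absolute = context_length <= ABSOLUTE_HEURISTIC_MAX_CONTEXT and (
        approx_tokens > 80000 or num_messages > 80
    )
    return relative or absolute

def is_large_disconnect_gated(approx_tokens, context_length, num_messages):
    relative = approx_tokens > context_length * 0.6
    absolute = context_length <= ABSOLUTE_HEURISTIC_MAX_CONTEXT and (
        approx_tokens > 120000 or num_messages > 200
    )
    return relative or absolute

# (approx_tokens, context_length, num_messages, description)
cases = [
    (74320, 1_000_000, 432, "1M window, many messages, low token pressure"),
    (450_000, 1_000_000, 432, "1M window, above 40% relative pressure"),
    (74320, 200_000, 432, "200K window, many messages"),
    (10_000, 128_000, 20, "128K window, short session"),
]

for tokens, ctx, msgs, description in cases:
    generic = is_large_generic_400_gated(tokens, ctx, msgs)
    disconnect = is_large_disconnect_gated(tokens, ctx, msgs)
    print(f"{description}: generic_400={generic}, disconnect={disconnect}")
```

The first case is the shape from the reproduction and is no longer treated as large on either path, while the 200K case still trips the message-count fallback as it does today. Cases like these could serve as a starting point for regression tests against the real classifier.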
Related work
Related but not identical:
This issue is specifically about the classifier entering context_overflow too early for large context windows due to absolute message-count/token heuristics.