
Generic 400/disconnect errors misclassified as context_overflow in 1M-context sessions #16351

@JayGwod

Bug description

agent.error_classifier.classify_api_error() can misclassify generic HTTP 400 errors and server disconnects as FailoverReason.context_overflow in explicitly large-context sessions (for example 1M-token Codex/GPT-5.x sessions), even when the prompt is far below the configured context window.

The problematic path is the absolute size/message-count heuristic. On current main, a generic 400 with many messages is classified as context overflow because num_messages > 80, even when approx_tokens is only ~74K against a 1M context window.

Minimal reproduction

from agent.error_classifier import classify_api_error

class FakeHTTP400(Exception):
    status_code = 400
    body = {"error": {"message": "Error"}}
    def __str__(self):
        return "Error"

result = classify_api_error(
    FakeHTTP400(),
    provider="openai-codex",
    model="gpt-5.5",
    approx_tokens=74320,
    context_length=1_000_000,
    num_messages=432,
)

print(result.reason, result.retryable, result.should_compress)

Current result:

FailoverReason.context_overflow True True

Expected result:

FailoverReason.format_error False False

Server disconnects with the same shape (low token pressure, high message count) hit a similar problem: the absolute num_messages > 200 branch classifies them as context_overflow instead of a transport/timeout condition.

Root cause

Current agent/error_classifier.py has heuristics equivalent to:

# server disconnect path
is_large = approx_tokens > context_length * 0.6 or approx_tokens > 120000 or num_messages > 200

# generic 400 path
is_large = approx_tokens > context_length * 0.4 or approx_tokens > 80000 or num_messages > 80

The absolute fallbacks are reasonable for ~128K/200K context windows, but they are too aggressive for 1M-context sessions. A long session can have hundreds of messages while still being well below the actual context budget.
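To make the failure concrete, the following sketch evaluates both quoted heuristics against the values from the minimal reproduction. The helper `current_is_large` and its keyword parameters are hypothetical names introduced here for illustration; they mirror the expressions quoted above, not the actual code in agent/error_classifier.py.

```python
def current_is_large(approx_tokens, context_length, num_messages,
                     *, ratio, abs_tokens, abs_messages):
    """Hypothetical mirror of the heuristic quoted in this issue."""
    return (
        approx_tokens > context_length * ratio   # relative pressure
        or approx_tokens > abs_tokens            # absolute token fallback
        or num_messages > abs_messages           # absolute message-count fallback
    )

# Values taken from the minimal reproduction.
tokens, window, messages = 74_320, 1_000_000, 432

generic_400 = current_is_large(
    tokens, window, messages, ratio=0.4, abs_tokens=80_000, abs_messages=80
)
disconnect = current_is_large(
    tokens, window, messages, ratio=0.6, abs_tokens=120_000, abs_messages=200
)
utilization = tokens / window  # fraction of the context window in use

print(generic_400, disconnect, round(utilization, 4))
```

Both paths evaluate to True only because of the message-count term: the session uses roughly 7.4% of its window, and neither token condition holds.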

User impact

This sends non-context errors into the context-overflow recovery path. In long-context Codex sessions, that can trigger unnecessary compression and a runtime probe-down of the context window from an explicit 1M to lower probe tiers (currently 256K/128K depending on branch/version), which in turn can lead to repeated compaction and stale handoff pollution.

Suggested fix

Gate the absolute token/message-count heuristics to smaller context windows, and require relative pressure for large-context models. For example:

# server disconnect path
is_large = approx_tokens > context_length * 0.6 or (
    context_length <= 256000 and (approx_tokens > 120000 or num_messages > 200)
)

# generic 400 path
is_large = approx_tokens > context_length * 0.4 or (
    context_length <= 256000 and (approx_tokens > 80000 or num_messages > 80)
)

This preserves existing behavior for smaller context windows while preventing 1M sessions from being classified as overflow solely because they have many messages.
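As a rough check of that claim, here is a self-contained sketch of the gated heuristic. `gated_is_large`, `LARGE_WINDOW_CUTOFF`, and the two threshold dictionaries are hypothetical names used only for this illustration; the thresholds and the 256000 cutoff are the ones proposed above.

```python
LARGE_WINDOW_CUTOFF = 256_000  # cutoff proposed in this issue

def gated_is_large(approx_tokens, context_length, num_messages,
                   *, ratio, abs_tokens, abs_messages):
    """Hypothetical helper: absolute fallbacks apply only to smaller windows."""
    relative = approx_tokens > context_length * ratio
    absolute = context_length <= LARGE_WINDOW_CUTOFF and (
        approx_tokens > abs_tokens or num_messages > abs_messages
    )
    return relative or absolute

GENERIC_400 = dict(ratio=0.4, abs_tokens=80_000, abs_messages=80)
DISCONNECT = dict(ratio=0.6, abs_tokens=120_000, abs_messages=200)

# 1M window, reproduction values: no longer treated as large on either path.
repro_400 = gated_is_large(74_320, 1_000_000, 432, **GENERIC_400)
repro_disconnect = gated_is_large(74_320, 1_000_000, 432, **DISCONNECT)

# 128K window with many messages: absolute fallback still fires, as before.
small_window = gated_is_large(30_000, 128_000, 432, **GENERIC_400)

# 1M window under real relative pressure: still detected on the 400 path.
real_pressure = gated_is_large(450_000, 1_000_000, 432, **GENERIC_400)

print(repro_400, repro_disconnect, small_window, real_pressure)
```

Under these assumptions, a regression test for the fix would cover at least these three shapes: a large window with many messages but low token pressure, a small window where the absolute fallbacks should still apply, and a large window with genuine relative pressure.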

Related work

Related but not identical:

This issue is specifically about the classifier entering context_overflow too early for large context windows due to absolute message-count/token heuristics.

Metadata

Labels: P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder), type/bug (Something isn't working)
