Architecture: separate base vs effective context in overflow recovery #9181

@maelrx

Summary

The current MiniMax delta-only overflow bug exposed a broader design issue in Hermes's context-overflow recovery path.

Today, Hermes often infers overflow semantics from free-form provider error text, then mutates context_length directly as part of recovery. That works for many providers, but it conflates:

  • the model's declared/base context window
  • the session's temporary effective context window during recovery
  • provider-specific overflow semantics (input overflow vs output-cap overflow vs subscription-tier limits vs delta-only overflow messages)

This issue tracks the architectural cleanup needed beyond the tactical MiniMax hotfix in #9170.

What Happened

MiniMax returned:

context window exceeds limit (2013)

In that format, 2013 is the overflow delta (how far the request exceeded the window), not the actual context limit.

Hermes correctly failed to parse a real context window from the message, but the generic fallback path treated parsed_limit is None as a reason to probe down to the next generic tier.

That caused an incorrect downgrade from 204800 to 128000, which then affected compression thresholds and subsequent session behavior.
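The failure mode can be sketched in a few lines. This is a minimal illustration, not Hermes's actual code: the tier list, the function names, and the `delta_only` flag are all hypothetical, and only the 204800 to 128000 step is taken from the report above.

```python
from typing import Optional

# Hypothetical generic probe tiers, largest first. The real list may differ.
GENERIC_TIERS = [204_800, 128_000, 64_000, 32_000]


def next_lower_tier(current: int) -> int:
    """Return the largest tier strictly below `current`, or `current` if none."""
    for tier in GENERIC_TIERS:
        if tier < current:
            return tier
    return current


def fallback_generic(current: int, parsed_limit: Optional[int]) -> int:
    """Behavior as described: a missing limit triggers a tier probe."""
    if parsed_limit is None:
        return next_lower_tier(current)
    return parsed_limit


def fallback_delta_aware(
    current: int, parsed_limit: Optional[int], delta_only: bool
) -> int:
    """Assumed fix: a delta-only message says nothing about the real limit,
    so the known context is kept and only compression should follow."""
    if parsed_limit is not None:
        return parsed_limit
    if delta_only:
        return current
    return next_lower_tier(current)
```

With the MiniMax message, `fallback_generic(204_800, None)` yields 128000, while `fallback_delta_aware(204_800, None, delta_only=True)` keeps 204800.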

Root Problem

The deeper issue is that Hermes currently relies on text-pattern heuristics plus a mutable context_length field to represent multiple distinct concepts:

  1. Base model capability
    • What the model/provider can actually support.
  2. Effective runtime budget
    • A temporary reduction applied for this session or this recovery path.
  3. Provider-specific semantics
    • Whether an error means:
      • input too large
      • output cap too large
      • subscription tier gate
      • provider returned only an overflow delta
      • provider returned the real limit

Those concepts should not all be encoded into a single mutable number with recovery logic spread across run_agent.py.

Why This Matters

When context_length is mutated directly, every value derived from it changes as well:

  • compression threshold
  • tail budget
  • summary budget
  • downstream recovery heuristics
  • any status/usage reporting derived from the compressor

That means one misclassified overflow can distort the behavior of the rest of the session.
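To make the knock-on effect concrete, here is a rough sketch of budgets derived from a single context_length. The ratios (80%, 20%, 5%) and the names `Budgets` and `derive_budgets` are assumptions chosen for illustration; Hermes's real compressor may compute these differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budgets:
    compression_threshold: int
    tail_budget: int
    summary_budget: int


def derive_budgets(context_length: int) -> Budgets:
    """Derive per-session budgets from one number (hypothetical ratios)."""
    return Budgets(
        compression_threshold=context_length * 80 // 100,
        tail_budget=context_length * 20 // 100,
        summary_budget=context_length * 5 // 100,
    )


correct = derive_budgets(204_800)
downgraded = derive_budgets(128_000)
```

Under these assumed ratios, the wrong downgrade moves the compression threshold from 163840 to 102400 tokens, so compression would start much earlier for the rest of the session.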

Current Tactical Mitigations

The codebase already contains multiple tactical recoveries for related but different overflow classes, including:

  • output-cap overflow (max_tokens too large) handled without shrinking context
  • Anthropic long-context tier gating handled as a temporary runtime reduction
  • MiniMax delta-only overflow handled as compress-only while preserving known context

These are correct tactical fixes, but they also show that the current recovery architecture is carrying multiple provider-specific semantics inline.

Proposed Roadmap

Phase 1: State hygiene for context windows

Goal: separate base context from temporary reductions.

Deliverables:

  • introduce explicit base/effective context semantics in the context engine/compressor
  • ensure /new and /reset restore the base context window
  • temporary recovery reductions must not overwrite the base context unless the provider confirms a real limit
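One possible shape for the Phase 1 deliverables is sketched below. The class name `ContextWindow` and its methods are hypothetical; the sketch only shows that a temporary reduction and a provider-confirmed limit can be stored separately, so a reset restores the base.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContextWindow:
    base: int                        # what the model/provider actually supports
    temporary: Optional[int] = None  # session-scoped reduction, if any

    @property
    def effective(self) -> int:
        """Budget in force right now: never larger than the base."""
        if self.temporary is None:
            return self.base
        return min(self.base, self.temporary)

    def reduce_temporarily(self, limit: int) -> None:
        """Recovery-path reduction; leaves the base untouched."""
        self.temporary = limit

    def confirm_limit(self, limit: int) -> None:
        """The provider reported a real limit, so the base itself changes."""
        self.base = limit
        self.temporary = None

    def reset(self) -> None:
        """What /new and /reset would call: drop temporary reductions only."""
        self.temporary = None
```

After `reduce_temporarily(128_000)` on a 204800 window, `effective` is 128000 and `base` is still 204800; `reset()` brings `effective` back to 204800.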

Status:

Phase 2: Structured overflow recovery policy

Goal: move overflow semantics out of the main agent loop.

Deliverables:

  • extract a dedicated recovery/policy module for context overflow decisions
  • normalize recovery actions into structured outcomes, e.g.:
    • compress_only
    • shrink_context
    • clamp_output_only
    • tier_downgrade
    • unknown_overflow
  • keep run_agent.py as the orchestrator, not the semantic parser
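As a rough illustration of structured outcomes, the sketch below models the five actions listed above as an enum and applies a decision to a (base, effective) pair. The types `RecoveryDecision` and `apply_decision`, and the exact effect assigned to each action, are assumptions made for the example rather than an agreed design.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class RecoveryAction(Enum):
    COMPRESS_ONLY = "compress_only"
    SHRINK_CONTEXT = "shrink_context"
    CLAMP_OUTPUT_ONLY = "clamp_output_only"
    TIER_DOWNGRADE = "tier_downgrade"
    UNKNOWN_OVERFLOW = "unknown_overflow"


@dataclass(frozen=True)
class RecoveryDecision:
    action: RecoveryAction
    limit: Optional[int] = None  # confirmed or temporary limit, when known


def apply_decision(
    decision: RecoveryDecision, base: int, effective: int
) -> Tuple[int, int]:
    """Return the new (base, effective) pair for a decision.

    Assumed semantics: only SHRINK_CONTEXT with a known limit touches the
    base; TIER_DOWNGRADE is a temporary reduction; everything else leaves
    both values alone and is handled elsewhere (compression, max_tokens).
    """
    if decision.limit is not None:
        if decision.action is RecoveryAction.SHRINK_CONTEXT:
            return decision.limit, min(effective, decision.limit)
        if decision.action is RecoveryAction.TIER_DOWNGRADE:
            return base, min(effective, decision.limit)
    return base, effective
```

The orchestrator would then branch on `decision.action` instead of inspecting provider error strings itself.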

Phase 3: Provider-specific overflow semantics registry

Goal: formalize known provider quirks in one place.

Deliverables:

  • register known provider-specific overflow behaviors:
    • MiniMax delta-only overflow
    • Anthropic long-context tier gating
    • output-cap overflow semantics
    • local/custom endpoint unknown-limit behavior
  • avoid scattering provider-specific string handling across the main loop

Design Constraints

Any cleanup should preserve Hermes's existing architecture and contribution priorities:

  • keep bugfix PRs focused and reviewable
  • preserve plugin compatibility for custom context engines
  • avoid weakening generic parsers just to patch a provider-specific behavior
  • keep provider hardening narrowly scoped unless the semantics are truly generic

Suggested Acceptance Criteria

  • temporary context reductions do not survive /new or /reset
  • provider-confirmed real limits can still update the base context safely
  • overflow classes are represented as structured recovery actions rather than inline string heuristics in the main loop
  • adding a new provider-specific overflow quirk no longer requires touching the core loop directly

Labels

  • P2: Medium, degraded but workaround exists
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • provider/minimax: MiniMax (Anthropic transport)
  • type/refactor: Code restructuring, no behavior change