Skip to content

Feature: LLM-Based Context Condensation for Long Sessions #480

@teknium1

Description

@teknium1

Overview

When Hermes Agent sessions run long (complex debugging, multi-step coding, research workflows), the conversation history grows until it hits the model's context window limit. Currently, the agent relies on context probing (2M→32K + cache) and tool result trimming (#415) to manage this, but there is no mechanism to summarize and condense older conversation history while preserving key information.

This is inspired by OpenHands' LLMSummarizingCondenser, which replaces old conversation history with LLM-generated summaries when approaching context limits. Their implementation reportedly reduces API costs by ~2x with no degradation in task performance. This pattern is well-validated in production at scale (68k+ star project, $18.8M funding).

Unlike #415 (Insertion-Time Tool Result Trimming), which focuses on trimming individual tool outputs at the point of insertion, this proposal addresses whole-conversation condensation — periodically summarizing blocks of older messages to free up context window space for new work.


Research Findings

How OpenHands Implements Context Condensation

OpenHands provides multiple condenser strategies in their SDK:

Strategy Description
LLMSummarizingCondenser Uses a (typically cheaper) LLM to summarize old message blocks into concise summaries
RecentEventsCondenser Keeps only the N most recent events, dropping older ones
AmortizedCondenser Hybrid: keeps recent messages verbatim + summarized older history
NoOpCondenser Passes everything through (for debugging/benchmarking)

The LLMSummarizingCondenser workflow:

  1. Monitor conversation token count against context window limit
  2. When approaching threshold (e.g., 80% of context window):
    a. Take the oldest N messages (excluding system prompt)
    b. Send them to a cheap/fast LLM (e.g., GPT-4o-mini, Claude Haiku) with a summarization prompt
    c. Replace those N messages with a single "summary" message
    d. Continue the conversation with the summary + recent messages
  3. The summary preserves: key decisions made, files modified, errors encountered, current task state
  4. The summary discards: verbose tool outputs, intermediate reasoning, failed attempts

Key design decisions:

  • Uses a separate, cheaper model for summarization (not the main reasoning model)
  • Summarization is triggered automatically based on token count, not manually
  • The summary message is marked as a special type so it's distinguishable from regular messages
  • Incremental: only summarizes what's needed, doesn't re-summarize everything each time
  • Preserves system prompt: system prompt is never condensed

Why This Matters for Hermes Agent

Current pain points in long sessions:

  1. Context overflow: Agent loses early conversation context, forgets what was discussed
  2. Prompt cache invalidation: Large context changes invalidate expensive prompt caches
  3. Cost accumulation: Every new message re-sends the entire history, including stale content
  4. Performance degradation: Models perform worse with very long contexts (lost-in-the-middle effect)

Current State in Hermes Agent

What we already have:

  • Context probing (Mar 5 implementation): Dynamically adjusts context from 2M→32K with caching — this handles the outer limit but doesn't condense within it
  • Banner context display: Shows current context usage
  • Tool result trimming (Feature: Insertion-Time Tool Result Trimming — Cache-Friendly Context Management #415): Trims individual tool outputs at insertion — complementary, not overlapping
  • Memory system: Persistent memory across sessions — handles cross-session recall but not within-session conversation management
  • Session search: Recalls past sessions — also cross-session, not within-session

What we lack:

  • No mechanism to summarize/condense the active conversation history
  • No way to use a cheaper model for summarization while keeping the main model for reasoning
  • No automatic trigger when approaching context limits
  • Long sessions either lose context or become expensive

Relevant files:

  • hermes_state.py — SessionDB and conversation state management
  • gateway/session.py — SessionStore, handles message history
  • Context probing logic (wherever implemented in Mar 5 work)

Implementation Plan

Skill vs. Tool Classification

This should be a core codebase change (neither skill nor tool) because:

  • It requires integration with the conversation state management system
  • It needs access to token counting and context window monitoring
  • It must intercept the message pipeline before LLM calls
  • It needs to invoke a secondary LLM (multi-model routing)

What We'd Need

  1. Token counter: Accurate count of current conversation tokens (may already exist for context probing)
  2. Condensation trigger: Threshold-based activation (e.g., 75% of context window)
  3. Summarization prompt: Carefully crafted prompt that preserves essential context
  4. Cheap model routing: Ability to call a different, cheaper model for summarization
  5. Summary message type: Distinguishable from regular messages in the conversation history
  6. Configuration: User-controllable settings (threshold, model, enable/disable)

Phased Rollout

Phase 1: Basic Condensation

  • Implement RecentEventsCondenser equivalent — keep last N messages, drop older ones
  • Add token counting to conversation state
  • Trigger when conversation exceeds configurable threshold (default: 75% of context window)
  • Simple and deterministic — no LLM call needed
  • Config: context_condensation: { enabled: true, strategy: "recent", keep_last: 50 }

Phase 2: LLM-Based Summarization

  • Implement LLMSummarizingCondenser equivalent
  • Use a configurable cheap model (default: same model, but recommend cheaper)
  • Summarization prompt preserves: task state, decisions, file changes, errors
  • Summary inserted as a marked system-like message
  • Config: context_condensation: { strategy: "summarize", model: "gpt-4o-mini", threshold: 0.75 }

Phase 3: Smart Condensation

  • Implement AmortizedCondenser equivalent — hybrid of recent + summarized
  • Context-aware summarization (weight coding context vs. conversation vs. research differently)
  • Integration with memory system (auto-save important findings to memory before condensing)
  • Metrics: track how much was condensed, what was preserved, cost savings
  • User notification: "Condensed 45 messages into summary to free context space"

Pros & Cons

Pros

  • Longer effective sessions: Users can work on complex tasks without hitting context limits
  • Cost reduction: ~2x reduction in API costs for long sessions (per OpenHands data)
  • Better performance: Avoids "lost in the middle" effect with very long contexts
  • Cache-friendly: Smaller, more stable conversation history improves prompt cache hit rates
  • Proven pattern: Battle-tested in OpenHands at scale
  • Complementary: Works alongside Feature: Insertion-Time Tool Result Trimming — Cache-Friendly Context Management #415 (tool trimming), memory system, and session search

Cons / Risks

  • Information loss: Summarization inevitably loses some detail — important nuances may be dropped
  • Summarization quality: Depends on the model used; cheap models may produce poor summaries
  • Complexity: Adds another layer to conversation management, harder to debug
  • Cost of summarization: Each condensation step costs an LLM call (though with a cheap model)
  • User confusion: Users may not understand why earlier conversation details are "forgotten"
  • Prompt cache impact: Condensing changes the conversation prefix, which could invalidate existing caches (need to be thoughtful about timing)

Open Questions

  • Should condensation be opt-in or on-by-default? (Recommend: on-by-default with easy disable)
  • What's the right threshold for triggering condensation? (75% of context window? 80%?)
  • Should the user be notified when condensation occurs? (Recommend: yes, brief banner message)
  • How does this interact with the existing context probing system? (They should be complementary — probing handles the outer limit, condensation manages the inner content)
  • Should we automatically save key findings to the memory system before condensing them away?
  • What model should be used for summarization by default? (The user's configured model, or a hardcoded cheap model?)
  • How should this interact with sub-agent delegation? Should condensed context be passed to sub-agents?

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions