Feature: LLM-Based Context Condensation for Long Sessions #480

## Overview

When Hermes Agent sessions run long (complex debugging, multi-step coding, research workflows), the conversation history grows until it hits the model's context window limit. Currently, the agent relies on context probing (2M→32K + cache) and tool result trimming (#415) to manage this, but there is no mechanism to **summarize and condense older conversation history** while preserving key information.

This is inspired by [OpenHands'](https://github.com/All-Hands-AI/OpenHands) **LLMSummarizingCondenser**, which replaces old conversation history with LLM-generated summaries when the conversation approaches the context limit. Their implementation reportedly cuts API costs roughly in half (~2x) with no degradation in task performance. OpenHands is a widely used project (68k+ stars, $18.8M funding), so the pattern has seen substantial production use.

Unlike #415 (Insertion-Time Tool Result Trimming), which focuses on trimming individual tool outputs at the point of insertion, this proposal addresses **whole-conversation condensation** — periodically summarizing blocks of older messages to free up context window space for new work.

---

## Research Findings

### How OpenHands Implements Context Condensation

OpenHands provides multiple condenser strategies in their SDK:

| Strategy | Description |
|----------|-------------|
| `LLMSummarizingCondenser` | Uses a (typically cheaper) LLM to summarize old message blocks into concise summaries |
| `RecentEventsCondenser` | Keeps only the N most recent events, dropping older ones |
| `AmortizedCondenser` | Hybrid: keeps recent messages verbatim + summarized older history |
| `NoOpCondenser` | Passes everything through (for debugging/benchmarking) |

**The `LLMSummarizingCondenser` workflow:**
1. Monitor conversation token count against context window limit
2. When approaching the threshold (e.g., 80% of the context window):
   1. Take the oldest N messages (excluding the system prompt)
   2. Send them to a cheap/fast LLM (e.g., GPT-4o-mini, Claude Haiku) with a summarization prompt
   3. Replace those N messages with a single "summary" message
   4. Continue the conversation with the summary + recent messages
3. The summary preserves: key decisions made, files modified, errors encountered, current task state
4. The summary discards: verbose tool outputs, intermediate reasoning, failed attempts
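
The workflow above can be sketched in a few lines of Python. This is a minimal illustration, not OpenHands' or Hermes' actual code: `Message`, `estimate_tokens`, `condense`, the `"summary"` role marker, and the 4-characters-per-token heuristic are all assumptions made for the example, and `summarize` stands in for whatever call reaches the cheaper model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    # "summary" is a hypothetical internal marker; it would need mapping to a
    # role the provider accepts before the history is sent to a model.
    role: str
    content: str


def estimate_tokens(messages: List[Message]) -> int:
    """Crude heuristic (about 4 characters per token), not a real tokenizer."""
    return sum(len(m.content) for m in messages) // 4


def condense(
    messages: List[Message],
    context_window: int,
    summarize: Callable[[List[Message]], str],
    threshold: float = 0.8,
    keep_recent: int = 4,
) -> List[Message]:
    """Replace the oldest non-system messages with a single summary message
    once the estimated token count reaches threshold * context_window."""
    # Step 1: below the threshold, leave the history untouched.
    if estimate_tokens(messages) < threshold * context_window:
        return messages

    # The system prompt is never condensed.
    system = [m for m in messages if m.role == "system"]
    rest = [m for m in messages if m.role != "system"]
    split = len(rest) - keep_recent
    if split <= 0:
        return messages  # nothing old enough to condense

    # Step 2: summarize the oldest block and keep the recent tail verbatim.
    old, recent = rest[:split], rest[split:]
    summary = Message(role="summary", content=summarize(old))
    return system + [summary] + recent


history = [Message("system", "You are Hermes.")]
history += [Message("user", "x" * 400) for _ in range(10)]
condensed = condense(
    history,
    context_window=1000,
    summarize=lambda old: f"{len(old)} earlier messages summarized",
)
print([m.role for m in condensed])
```

On a later pass the previous summary is itself part of the oldest block, so it is folded into the next summary rather than the full original history being summarized again.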

**Key design decisions:**
- Uses a **separate, cheaper model** for summarization (not the main reasoning model)
- Summarization is **triggered automatically** based on token count, not manually
- The summary message is marked as a special type so it's distinguishable from regular messages
- **Incremental**: only summarizes what's needed, doesn't re-summarize everything each time
- **Preserves system prompt**: system prompt is never condensed

### Why This Matters for Hermes Agent

Current pain points in long sessions:
1. **Context overflow**: Agent loses early conversation context, forgets what was discussed
2. **Prompt cache invalidation**: Large context changes invalidate expensive prompt caches
3. **Cost accumulation**: Every new message re-sends the entire history, including stale content
4. **Performance degradation**: Models perform worse with very long contexts (lost-in-the-middle effect)
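
To make the cost-accumulation point concrete, here is a toy calculation. It assumes every turn adds a fixed number of tokens and that the whole history is re-sent as input on each turn. `cumulative_input_tokens` and its parameters are illustrative names, not existing Hermes code, and the model ignores prompt-cache discounts and the cost of the summarization calls themselves.

```python
from typing import Optional


def cumulative_input_tokens(
    turns: int, tokens_per_turn: int, cap: Optional[int] = None
) -> int:
    """Total input tokens sent across a session (toy model).

    Assumes each turn appends `tokens_per_turn` tokens and re-sends the
    full history. If `cap` is set, the history is held at `cap` tokens,
    an idealized stand-in for condensation.
    """
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn
        if cap is not None:
            history = min(history, cap)
        total += history
    return total


uncapped = cumulative_input_tokens(turns=100, tokens_per_turn=1_000)
capped = cumulative_input_tokens(turns=100, tokens_per_turn=1_000, cap=20_000)
print(uncapped)  # 5050000
print(capped)    # 1810000
```

Without a cap, total input tokens grow quadratically with the number of turns. With a cap, they grow linearly once the cap is reached. The specific numbers are made up for illustration and say nothing about real savings.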

---

## Current State in Hermes Agent

**What we already have:**
- **Context probing** (Mar 5 implementation): Dynamically adjusts context from 2M→32K with caching — this handles the *outer* limit but doesn't condense within it
- **Banner context display**: Shows current context usage
- **Tool result trimming** (#415): Trims individual tool outputs at insertion — complementary, not overlapping
- **Memory system**: Persistent memory across sessions — handles *cross-session* recall but not *within-session* conversation management
- **Session search**: Recalls past sessions — also cross-session, not within-session

**What we lack:**
- No mechanism to summarize/condense the active conversation history
- No way to use a cheaper model for summarization while keeping the main model for reasoning
- No automatic trigger when approaching context limits
- Long sessions either lose context or become expensive

**Relevant files:**
- `hermes_state.py` — SessionDB and conversation state management
- `gateway/session.py` — SessionStore, handles message history
- Context probing logic (wherever implemented in Mar 5 work)

---

## Implementation Plan

### Skill vs. Tool Classification

This should be a **core codebase change** (neither skill nor tool) because:
- It requires integration with the conversation state management system
- It needs access to token counting and context window monitoring
- It must intercept the message pipeline before LLM calls
- It needs to invoke a secondary LLM (multi-model routing)

### What We'd Need

1. **Token counter**: Accurate count of current conversation tokens (may already exist for context probing)
2. **Condensation trigger**: Threshold-based activation (e.g., 75% of context window)
3. **Summarization prompt**: Carefully crafted prompt that preserves essential context
4. **Cheap model routing**: Ability to call a different, cheaper model for summarization
5. **Summary message type**: Distinguishable from regular messages in the conversation history
6. **Configuration**: User-controllable settings (threshold, model, enable/disable)
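
As a rough sketch of item 6, the user-facing settings could be collected in one validated object. The class name `CondensationConfig`, its field names, and its defaults are assumptions that mirror the config snippets in the phased rollout below; whether condensation should be enabled by default is still an open question.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class CondensationConfig:
    """Hypothetical settings for the `context_condensation` config section."""

    enabled: bool = True            # assumed default, still an open question
    strategy: str = "recent"        # "recent" (Phase 1) or "summarize" (Phase 2)
    threshold: float = 0.75         # fraction of the context window
    keep_last: int = 50             # messages kept verbatim
    model: Optional[str] = None     # None means: reuse the main model

    def __post_init__(self) -> None:
        if self.strategy not in ("recent", "summarize"):
            raise ValueError(f"unknown strategy: {self.strategy!r}")
        if not 0.0 < self.threshold <= 1.0:
            raise ValueError("threshold must be in (0, 1]")
        if self.keep_last < 1:
            raise ValueError("keep_last must be at least 1")

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "CondensationConfig":
        # Unknown keys raise TypeError, which surfaces typos early.
        return cls(**raw)


cfg = CondensationConfig.from_dict(
    {"strategy": "summarize", "model": "gpt-4o-mini", "threshold": 0.75}
)
print(cfg)
```

Validating at load time keeps a bad threshold or a misspelled strategy from surfacing only when a long session finally reaches the trigger.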

### Phased Rollout

**Phase 1: Basic Condensation**
- Implement `RecentEventsCondenser` equivalent — keep last N messages, drop older ones
- Add token counting to conversation state
- Trigger when conversation exceeds configurable threshold (default: 75% of context window)
- Simple and deterministic — no LLM call needed
- Config: `context_condensation: { enabled: true, strategy: "recent", keep_last: 50 }`
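
A Phase 1 condenser could look roughly like the sketch below. `Message` and `keep_recent_messages` are hypothetical names, and the handling of orphaned tool results is an assumption about how the message history is structured (a tool result is expected to follow the assistant message that requested it). The threshold check that decides when to call it is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str  # "system", "user", "assistant", or "tool"
    content: str


def keep_recent_messages(messages: List[Message], keep_last: int = 50) -> List[Message]:
    """Keep system messages plus the last `keep_last` other messages."""
    system = [m for m in messages if m.role == "system"]
    rest = [m for m in messages if m.role != "system"]
    recent = rest[-keep_last:] if keep_last > 0 else []

    # Assumption: a tool result without its originating assistant message
    # is not useful (and may be rejected by some APIs), so drop any tool
    # results left dangling at the start of the kept window.
    while recent and recent[0].role == "tool":
        recent = recent[1:]
    return system + recent


history = [
    Message("system", "You are Hermes."),
    Message("user", "Run the tests."),
    Message("assistant", "Calling the test runner."),
    Message("tool", "3 passed, 1 failed"),
    Message("user", "Fix the failure."),
    Message("assistant", "Patched utils.py."),
]
trimmed = keep_recent_messages(history, keep_last=3)
print([m.role for m in trimmed])  # ['system', 'user', 'assistant']
```

Because the cut is purely positional, this strategy is deterministic and cheap, but anything older than the window is lost outright rather than summarized.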

**Phase 2: LLM-Based Summarization**
- Implement `LLMSummarizingCondenser` equivalent
- Use a configurable summarization model (default: the user's main model; a cheaper model is recommended)
- Summarization prompt preserves: task state, decisions, file changes, errors
- Summary inserted as a marked system-like message
- Config: `context_condensation: { strategy: "summarize", model: "gpt-4o-mini", threshold: 0.75 }`
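
One possible shape for the summarization prompt is sketched below. The wording of the prompt, the `build_summary_prompt` helper, and the per-message truncation limit are all illustrative assumptions. The preserve and omit lists are taken from this proposal, and the actual model call is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str
    content: str


# Items this proposal says a summary should keep.
PRESERVE = ("task state", "decisions", "file changes", "errors")


def build_summary_prompt(messages: List[Message], max_chars_per_message: int = 500) -> str:
    """Render old messages into a prompt for the (cheaper) summarizer model."""
    lines = []
    for m in messages:
        text = m.content
        # Assumption: clipping long outputs keeps the summarization call cheap.
        if len(text) > max_chars_per_message:
            text = text[:max_chars_per_message] + " [truncated]"
        lines.append(f"{m.role}: {text}")

    bullets = "\n".join(f"- {item}" for item in PRESERVE)
    return (
        "Summarize the conversation excerpt below so that work can continue "
        "without the original messages.\n"
        f"Preserve:\n{bullets}\n"
        "Omit verbose tool output, intermediate reasoning, and failed attempts.\n\n"
        "Conversation:\n" + "\n".join(lines)
    )


prompt = build_summary_prompt(
    [Message("user", "Fix the login bug."), Message("tool", "x" * 600)]
)
print(prompt)
```

Keeping the prompt construction separate from the model call makes it easy to test the prompt in isolation and to swap the summarization model through configuration.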

**Phase 3: Smart Condensation**
- Implement `AmortizedCondenser` equivalent — hybrid of recent + summarized
- Context-aware summarization (weight coding context vs. conversation vs. research differently)
- Integration with memory system (auto-save important findings to memory before condensing)
- Metrics: track how much was condensed, what was preserved, cost savings
- User notification: "Condensed 45 messages into summary to free context space"
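
The metrics and notification items could share one small record per condensation event. `CondensationStats` and its fields are hypothetical, and the banner text simply extends the example message above with a token count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CondensationStats:
    """Hypothetical record of a single condensation event."""

    messages_condensed: int
    tokens_before: int
    tokens_after: int

    @property
    def tokens_freed(self) -> int:
        # Clamp at zero in case a summary ends up longer than what it replaced.
        return max(0, self.tokens_before - self.tokens_after)

    def banner(self) -> str:
        return (
            f"Condensed {self.messages_condensed} messages into summary "
            f"to free context space ({self.tokens_freed} tokens freed)"
        )


stats = CondensationStats(messages_condensed=45, tokens_before=90_000, tokens_after=30_000)
print(stats.banner())
```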

---

## Pros & Cons

### Pros
- **Longer effective sessions**: Users can work on complex tasks without hitting context limits
- **Cost reduction**: API costs for long sessions reportedly drop by about half (the ~2x figure reported by OpenHands)
- **Better performance**: Avoids "lost in the middle" effect with very long contexts
- **Cache-friendly**: Smaller, more stable conversation history improves prompt cache hit rates
- **Proven pattern**: Battle-tested in OpenHands at scale
- **Complementary**: Works alongside #415 (tool trimming), memory system, and session search

### Cons / Risks
- **Information loss**: Summarization inevitably loses some detail — important nuances may be dropped
- **Summarization quality**: Depends on the model used; cheap models may produce poor summaries
- **Complexity**: Adds another layer to conversation management, harder to debug
- **Cost of summarization**: Each condensation step costs an LLM call (though with a cheap model)
- **User confusion**: Users may not understand why earlier conversation details are "forgotten"
- **Prompt cache impact**: Condensing changes the conversation prefix, which could *invalidate* existing caches (need to be thoughtful about timing)

---

## Open Questions

- Should condensation be opt-in or on-by-default? (Recommend: on-by-default with easy disable)
- What's the right threshold for triggering condensation? (75% of context window? 80%?)
- Should the user be notified when condensation occurs? (Recommend: yes, brief banner message)
- How does this interact with the existing context probing system? (They should be complementary — probing handles the outer limit, condensation manages the inner content)
- Should we automatically save key findings to the memory system before condensing them away?
- What model should be used for summarization by default? (The user's configured model, or a hardcoded cheap model?)
- How should this interact with sub-agent delegation? Should condensed context be passed to sub-agents?

---

## References

- OpenHands Architecture: https://docs.openhands.dev/overview/architecture
- OpenHands GitHub (condenser code): https://github.com/All-Hands-AI/OpenHands
- Related issue #415: Insertion-Time Tool Result Trimming
- Related issue #346: Structured Memory System
- Related issue #377: Shared Memory Pools Between Sub-Agents
- Related issue #357: Tree-Structured Sessions with Branching
- OpenHands reports ~2x cost reduction with LLMSummarizingCondenser
- "Lost in the Middle" paper: https://arxiv.org/abs/2307.03172
