You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When Hermes Agent sessions run long (complex debugging, multi-step coding, research workflows), the conversation history grows until it hits the model's context window limit. Currently, the agent relies on context probing (2M→32K + cache) and tool result trimming (#415) to manage this, but there is no mechanism to summarize and condense older conversation history while preserving key information.
This is inspired by OpenHands'LLMSummarizingCondenser, which replaces old conversation history with LLM-generated summaries when approaching context limits. Their implementation reportedly reduces API costs by ~2x with no degradation in task performance. This pattern is well-validated in production at scale (68k+ star project, $18.8M funding).
Unlike #415 (Insertion-Time Tool Result Trimming), which focuses on trimming individual tool outputs at the point of insertion, this proposal addresses whole-conversation condensation — periodically summarizing blocks of older messages to free up context window space for new work.
Research Findings
How OpenHands Implements Context Condensation
OpenHands provides multiple condenser strategies in their SDK:
Strategy
Description
LLMSummarizingCondenser
Uses a (typically cheaper) LLM to summarize old message blocks into concise summaries
RecentEventsCondenser
Keeps only the N most recent events, dropping older ones
AmortizedCondenser
Hybrid: keeps recent messages verbatim + summarized older history
NoOpCondenser
Passes everything through (for debugging/benchmarking)
The LLMSummarizingCondenser workflow:
Monitor conversation token count against context window limit
When approaching threshold (e.g., 80% of context window):
a. Take the oldest N messages (excluding system prompt)
b. Send them to a cheap/fast LLM (e.g., GPT-4o-mini, Claude Haiku) with a summarization prompt
c. Replace those N messages with a single "summary" message
d. Continue the conversation with the summary + recent messages
The summary preserves: key decisions made, files modified, errors encountered, current task state
The summary discards: verbose tool outputs, intermediate reasoning, failed attempts
Key design decisions:
Uses a separate, cheaper model for summarization (not the main reasoning model)
Summarization is triggered automatically based on token count, not manually
The summary message is marked as a special type so it's distinguishable from regular messages
Incremental: only summarizes what's needed, doesn't re-summarize everything each time
Preserves system prompt: system prompt is never condensed
Why This Matters for Hermes Agent
Current pain points in long sessions:
Context overflow: Agent loses early conversation context, forgets what was discussed
Prompt cache invalidation: Large context changes invalidate expensive prompt caches
Cost accumulation: Every new message re-sends the entire history, including stale content
Performance degradation: Models perform worse with very long contexts (lost-in-the-middle effect)
Current State in Hermes Agent
What we already have:
Context probing (Mar 5 implementation): Dynamically adjusts context from 2M→32K with caching — this handles the outer limit but doesn't condense within it
Banner context display: Shows current context usage
Information loss: Summarization inevitably loses some detail — important nuances may be dropped
Summarization quality: Depends on the model used; cheap models may produce poor summaries
Complexity: Adds another layer to conversation management, harder to debug
Cost of summarization: Each condensation step costs an LLM call (though with a cheap model)
User confusion: Users may not understand why earlier conversation details are "forgotten"
Prompt cache impact: Condensing changes the conversation prefix, which could invalidate existing caches (need to be thoughtful about timing)
Open Questions
Should condensation be opt-in or on-by-default? (Recommend: on-by-default with easy disable)
What's the right threshold for triggering condensation? (75% of context window? 80%?)
Should the user be notified when condensation occurs? (Recommend: yes, brief banner message)
How does this interact with the existing context probing system? (They should be complementary — probing handles the outer limit, condensation manages the inner content)
Should we automatically save key findings to the memory system before condensing them away?
What model should be used for summarization by default? (The user's configured model, or a hardcoded cheap model?)
How should this interact with sub-agent delegation? Should condensed context be passed to sub-agents?
Overview
When Hermes Agent sessions run long (complex debugging, multi-step coding, research workflows), the conversation history grows until it hits the model's context window limit. Currently, the agent relies on context probing (2M→32K + cache) and tool result trimming (#415) to manage this, but there is no mechanism to summarize and condense older conversation history while preserving key information.
This is inspired by OpenHands' LLMSummarizingCondenser, which replaces old conversation history with LLM-generated summaries when approaching context limits. Their implementation reportedly reduces API costs by ~2x with no degradation in task performance. This pattern is well-validated in production at scale (68k+ star project, $18.8M funding).
Unlike #415 (Insertion-Time Tool Result Trimming), which focuses on trimming individual tool outputs at the point of insertion, this proposal addresses whole-conversation condensation — periodically summarizing blocks of older messages to free up context window space for new work.
Research Findings
How OpenHands Implements Context Condensation
OpenHands provides multiple condenser strategies in their SDK:
LLMSummarizingCondenserRecentEventsCondenserAmortizedCondenserNoOpCondenserThe
LLMSummarizingCondenserworkflow:a. Take the oldest N messages (excluding system prompt)
b. Send them to a cheap/fast LLM (e.g., GPT-4o-mini, Claude Haiku) with a summarization prompt
c. Replace those N messages with a single "summary" message
d. Continue the conversation with the summary + recent messages
Key design decisions:
Why This Matters for Hermes Agent
Current pain points in long sessions:
Current State in Hermes Agent
What we already have:
What we lack:
Relevant files:
hermes_state.py— SessionDB and conversation state managementgateway/session.py— SessionStore, handles message historyImplementation Plan
Skill vs. Tool Classification
This should be a core codebase change (neither skill nor tool) because:
What We'd Need
Phased Rollout
Phase 1: Basic Condensation
RecentEventsCondenserequivalent — keep last N messages, drop older onescontext_condensation: { enabled: true, strategy: "recent", keep_last: 50 }Phase 2: LLM-Based Summarization
LLMSummarizingCondenserequivalentcontext_condensation: { strategy: "summarize", model: "gpt-4o-mini", threshold: 0.75 }Phase 3: Smart Condensation
AmortizedCondenserequivalent — hybrid of recent + summarizedPros & Cons
Pros
Cons / Risks
Open Questions
References