Overview
Hermes uses single-phase context compression: when the context exceeds 85% of the model window, it summarizes all middle turns with an LLM call. Kilocode uses a two-phase approach that is cheaper and produces better results:
Phase 1 (Prune): Walk backwards through messages. Keep the last 40K tokens of tool outputs untouched. Mark all older tool outputs as "compacted" (elided from context). No LLM call required — just removal. Only prune if >20K tokens can be reclaimed.
Phase 2 (Compact): If pruning isn't enough, run LLM summarization with a structured template producing actionable context, not a narrative blob.
Research Findings
Kilocode's Pruning Phase
Walk backward through message parts.
Skip the last 2 turns entirely (always protected).
Keep 40K tokens (PRUNE_PROTECT) of recent tool call outputs.
Beyond that, set time.compacted=true on old tool outputs.
Protected tools (e.g., "skill") are never pruned.
Only prune if >20K tokens (PRUNE_MINIMUM) would be reclaimed.
The key insight: tool outputs are the bulkiest content in a conversation. A single terminal or search_files result can be 10-50K characters. Pruning these first recovers enormous amounts of context without losing the conversational structure (user messages, assistant reasoning, tool calls are all preserved — only tool results are removed).
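The pruning steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not Kilocode's code: the flat message dicts (`role`, `tool`, `content`, `compacted`), the `estimate_tokens` helper, and its 4-characters-per-token heuristic are all assumptions made for the example.

```python
PRUNE_PROTECT = 40_000       # tokens of recent tool output left untouched
PRUNE_MINIMUM = 20_000       # skip pruning unless more than this is reclaimed
PROTECTED_TOOLS = {"skill"}  # outputs of these tools are never pruned
PROTECTED_TURNS = 2          # the most recent user turns are always skipped


def estimate_tokens(text: str) -> int:
    """Hypothetical estimator: roughly 4 characters per token."""
    return len(text) // 4


def prune(messages: list[dict]) -> int:
    """Flag old tool outputs as compacted in place; return tokens reclaimed."""
    total = 0        # tool-output tokens seen so far, newest first
    reclaimable = 0  # tokens that fall outside the protected window
    to_prune = []
    turns = 0

    for msg in reversed(messages):
        if msg["role"] == "user":
            turns += 1
        if turns < PROTECTED_TURNS:
            continue  # still inside the last two turns
        if msg["role"] != "tool" or msg.get("tool") in PROTECTED_TOOLS:
            continue  # user and assistant messages are never touched
        estimate = estimate_tokens(msg["content"])
        total += estimate
        if total > PRUNE_PROTECT:
            reclaimable += estimate
            to_prune.append(msg)

    if reclaimable <= PRUNE_MINIMUM:
        return 0  # not worth it; leave everything as is
    for msg in to_prune:
        msg["compacted"] = True
    return reclaimable
```

Because only a flag is set, the message list keeps its shape; whatever builds the prompt would then be responsible for leaving out the content of flagged outputs.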
Kilocode's Structured Compaction Template
When pruning isn't enough, the compaction agent uses this template:
## Goal
[What the user is trying to accomplish]
## Instructions
[Standing instructions from the user]
## Discoveries
[Key findings, relevant code, data discovered]
## Accomplished
[What has been completed so far]
## Relevant files/directories
[Files and paths that matter for the ongoing task]
After compaction, Kilocode injects the prompt: "Continue if you have next steps, or stop and ask for clarification."
Current Hermes Approach
agent/context_compressor.py:
Single phase: protect first 3 + last 4 turns, summarize middle with LLM
Generic prompt: "Summarize these conversation turns concisely..." covering actions, results, decisions, data
Produces a [CONTEXT SUMMARY] narrative blob
No tool-output-specific pruning
No structured template for resumption
Implementation Plan
Classification
Core codebase change to agent/context_compressor.py and run_agent.py. Not a skill or tool.
Phase 1: Tool Output Pruning (no LLM call)
Add a _prune_tool_outputs() method that runs BEFORE the current LLM-based compression:
PRUNE_PROTECT_TOKENS = 40_000  # Keep last 40K tokens of tool outputs
PRUNE_MINIMUM_TOKENS = 20_000  # Only prune if we reclaim >20K tokens
NEVER_PRUNE_TOOLS = {"clarify", "memory", "skill_view", "todo"}

def _prune_tool_outputs(self, messages: list) -> tuple[list, int]:
    """Remove old tool outputs while preserving recent ones.

    Returns (pruned_messages, tokens_saved).
    """
    # Walk backward, accumulate tool output token estimates
    # After PRUNE_PROTECT reached, replace old tool content with
    # "[Tool output pruned - was N chars]"
    ...
Integrate before the existing compression check:
if self.should_compress_preflight(messages):
    messages, saved = self._prune_tool_outputs(messages)
    if self.should_compress_preflight(messages):
        # Still over threshold - do full LLM compression
        messages = self.compress(messages)
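One possible body for the pruning step is sketched below as a standalone function. It assumes OpenAI-style chat messages, where a tool result has `role: "tool"` and a `tool_call_id`, and the tool name lives on the assistant message's `tool_calls` entry; Hermes's real message shape may differ. The helpers `estimate_tokens` and `tool_names_by_call_id` and the 4-characters-per-token estimate are hypothetical.

```python
PRUNE_PROTECT_TOKENS = 40_000
PRUNE_MINIMUM_TOKENS = 20_000
NEVER_PRUNE_TOOLS = {"clarify", "memory", "skill_view", "todo"}
PRUNED_PREFIX = "[Tool output pruned"


def estimate_tokens(text: str) -> int:
    """Hypothetical estimator: roughly 4 characters per token."""
    return len(text) // 4


def tool_names_by_call_id(messages: list[dict]) -> dict[str, str]:
    """Map each tool_call_id to the tool name declared by the assistant."""
    names = {}
    for msg in messages:
        for call in msg.get("tool_calls") or []:
            names[call["id"]] = call["function"]["name"]
    return names


def prune_tool_outputs(messages: list[dict]) -> tuple[list[dict], int]:
    """Return (pruned_messages, tokens_saved) without mutating the input."""
    names = tool_names_by_call_id(messages)
    seen = 0           # tool-output tokens encountered, newest first
    saved = 0
    replacements = {}  # message index -> placeholder text

    for index in range(len(messages) - 1, -1, -1):
        msg = messages[index]
        content = msg.get("content")
        if msg.get("role") != "tool" or not isinstance(content, str):
            continue
        if names.get(msg.get("tool_call_id")) in NEVER_PRUNE_TOOLS:
            continue
        if content.startswith(PRUNED_PREFIX):
            continue  # already pruned on an earlier pass
        seen += estimate_tokens(content)
        if seen > PRUNE_PROTECT_TOKENS:
            placeholder = f"{PRUNED_PREFIX} - was {len(content)} chars]"
            saved += estimate_tokens(content) - estimate_tokens(placeholder)
            replacements[index] = placeholder

    if saved <= PRUNE_MINIMUM_TOKENS:
        return messages, 0  # below the minimum, so leave the context alone

    pruned = [
        {**msg, "content": replacements[i]} if i in replacements else msg
        for i, msg in enumerate(messages)
    ]
    return pruned, saved
```

Returning a new list keeps the caller's history intact, and every tool message stays in place with its `tool_call_id`, so each tool call still has a matching result.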
Phase 2: Structured Compaction Template
Replace the generic summarization prompt with the structured template:
COMPACTION_TEMPLATE = """Summarize the compressed conversation turns into a structured resumption context.

## Goal
[What is the user trying to accomplish?]

## Standing Instructions
[Any persistent instructions or constraints from the user]

## Key Discoveries
[Important findings, relevant code, data, error messages]

## Accomplished So Far
[What has been completed - be specific about files changed, commands run]

## Relevant Files & Paths
[List all file paths, URLs, and resources that matter]

## Next Steps
[What was the agent about to do when compression triggered?]
"""
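A structured template also makes the model's output checkable before it replaces the original turns. The sketch below splits a returned summary on its `## ` headings and reports any sections that are absent or empty. `parse_sections` and `missing_sections` are hypothetical helpers, and what the caller does with an incomplete summary (retry, or fall back to the unstructured summary) is left open.

```python
import re

REQUIRED_SECTIONS = [
    "Goal",
    "Standing Instructions",
    "Key Discoveries",
    "Accomplished So Far",
    "Relevant Files & Paths",
    "Next Steps",
]

HEADING = re.compile(r"^##\s+(.*\S)\s*$")


def parse_sections(summary: str) -> dict[str, str]:
    """Split a summary into {heading: body} using level-2 markdown headings."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in summary.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


def missing_sections(summary: str) -> list[str]:
    """List required sections that are absent or have an empty body."""
    parsed = parse_sections(summary)
    return [name for name in REQUIRED_SECTIONS if not parsed.get(name)]
```

For example, a summary that contains only a Goal and Next Steps would be reported as missing the other four sections, in template order.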
Phase 3: Adaptive thresholds
Scale PRUNE_PROTECT based on model context window:
128K models: protect last 40K tokens
32K models: protect last 10K tokens
1M models: protect last 100K tokens
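The three data points above fit a simple rule: protect 5/16 (about 31%) of the context window, capped at 100K tokens. The sketch below encodes that rule. Both the ratio and the cap are inferred from the listed examples rather than taken from any existing implementation, and "128K" is read as 128,000 tokens.

```python
PROTECT_CAP_TOKENS = 100_000  # ceiling inferred from the 1M-model example


def prune_protect_tokens(context_window: int) -> int:
    """Scale the protected tool-output budget with the model's context window."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    # 5/16 of the window reproduces the 128K -> 40K and 32K -> 10K examples.
    return min(PROTECT_CAP_TOKENS, context_window * 5 // 16)
```

A lookup table keyed by model family would work equally well; a formula has the advantage of covering window sizes that are not listed.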
Pros & Cons
Pros
Phase 1 is free — no LLM call, just string replacement. Saves both money and latency.
Preserves conversation structure — User messages, assistant reasoning, and tool call names stay intact. Only the bulky output blobs are removed.
Structured template produces actionable resumption context vs a narrative blob that loses task structure
Prompt-cache friendly — pruning could mark old tool results at insertion time, preserving prefix cache
Composable — prune first, then compress only if still needed. Often pruning alone is enough.
Cons
Information loss — Old tool outputs may contain data the agent needs to reference later
Threshold tuning — 40K tokens of protection may be too much for small-context models or too little for large ones
Source: packages/opencode/src/session/compaction.ts
References
agent/context_compressor.py - Current single-phase compression