Skip to content

Feature: Two-Phase Context Management — Prune Tool Outputs Before Full Compaction (inspired by Kilocode) #513

@teknium1

Description

@teknium1

Overview

Hermes uses a single-phase context compression: when context exceeds 85% of the model window, summarize all middle turns with an LLM call. Kilocode uses a two-phase approach that is both cheaper and produces better results:

Phase 1 (Prune): Walk backwards through messages. Keep the last 40K tokens of tool outputs untouched. Mark all older tool outputs as "compacted" (elided from context). No LLM call required — just removal. Only prune if >20K tokens can be reclaimed.

Phase 2 (Compact): If pruning isn't enough, run LLM summarization with a structured template producing actionable context, not a narrative blob.

Source: packages/opencode/src/session/compaction.ts


Research Findings

Kilocode's Pruning Phase

Walk backward through message parts.
Skip the last 2 turns entirely (always protected).
Keep 40K tokens (PRUNE_PROTECT) of recent tool call outputs.
Beyond that, set time.compacted=true on old tool outputs.
Protected tools (e.g., "skill") are never pruned.
Only prune if >20K tokens (PRUNE_MINIMUM) would be reclaimed.

The key insight: tool outputs are the bulkiest content in a conversation. A single terminal or search_files result can be 10-50K characters. Pruning these first recovers enormous amounts of context without losing the conversational structure (user messages, assistant reasoning, tool calls are all preserved — only tool results are removed).

Kilocode's Structured Compaction Template

When pruning isn't enough, the compaction agent uses this template:

## Goal
[What the user is trying to accomplish]

## Instructions
[Standing instructions from the user]

## Discoveries
[Key findings, relevant code, data discovered]

## Accomplished
[What has been completed so far]

## Relevant files/directories
[Files and paths that matter for the ongoing task]

After compaction, injects: "Continue if you have next steps, or stop and ask for clarification."

Current Hermes Approach

agent/context_compressor.py:

  • Single phase: protect first 3 + last 4 turns, summarize middle with LLM
  • Generic prompt: "Summarize these conversation turns concisely..." covering actions, results, decisions, data
  • Produces a [CONTEXT SUMMARY] narrative blob
  • No tool-output-specific pruning
  • No structured template for resumption

Implementation Plan

Classification

Core codebase change to agent/context_compressor.py and run_agent.py. Not a skill or tool.

Phase 1: Tool Output Pruning (no LLM call)

Add a _prune_tool_outputs() method that runs BEFORE the current LLM-based compression:

PRUNE_PROTECT_TOKENS = 40_000  # Keep last 40K tokens of tool outputs
PRUNE_MINIMUM_TOKENS = 20_000  # Only prune if we reclaim >20K tokens
NEVER_PRUNE_TOOLS = {"clarify", "memory", "skill_view", "todo"}

def _prune_tool_outputs(self, messages: list) -> tuple[list, int]:
    """Remove old tool outputs while preserving recent ones.
    Returns (pruned_messages, tokens_saved)."""
    # Walk backward, accumulate tool output token estimates
    # After PRUNE_PROTECT reached, replace old tool content with
    # "[Tool output pruned — was N chars]"
    ...

Integrate before the existing compression check:

if self.should_compress_preflight(messages):
    messages, saved = self._prune_tool_outputs(messages)
    if self.should_compress_preflight(messages):
        # Still over threshold — do full LLM compression
        messages = self.compress(messages)

Phase 2: Structured Compaction Template

Replace the generic summarization prompt with the structured template:

COMPACTION_TEMPLATE = """Summarize the compressed conversation turns into a structured resumption context.

## Goal
[What is the user trying to accomplish?]

## Standing Instructions
[Any persistent instructions or constraints from the user]

## Key Discoveries
[Important findings, relevant code, data, error messages]

## Accomplished So Far
[What has been completed — be specific about files changed, commands run]

## Relevant Files & Paths
[List all file paths, URLs, and resources that matter]

## Next Steps
[What was the agent about to do when compression triggered?]
"""

Phase 3: Adaptive thresholds

Scale PRUNE_PROTECT based on model context window:

  • 128K models: protect last 40K tokens
  • 32K models: protect last 10K tokens
  • 1M models: protect last 100K tokens

Pros & Cons

Pros

  • Phase 1 is free — no LLM call, just string replacement. Saves both money and latency.
  • Preserves conversation structure — User messages, assistant reasoning, and tool call names stay intact. Only the bulky output blobs are removed.
  • Structured template produces actionable resumption context vs a narrative blob that loses task structure
  • Prompt-cache friendly — pruning could mark old tool results at insertion time, preserving prefix cache
  • Composable — prune first, then compress only if still needed. Often pruning alone is enough.

Cons

  • Information loss — Old tool outputs may contain data the agent needs to reference later
  • Threshold tuning — 40K tokens of protection may be too much for small-context models or too little for large ones
  • Interaction with Feature: Insertion-Time Tool Result Trimming — Cache-Friendly Context Management #415 — If insertion-time trimming lands first, tool outputs will already be smaller, reducing the need for pruning (but the two are complementary, not conflicting)

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions