Skip to content

Feature: Insertion-Time Tool Result Trimming — Cache-Friendly Context Management #415

@teknium1

Description

@teknium1

Overview

Tool results (file reads, terminal output, web extractions, search results) can be enormous — tens of thousands of characters — but Hermes currently only hard-caps them at 100K chars with head-only cutting (MAX_TOOL_RESULT_CHARS, line 2603 of run_agent.py). Below that 100K cap, results enter the conversation untouched and stay there until the context compressor fires at 85% of the model's context window.

This means a 40K-char grep result or a 25K-char file read sits in full in the message array for the rest of the session, consuming tokens without providing proportional value. By the time compression fires, quality has already degraded from context noise.

Inspired by Utah's pruning system, but redesigned around a critical constraint: prompt caching. Utah retroactively mutates old messages every iteration, which would destroy Anthropic/OpenAI prefix cache hit rates. Instead, this proposal trims tool results at insertion time — before they enter the message array — so cached prefixes are never invalidated.


Research Findings

Utah's Approach (and Why We Can't Copy It)

Utah runs pruneOldToolResults(messages) at the top of every iteration, mutating old tool result messages in-place:

// Soft trim: head+tail for results > 4K chars
// Hard clear: replace all old results when total > 50K chars

This works for Utah because Inngest's step-based execution doesn't use LLM API prefix caching. For Hermes, this is a non-starter. Hermes uses Anthropic prompt caching (prompt_caching.py) which caches conversation prefixes across turns. Mutating any message in the prefix invalidates the cache for all subsequent content, eliminating the ~75% input token cost savings on multi-turn conversations.

The cache-friendly alternative: Trim at insertion time, not retroactively. Once a tool result enters messages, it never changes. The trimmed version IS what gets cached.

Current State in Hermes Agent

In _execute_tool_calls() (line 2599-2610 of run_agent.py):

MAX_TOOL_RESULT_CHARS = 100_000
if len(function_result) > MAX_TOOL_RESULT_CHARS:
    original_len = len(function_result)
    function_result = (
        function_result[:MAX_TOOL_RESULT_CHARS]
        + f"\n\n[Cut: tool response was {original_len:,} chars, "
        f"exceeding the {MAX_TOOL_RESULT_CHARS:,} char limit]"
    )

Problems with current approach:

  • 100K is far too generous — that's ~25K tokens for a single tool result; most models have 128K total context
  • Head-only removal loses the end of the output, which often contains the most important information (final lines of a build log, exit codes, summary sections)
  • No softer threshold — there's no middle ground between "full result" and "100K emergency cap"
  • Individual tools have inconsistent limitsterminal_tool.py has its own limits, web_tools.py summarizes large pages, but there's no unified safety net

The context_compressor.py compression system is the only other defense, but it:

  • Only fires at 85% of context window (reactive, not preventive)
  • Requires an auxiliary LLM call (costs money, adds latency)
  • Summarizes ALL middle messages, not just bloated tool results

Implementation Plan

Skill vs. Tool Classification

This is a core codebase change to run_agent.py (specifically _execute_tool_calls()). Not a skill or tool.

What We'd Need

  1. A head+tail trimming function (utility, ~15 lines)
  2. Configurable threshold below the existing 100K hard cap
  3. Integration at the existing limit point (line 2599-2610)

Phased Rollout

Phase 1: Unified insertion-time trimming

Add a softer trimming tier that runs BEFORE the 100K hard cap:

# Configurable thresholds
TOOL_RESULT_SOFT_TRIM_CHARS = 12_000    # ~3K tokens — trim above this
TOOL_RESULT_HEAD_CHARS = 4_000           # Keep first 4K chars
TOOL_RESULT_TAIL_CHARS = 4_000           # Keep last 4K chars
MAX_TOOL_RESULT_CHARS = 100_000          # Existing hard cap (unchanged)

def _trim_tool_result(result: str, tool_name: str) -> str:
    """Trim large tool results using head+tail strategy.
    
    Preserves the beginning (what was requested, headers, context)
    and end (final output, exit codes, summaries) of tool output.
    """
    if len(result) <= TOOL_RESULT_SOFT_TRIM_CHARS:
        return result
    
    trimmed_chars = len(result) - TOOL_RESULT_HEAD_CHARS - TOOL_RESULT_TAIL_CHARS
    return (
        result[:TOOL_RESULT_HEAD_CHARS]
        + f"\n\n[... {trimmed_chars:,} chars trimmed from {tool_name} output ...]\n\n"
        + result[-TOOL_RESULT_TAIL_CHARS:]
    )

Integration point — replace lines 2599-2610:

# Phase 1: Soft trim large results (head + tail)
function_result = _trim_tool_result(function_result, function_name)

# Phase 2: Hard cap as emergency brake (unchanged)
if len(function_result) > MAX_TOOL_RESULT_CHARS:
    ...

Key property: the tool result is trimmed once, before messages.append(), and never modified again. Prompt caching prefixes remain stable.

Phase 2: Per-tool-type thresholds

Different tools have different information density patterns:

TOOL_TRIM_PROFILES = {
    "terminal": {"soft_trim": 15_000, "head": 2_000, "tail": 8_000},   # tail-heavy: exit codes, final output
    "read_file": {"soft_trim": 10_000, "head": 5_000, "tail": 3_000},  # head-heavy: file start often most relevant
    "search_files": {"soft_trim": 8_000, "head": 4_000, "tail": 4_000}, # balanced: matches spread throughout
    "web_extract": {"soft_trim": 8_000, "head": 4_000, "tail": 2_000},  # head-heavy: summaries at top
    "default": {"soft_trim": 12_000, "head": 4_000, "tail": 4_000},
}

Phase 3: Config exposure and telemetry

  • Add tool_result_trimming config section in cli.py CLI_CONFIG
  • Log trimming events at debug level (tool name, original size, trimmed size)
  • Track cumulative tokens saved per session for observability
  • Consider making the soft threshold adaptive based on remaining context budget

Pros & Cons

Pros

  • Cache-friendly — messages are only written once, never retroactively modified; prompt caching prefixes stay stable
  • Reduces token costs — a 40K-char tool result becomes ~8K at insertion, saving ~8K tokens per bloated result
  • Reduces compression frequency — fewer tokens in messages means the 85% threshold is hit less often, avoiding expensive LLM summarization calls
  • Preserves key information — head+tail captures both the context (what was requested) and the conclusion (exit codes, final output, summaries)
  • Trivial to implement — ~20 lines, fits at the existing limit point
  • Complements existing defenses — sits below the 100K hard cap and above the 85% compression threshold

Cons / Risks

  • Information loss — middle content is discarded; in some cases (e.g., a specific error in a large log) the relevant info is in the middle
  • Threshold tuning — the right TOOL_RESULT_SOFT_TRIM_CHARS depends on the model's context window; 12K chars might be too aggressive for 1M-context models or too lenient for 32K models
  • Interaction with tool-level limits — some tools already limit their own output; double-trimming could over-reduce (mitigation: the outer trim is a no-op if the tool already trimmed below threshold)
  • Less context for complex debugging — when the agent is debugging a large file and needs to reference full output from 5 turns ago, it won't be there

Open Questions

  • Should the soft trim threshold scale with the model's context window? (e.g., min(12_000, context_length * 0.02))
  • Should certain tool names be exempt from trimming? (e.g., clarify, memory, todo — already small, and their full content matters)
  • Should the head/tail ratio be configurable or fixed? (Recommend: fixed defaults with Phase 2 per-tool profiles)
  • Should trimming be logged to the user or just debug-level? (Recommend: debug-level, with a note in the tool result itself via the trimmed indicator)
  • What's the right default threshold? 12K chars (~3K tokens) seems reasonable — that's still a substantial amount of context per tool result

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions