Overview
Tool results (file reads, terminal output, web extractions, search results) can be enormous — tens of thousands of characters — but Hermes currently only hard-caps them at 100K chars with head-only cutting (MAX_TOOL_RESULT_CHARS, line 2603 of run_agent.py). Below that 100K cap, results enter the conversation untouched and stay there until the context compressor fires at 85% of the model's context window.
This means a 40K-char grep result or a 25K-char file read sits in full in the message array for the rest of the session, consuming tokens without providing proportional value. By the time compression fires, quality has already degraded from context noise.
Inspired by Utah's pruning system, but redesigned around a critical constraint: prompt caching. Utah retroactively mutates old messages every iteration, which would destroy Anthropic/OpenAI prefix cache hit rates. Instead, this proposal trims tool results at insertion time — before they enter the message array — so cached prefixes are never invalidated.
Research Findings
Utah's Approach (and Why We Can't Copy It)
Utah runs pruneOldToolResults(messages) at the top of every iteration, mutating old tool result messages in-place:
// Soft trim: head+tail for results > 4K chars
// Hard clear: replace all old results when total > 50K chars
This works for Utah because Inngest's step-based execution doesn't use LLM API prefix caching. For Hermes, this is a non-starter. Hermes uses Anthropic prompt caching (prompt_caching.py) which caches conversation prefixes across turns. Mutating any message in the prefix invalidates the cache for all subsequent content, eliminating the ~75% input token cost savings on multi-turn conversations.
The cache-friendly alternative: Trim at insertion time, not retroactively. Once a tool result enters messages, it never changes. The trimmed version IS what gets cached.
Current State in Hermes Agent
In _execute_tool_calls() (line 2599-2610 of run_agent.py):
MAX_TOOL_RESULT_CHARS = 100_000
if len(function_result) > MAX_TOOL_RESULT_CHARS:
original_len = len(function_result)
function_result = (
function_result[:MAX_TOOL_RESULT_CHARS]
+ f"\n\n[Cut: tool response was {original_len:,} chars, "
f"exceeding the {MAX_TOOL_RESULT_CHARS:,} char limit]"
)
Problems with current approach:
- 100K is far too generous — that's ~25K tokens for a single tool result; most models have 128K total context
- Head-only removal loses the end of the output, which often contains the most important information (final lines of a build log, exit codes, summary sections)
- No softer threshold — there's no middle ground between "full result" and "100K emergency cap"
- Individual tools have inconsistent limits —
terminal_tool.py has its own limits, web_tools.py summarizes large pages, but there's no unified safety net
The context_compressor.py compression system is the only other defense, but it:
- Only fires at 85% of context window (reactive, not preventive)
- Requires an auxiliary LLM call (costs money, adds latency)
- Summarizes ALL middle messages, not just bloated tool results
Implementation Plan
Skill vs. Tool Classification
This is a core codebase change to run_agent.py (specifically _execute_tool_calls()). Not a skill or tool.
What We'd Need
- A head+tail trimming function (utility, ~15 lines)
- Configurable threshold below the existing 100K hard cap
- Integration at the existing limit point (line 2599-2610)
Phased Rollout
Phase 1: Unified insertion-time trimming
Add a softer trimming tier that runs BEFORE the 100K hard cap:
# Configurable thresholds
TOOL_RESULT_SOFT_TRIM_CHARS = 12_000 # ~3K tokens — trim above this
TOOL_RESULT_HEAD_CHARS = 4_000 # Keep first 4K chars
TOOL_RESULT_TAIL_CHARS = 4_000 # Keep last 4K chars
MAX_TOOL_RESULT_CHARS = 100_000 # Existing hard cap (unchanged)
def _trim_tool_result(result: str, tool_name: str) -> str:
"""Trim large tool results using head+tail strategy.
Preserves the beginning (what was requested, headers, context)
and end (final output, exit codes, summaries) of tool output.
"""
if len(result) <= TOOL_RESULT_SOFT_TRIM_CHARS:
return result
trimmed_chars = len(result) - TOOL_RESULT_HEAD_CHARS - TOOL_RESULT_TAIL_CHARS
return (
result[:TOOL_RESULT_HEAD_CHARS]
+ f"\n\n[... {trimmed_chars:,} chars trimmed from {tool_name} output ...]\n\n"
+ result[-TOOL_RESULT_TAIL_CHARS:]
)
Integration point — replace lines 2599-2610:
# Phase 1: Soft trim large results (head + tail)
function_result = _trim_tool_result(function_result, function_name)
# Phase 2: Hard cap as emergency brake (unchanged)
if len(function_result) > MAX_TOOL_RESULT_CHARS:
...
Key property: the tool result is trimmed once, before messages.append(), and never modified again. Prompt caching prefixes remain stable.
Phase 2: Per-tool-type thresholds
Different tools have different information density patterns:
TOOL_TRIM_PROFILES = {
"terminal": {"soft_trim": 15_000, "head": 2_000, "tail": 8_000}, # tail-heavy: exit codes, final output
"read_file": {"soft_trim": 10_000, "head": 5_000, "tail": 3_000}, # head-heavy: file start often most relevant
"search_files": {"soft_trim": 8_000, "head": 4_000, "tail": 4_000}, # balanced: matches spread throughout
"web_extract": {"soft_trim": 8_000, "head": 4_000, "tail": 2_000}, # head-heavy: summaries at top
"default": {"soft_trim": 12_000, "head": 4_000, "tail": 4_000},
}
Phase 3: Config exposure and telemetry
- Add
tool_result_trimming config section in cli.py CLI_CONFIG
- Log trimming events at debug level (tool name, original size, trimmed size)
- Track cumulative tokens saved per session for observability
- Consider making the soft threshold adaptive based on remaining context budget
Pros & Cons
Pros
- Cache-friendly — messages are only written once, never retroactively modified; prompt caching prefixes stay stable
- Reduces token costs — a 40K-char tool result becomes ~8K at insertion, saving ~8K tokens per bloated result
- Reduces compression frequency — fewer tokens in messages means the 85% threshold is hit less often, avoiding expensive LLM summarization calls
- Preserves key information — head+tail captures both the context (what was requested) and the conclusion (exit codes, final output, summaries)
- Trivial to implement — ~20 lines, fits at the existing limit point
- Complements existing defenses — sits below the 100K hard cap and above the 85% compression threshold
Cons / Risks
- Information loss — middle content is discarded; in some cases (e.g., a specific error in a large log) the relevant info is in the middle
- Threshold tuning — the right
TOOL_RESULT_SOFT_TRIM_CHARS depends on the model's context window; 12K chars might be too aggressive for 1M-context models or too lenient for 32K models
- Interaction with tool-level limits — some tools already limit their own output; double-trimming could over-reduce (mitigation: the outer trim is a no-op if the tool already trimmed below threshold)
- Less context for complex debugging — when the agent is debugging a large file and needs to reference full output from 5 turns ago, it won't be there
Open Questions
- Should the soft trim threshold scale with the model's context window? (e.g.,
min(12_000, context_length * 0.02))
- Should certain tool names be exempt from trimming? (e.g.,
clarify, memory, todo — already small, and their full content matters)
- Should the head/tail ratio be configurable or fixed? (Recommend: fixed defaults with Phase 2 per-tool profiles)
- Should trimming be logged to the user or just debug-level? (Recommend: debug-level, with a note in the tool result itself via the trimmed indicator)
- What's the right default threshold? 12K chars (~3K tokens) seems reasonable — that's still a substantial amount of context per tool result
References
Overview
Tool results (file reads, terminal output, web extractions, search results) can be enormous — tens of thousands of characters — but Hermes currently only hard-caps them at 100K chars with head-only cutting (
MAX_TOOL_RESULT_CHARS, line 2603 ofrun_agent.py). Below that 100K cap, results enter the conversation untouched and stay there until the context compressor fires at 85% of the model's context window.This means a 40K-char
grepresult or a 25K-char file read sits in full in the message array for the rest of the session, consuming tokens without providing proportional value. By the time compression fires, quality has already degraded from context noise.Inspired by Utah's pruning system, but redesigned around a critical constraint: prompt caching. Utah retroactively mutates old messages every iteration, which would destroy Anthropic/OpenAI prefix cache hit rates. Instead, this proposal trims tool results at insertion time — before they enter the message array — so cached prefixes are never invalidated.
Research Findings
Utah's Approach (and Why We Can't Copy It)
Utah runs
pruneOldToolResults(messages)at the top of every iteration, mutating old tool result messages in-place:This works for Utah because Inngest's step-based execution doesn't use LLM API prefix caching. For Hermes, this is a non-starter. Hermes uses Anthropic prompt caching (
prompt_caching.py) which caches conversation prefixes across turns. Mutating any message in the prefix invalidates the cache for all subsequent content, eliminating the ~75% input token cost savings on multi-turn conversations.The cache-friendly alternative: Trim at insertion time, not retroactively. Once a tool result enters
messages, it never changes. The trimmed version IS what gets cached.Current State in Hermes Agent
In
_execute_tool_calls()(line 2599-2610 ofrun_agent.py):Problems with current approach:
terminal_tool.pyhas its own limits,web_tools.pysummarizes large pages, but there's no unified safety netThe
context_compressor.pycompression system is the only other defense, but it:Implementation Plan
Skill vs. Tool Classification
This is a core codebase change to
run_agent.py(specifically_execute_tool_calls()). Not a skill or tool.What We'd Need
Phased Rollout
Phase 1: Unified insertion-time trimming
Add a softer trimming tier that runs BEFORE the 100K hard cap:
Integration point — replace lines 2599-2610:
Key property: the tool result is trimmed once, before
messages.append(), and never modified again. Prompt caching prefixes remain stable.Phase 2: Per-tool-type thresholds
Different tools have different information density patterns:
Phase 3: Config exposure and telemetry
tool_result_trimmingconfig section incli.pyCLI_CONFIGPros & Cons
Pros
Cons / Risks
TOOL_RESULT_SOFT_TRIM_CHARSdepends on the model's context window; 12K chars might be too aggressive for 1M-context models or too lenient for 32K modelsOpen Questions
min(12_000, context_length * 0.02))clarify,memory,todo— already small, and their full content matters)References
run_agent.pylines 2599-2617 (tool result insertion + 100K hard cap)agent/prompt_caching.py(system_and_3 strategy, cache breakpoints on system + last 3 messages)