Feature: Insertion-Time Tool Result Trimming — Cache-Friendly Context Management

## Overview

Tool results (file reads, terminal output, web extractions, search results) can be enormous — tens of thousands of characters — but Hermes currently only hard-caps them at 100K chars with head-only cutting (`MAX_TOOL_RESULT_CHARS`, line 2603 of `run_agent.py`). Below that 100K cap, results enter the conversation untouched and stay there until the context compressor fires at 85% of the model's context window.

This means a 40K-char `grep` result or a 25K-char file read sits in full in the message array for the rest of the session, consuming tokens without providing proportional value. By the time compression fires, quality has already degraded from context noise.

Inspired by [Utah's pruning system](https://github.com/inngest/utah/blob/main/src/agent-loop.ts), but redesigned around a critical constraint: **prompt caching**. Utah retroactively mutates old messages every iteration, which would destroy Anthropic/OpenAI prefix cache hit rates. Instead, this proposal trims tool results **at insertion time** — before they enter the message array — so cached prefixes are never invalidated.

---

## Research Findings

### Utah's Approach (and Why We Can't Copy It)

Utah runs `pruneOldToolResults(messages)` at the top of every iteration, mutating old tool result messages in-place:

```typescript
// Soft trim: head+tail for results > 4K chars
// Hard clear: replace all old results when total > 50K chars
```

This works for Utah because Inngest's step-based execution doesn't use LLM API prefix caching. **For Hermes, this is a non-starter.** Hermes uses Anthropic prompt caching (`prompt_caching.py`) which caches conversation prefixes across turns. Mutating any message in the prefix invalidates the cache for all subsequent content, eliminating the ~75% input token cost savings on multi-turn conversations.

**The cache-friendly alternative:** Trim at insertion time, not retroactively. Once a tool result enters `messages`, it never changes. The trimmed version IS what gets cached.

### Current State in Hermes Agent

In `_execute_tool_calls()` (line 2599-2610 of `run_agent.py`):

```python
MAX_TOOL_RESULT_CHARS = 100_000
if len(function_result) > MAX_TOOL_RESULT_CHARS:
    original_len = len(function_result)
    function_result = (
        function_result[:MAX_TOOL_RESULT_CHARS]
        + f"\n\n[Cut: tool response was {original_len:,} chars, "
        f"exceeding the {MAX_TOOL_RESULT_CHARS:,} char limit]"
    )
```

Problems with current approach:
- **100K is far too generous** — that's ~25K tokens for a single tool result; most models have 128K total context
- **Head-only removal** loses the end of the output, which often contains the most important information (final lines of a build log, exit codes, summary sections)
- **No softer threshold** — there's no middle ground between "full result" and "100K emergency cap"
- **Individual tools have inconsistent limits** — `terminal_tool.py` has its own limits, `web_tools.py` summarizes large pages, but there's no unified safety net

The `context_compressor.py` compression system is the only other defense, but it:
- Only fires at 85% of context window (reactive, not preventive)
- Requires an auxiliary LLM call (costs money, adds latency)
- Summarizes ALL middle messages, not just bloated tool results

---

## Implementation Plan

### Skill vs. Tool Classification

This is a **core codebase change** to `run_agent.py` (specifically `_execute_tool_calls()`). Not a skill or tool.

### What We'd Need

1. A head+tail trimming function (utility, ~15 lines)
2. Configurable threshold below the existing 100K hard cap
3. Integration at the existing limit point (line 2599-2610)

### Phased Rollout

**Phase 1: Unified insertion-time trimming**

Add a softer trimming tier that runs BEFORE the 100K hard cap:

```python
# Configurable thresholds
TOOL_RESULT_SOFT_TRIM_CHARS = 12_000    # ~3K tokens — trim above this
TOOL_RESULT_HEAD_CHARS = 4_000           # Keep first 4K chars
TOOL_RESULT_TAIL_CHARS = 4_000           # Keep last 4K chars
MAX_TOOL_RESULT_CHARS = 100_000          # Existing hard cap (unchanged)

def _trim_tool_result(result: str, tool_name: str) -> str:
    """Trim large tool results using head+tail strategy.
    
    Preserves the beginning (what was requested, headers, context)
    and end (final output, exit codes, summaries) of tool output.
    """
    if len(result) <= TOOL_RESULT_SOFT_TRIM_CHARS:
        return result
    
    trimmed_chars = len(result) - TOOL_RESULT_HEAD_CHARS - TOOL_RESULT_TAIL_CHARS
    return (
        result[:TOOL_RESULT_HEAD_CHARS]
        + f"\n\n[... {trimmed_chars:,} chars trimmed from {tool_name} output ...]\n\n"
        + result[-TOOL_RESULT_TAIL_CHARS:]
    )
```

Integration point — replace lines 2599-2610:

```python
# Phase 1: Soft trim large results (head + tail)
function_result = _trim_tool_result(function_result, function_name)

# Phase 2: Hard cap as emergency brake (unchanged)
if len(function_result) > MAX_TOOL_RESULT_CHARS:
    ...
```

Key property: **the tool result is trimmed once, before `messages.append()`, and never modified again.** Prompt caching prefixes remain stable.

**Phase 2: Per-tool-type thresholds**

Different tools have different information density patterns:

```python
TOOL_TRIM_PROFILES = {
    "terminal": {"soft_trim": 15_000, "head": 2_000, "tail": 8_000},   # tail-heavy: exit codes, final output
    "read_file": {"soft_trim": 10_000, "head": 5_000, "tail": 3_000},  # head-heavy: file start often most relevant
    "search_files": {"soft_trim": 8_000, "head": 4_000, "tail": 4_000}, # balanced: matches spread throughout
    "web_extract": {"soft_trim": 8_000, "head": 4_000, "tail": 2_000},  # head-heavy: summaries at top
    "default": {"soft_trim": 12_000, "head": 4_000, "tail": 4_000},
}
```

**Phase 3: Config exposure and telemetry**

- Add `tool_result_trimming` config section in `cli.py` CLI_CONFIG
- Log trimming events at debug level (tool name, original size, trimmed size)
- Track cumulative tokens saved per session for observability
- Consider making the soft threshold adaptive based on remaining context budget

---

## Pros & Cons

### Pros
- **Cache-friendly** — messages are only written once, never retroactively modified; prompt caching prefixes stay stable
- **Reduces token costs** — a 40K-char tool result becomes ~8K at insertion, saving ~8K tokens per bloated result
- **Reduces compression frequency** — fewer tokens in messages means the 85% threshold is hit less often, avoiding expensive LLM summarization calls
- **Preserves key information** — head+tail captures both the context (what was requested) and the conclusion (exit codes, final output, summaries)
- **Trivial to implement** — ~20 lines, fits at the existing limit point
- **Complements existing defenses** — sits below the 100K hard cap and above the 85% compression threshold

### Cons / Risks
- **Information loss** — middle content is discarded; in some cases (e.g., a specific error in a large log) the relevant info is in the middle
- **Threshold tuning** — the right `TOOL_RESULT_SOFT_TRIM_CHARS` depends on the model's context window; 12K chars might be too aggressive for 1M-context models or too lenient for 32K models
- **Interaction with tool-level limits** — some tools already limit their own output; double-trimming could over-reduce (mitigation: the outer trim is a no-op if the tool already trimmed below threshold)
- **Less context for complex debugging** — when the agent is debugging a large file and needs to reference full output from 5 turns ago, it won't be there

---

## Open Questions

- Should the soft trim threshold scale with the model's context window? (e.g., `min(12_000, context_length * 0.02)`)
- Should certain tool names be exempt from trimming? (e.g., `clarify`, `memory`, `todo` — already small, and their full content matters)
- Should the head/tail ratio be configurable or fixed? (Recommend: fixed defaults with Phase 2 per-tool profiles)
- Should trimming be logged to the user or just debug-level? (Recommend: debug-level, with a note in the tool result itself via the trimmed indicator)
- What's the right default threshold? 12K chars (~3K tokens) seems reasonable — that's still a substantial amount of context per tool result

---

## References

- [Utah source: pruneOldToolResults in agent-loop.ts](https://github.com/inngest/utah/blob/main/src/agent-loop.ts) (lines 49-96) — inspiration, but retroactive approach is incompatible with prefix caching
- [Blog post: "Your Agent Needs a Harness, Not a Framework"](https://www.inngest.com/blog/your-agent-needs-a-harness-not-a-framework)
- Hermes current implementation: `run_agent.py` lines 2599-2617 (tool result insertion + 100K hard cap)
- Hermes prompt caching: `agent/prompt_caching.py` (system_and_3 strategy, cache breakpoints on system + last 3 messages)
- Related issue: #132 (context length assumptions — correct context_length needed for adaptive thresholds)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: Insertion-Time Tool Result Trimming — Cache-Friendly Context Management #415

Overview

Research Findings

Utah's Approach (and Why We Can't Copy It)

Current State in Hermes Agent

Implementation Plan

Skill vs. Tool Classification

What We'd Need

Phased Rollout

Pros & Cons

Pros

Cons / Risks

Open Questions

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Feature: Insertion-Time Tool Result Trimming — Cache-Friendly Context Management #415

Description

Overview

Research Findings

Utah's Approach (and Why We Can't Copy It)

Current State in Hermes Agent

Implementation Plan

Skill vs. Tool Classification

What We'd Need

Phased Rollout

Pros & Cons

Pros

Cons / Risks

Open Questions

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions