Feature: /microcompact — Instant, LLM-Free Surgical Context Stripping (inspired by vicnaum's Claude Code RE) #525

@teknium1

Overview

Inspired by @vicnaum's reverse-engineering of Claude Code to add surgical context management, this proposes a new /microcompact command for Hermes Agent. It surgically removes heavy context artifacts — tool call/result pairs AND reasoning/thinking blocks — from conversation history without LLM summarization. Instant, free, and lossless for actual conversational content.

The core insight from vicnaum's work: when context fills up, the current options (/compress for LLM summarization, /clear for nuclear reset) are too coarse. Context is often 70-90% tool call/result pairs and thinking blocks — stripping these selectively can recover enormous context space while preserving every actual user/assistant message exchange intact. As one commenter put it: "/compact is a grenade. This is a scalpel."

This complements the existing automatic compression system (#513, #499, #415) by giving users a manual, surgical option when they want precise control.


Research Findings

How Claude Code's Context Management Works

Claude Code uses a three-layer system, revealed through vicnaum's reverse engineering:

Layer 1 — microcompact (silent, every turn): Runs silently before each API call. Replaces OLD tool_result content with placeholder text like [Previous: used {tool_name}]. Only targets results > 100 chars. Keeps a "hot tail" of the N most recent tool results intact. Never removes tool_use blocks themselves.

Layer 2 — auto-compact (threshold triggered): At ~75-95% context capacity, runs full LLM summarization. Saves full transcript to .transcripts/ before replacing. Structured summary replaces entire history.

Layer 3 — /compact (user triggered): Manual LLM summarization with optional focus hints.
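The Layer 1 behaviour described above can be sketched as follows. This is an illustration only, not Claude Code's actual implementation: it assumes the OpenAI-style message shape shown later in this issue (role "tool" messages with tool_call_id) rather than Anthropic's tool_use/tool_result blocks, and the function name, the hot_tail default of 3, and the fallback name "tool" are all hypothetical.

```python
def placeholder_old_tool_results(messages, hot_tail=3, min_chars=100):
    """Return a copy of `messages` in which old, large tool results are
    replaced by a short placeholder. The last `hot_tail` tool results and
    all assistant tool_calls are left intact."""
    # Map each tool_call_id to its tool name so the placeholder can cite it
    names = {}
    for m in messages:
        for tc in m.get("tool_calls") or []:
            names[tc["id"]] = tc["function"]["name"]

    # Indices of the most recent tool results, which stay untouched
    tool_idxs = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    protected = set(tool_idxs[-hot_tail:]) if hot_tail > 0 else set()

    out = []
    for i, m in enumerate(messages):
        content = m.get("content")
        if (m.get("role") == "tool" and i not in protected
                and isinstance(content, str) and len(content) > min_chars):
            name = names.get(m.get("tool_call_id"), "tool")
            # Build a new dict so the caller's message is not mutated
            m = {**m, "content": f"[Previous: used {name}]"}
        out.append(m)
    return out
```

Because only the result content is replaced, every tool_call_id still has a matching result, which is why this layer alone leaves so many artifacts behind.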

Vicnaum identified the gap: even after Layer 1 runs, tons of tool artifacts remain. Thinking blocks inside messages with tool_use survive all cleanup. So he built /microcompact and /clear-thinking — instant commands with no LLM calls that surgically strip these artifacts.

Key Design Decisions

  1. Surgical over summarization — Stripping artifacts is lossless for conversational content. LLM summarization always loses detail and costs an API call.

  2. User-controlled scope — A picker UI lets users choose how far back to strip, rather than all-or-nothing. This preserves recent tool results that may still be relevant.

  3. Prompt caching consideration — Stripping elements from the message array destroys prompt cache prefixes. This is a real tradeoff: you save context space but may increase costs on the next API call due to cache miss. However, when you're at 90%+ context, the alternative (full compaction) destroys the cache anyway.
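To make the caching tradeoff concrete, the sketch below counts how many leading messages survive a strip unchanged. It assumes the provider's cache matches on an exact prefix of the request, so anything after the first changed message would be a cache miss; shared_prefix_len is a hypothetical helper, not part of Hermes.

```python
def shared_prefix_len(before, after):
    """Count the leading messages that are identical in both lists.
    Under a prefix-based cache, only this many messages could still be
    served from the cache after the history has been edited."""
    n = 0
    for old, new in zip(before, after):
        if old != new:
            break
        n += 1
    return n
```

If the first stripped tool turn sits near the start of a long session, the shared prefix is only a handful of messages, which is the cost described in point 3.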

Anthropic's Server-Side Context Editing API

Anthropic has released a server-side Context Editing API (beta) that handles this at the API level. See #526 for integration details. The server-side approach preserves the prompt cache (edits are applied after the cache lookup). Anthropic reports a 29-39% performance improvement. That issue covers the Anthropic-specific server-side approach; this issue covers the universal client-side approach that works with ALL models.


Current State in Hermes Agent

What We Have

  1. /compress command (cli.py, gateway/run.py) — LLM-based summarization. Protects first 3 + last 4 messages, summarizes middle turns using auxiliary model (Gemini Flash). Costs an API call, loses detail.

  2. Automatic compression (run_agent.py) — Triggers at 85% context capacity. Same LLM summarization as /compress. Also triggers on 413 context-length errors.

  3. 100K char hard cap on tool results (run_agent.py L2606) — Only caps individual results at insertion time.

  4. Reasoning storage — Reasoning/thinking text stored in msg["reasoning"] field. reasoning_details preserved for multi-turn continuity. When building API messages, reasoning is converted to reasoning_content for API compatibility.

What's Missing (the Gap)

  • No selective stripping — Can't remove specific message types (tool artifacts, thinking) without full summarization
  • No instant cleanup option — Every cleanup path requires an LLM call or nuclear reset
  • No thinking block management — Thinking blocks accumulate across turns with no cleanup mechanism

Related Open Issues


Implementation Plan

Skill vs. Tool Classification

This should be a core codebase change, not a skill or tool. Reasons:

  • It modifies the conversation message array directly, requiring access to internal state (conversation_history, session transcripts)
  • It needs integration with the CLI command system and gateway command dispatch
  • It must coordinate with the existing compression system
  • It's a fundamental context management capability, same layer as /compress

The Command: /microcompact

A single command that strips BOTH tool artifacts AND thinking/reasoning blocks. One command, one action — no unnecessary complexity.

Usage:

/microcompact        # Strip all tool artifacts + thinking, keep last 3 turns intact
/microcompact 5      # Keep last 5 turns intact
/microcompact 0      # Strip everything (aggressive)
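Parsing the optional argument could look like the sketch below. The helper name parse_keep_last_n and its error messages are hypothetical; it assumes the handler receives everything after the command name as a single string.

```python
def parse_keep_last_n(arg_text, default=3):
    """Parse the optional N argument of /microcompact.
    An empty argument falls back to `default`; anything that is not a
    non-negative integer raises ValueError."""
    arg_text = arg_text.strip()
    if not arg_text:
        return default
    try:
        n = int(arg_text)
    except ValueError:
        raise ValueError(
            f"expected a non-negative integer, got {arg_text!r}") from None
    if n < 0:
        raise ValueError(f"expected a non-negative integer, got {n}")
    return n
```

Rejecting negative values up front matters because a negative N would otherwise flow into list slicing and silently change which turns are kept.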

What We'd Need

  1. microcompact() function in run_agent.py or a new context_stripper.py:
def microcompact(messages: list[dict], keep_last_n: int = 3) -> list[dict]:
    """Surgically strip tool call/result pairs and thinking blocks
    from all but the last N assistant turns. Preserves all actual
    user/assistant text content. Returns a new list; the caller's
    messages are not mutated."""
    if keep_last_n < 0:
        raise ValueError("keep_last_n must be >= 0")

    # Shallow-copy each message so edits don't leak into the caller's history
    messages = [dict(m) for m in messages]

    # Find assistant messages with tool_calls (these are the "turns" to count)
    tool_turns = [m for m in messages
                  if m.get("role") == "assistant" and m.get("tool_calls")]

    # Determine which turns to strip (all except last N; N == 0 strips all)
    turns_to_strip = tool_turns[:-keep_last_n] if keep_last_n else tool_turns

    # Collect tool_call_ids to remove and drop tool_calls from those turns
    strip_call_ids = set()
    for msg in turns_to_strip:
        for tc in msg["tool_calls"]:
            strip_call_ids.add(tc["id"])
        del msg["tool_calls"]

    # Remove corresponding tool result messages
    messages = [m for m in messages
                if not (m.get("role") == "tool" and
                        m.get("tool_call_id") in strip_call_ids)]

    def is_blank(content) -> bool:
        # content may be None (common on tool-calling turns) or a string
        return content is None or (isinstance(content, str)
                                   and not content.strip())

    # Remove empty assistant messages (had only tool_calls, no text content)
    messages = [m for m in messages
                if not (m.get("role") == "assistant" and
                        is_blank(m.get("content")) and
                        not m.get("tool_calls"))]

    # Strip thinking/reasoning from all but last N assistant messages
    assistant_msgs = [m for m in messages if m.get("role") == "assistant"]
    old_msgs = assistant_msgs[:-keep_last_n] if keep_last_n else assistant_msgs
    for msg in old_msgs:
        for key in ("reasoning", "reasoning_content",
                    "reasoning_details", "codex_reasoning_items"):
            msg.pop(key, None)

    return messages
  2. CLI command in cli.py:

    • Register /microcompact in the COMMANDS dict
    • Handler: parse optional N argument, call microcompact(), report savings
  3. Gateway command in gateway/run.py:

    • Add microcompact to known commands
    • Handler mirrors /compress pattern: load transcript, strip, rewrite
  4. Session transcript rewrite: after stripping, rewrite using rewrite_transcript() (the same function /compress uses)

Phased Rollout

Phase 1: The Command

  • Implement microcompact() stripping function
  • Add /microcompact [N] to CLI and gateway
  • Report before/after token estimates and message counts
  • Default: keep last 3 turns intact
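The before/after report could be produced roughly as sketched below. It uses the rough 4 chars/token estimate raised under Open Questions rather than a real tokenizer, and both function names and the output format are hypothetical.

```python
import json


def estimate_tokens(messages, chars_per_token=4):
    """Rough token estimate: serialized character count divided by
    chars_per_token. This is a heuristic, not a tokenizer."""
    total_chars = sum(len(json.dumps(m, ensure_ascii=False)) for m in messages)
    return total_chars // chars_per_token


def savings_report(before, after):
    """One-line summary of message counts and estimated tokens freed."""
    tokens_before = estimate_tokens(before)
    tokens_after = estimate_tokens(after)
    freed = tokens_before - tokens_after
    pct = 100 * freed / tokens_before if tokens_before else 0.0
    return (f"Messages: {len(before)} -> {len(after)} | "
            f"est. tokens: ~{tokens_before:,} -> ~{tokens_after:,} "
            f"({pct:.0f}% freed)")
```

Serializing the whole message (not only content) means tool_calls arguments and reasoning fields count toward the estimate, which is where most of the savings come from.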

Phase 2: Automatic Integration

Phase 3: Smart Defaults


Technical Details

Message Structure Reference

# Assistant message with tool call + thinking
{"role": "assistant", "content": "", "reasoning": "...(thinking)...",
 "reasoning_details": [...],
 "tool_calls": [{"id": "call_abc", "function": {"name": "terminal", "arguments": "{...}"}}],
 "finish_reason": "tool_calls"}

# Tool result message  
{"role": "tool", "content": "...(potentially huge output)...", "tool_call_id": "call_abc"}

# Assistant message with actual text content
{"role": "assistant", "content": "Here's what I found...", "reasoning": "...",
 "finish_reason": "stop"}

Orphan Prevention

When stripping tool_calls from an assistant message:

  • If the message also has content text → keep the message, only remove tool_calls
  • If the message has NO content (purely a tool-calling turn) → remove the entire message
  • Always remove the corresponding role: "tool" result messages
  • This prevents orphaned tool_call_id references
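A small check like the one below could be run after stripping (for example in tests) to confirm the rules above hold. find_orphans is a hypothetical helper that assumes the message shape from the Message Structure Reference.

```python
def find_orphans(messages):
    """Return (unanswered_call_ids, orphaned_result_ids).

    unanswered_call_ids: tool_calls with no matching role "tool" message.
    orphaned_result_ids: role "tool" messages whose tool_call_id is not
    referenced by any assistant tool_calls entry."""
    call_ids = set()
    result_ids = set()
    for m in messages:
        if m.get("role") == "assistant":
            for tc in m.get("tool_calls") or []:
                call_ids.add(tc["id"])
        elif m.get("role") == "tool":
            result_ids.add(m.get("tool_call_id"))
    return call_ids - result_ids, result_ids - call_ids
```

Both sets should be empty for a well-formed history; a non-empty set after stripping would indicate that a call and its result were not removed together.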

What Gets Stripped vs Preserved

| Component | Stripped? | Notes |
| --- | --- | --- |
| tool_calls on assistant msgs | ✅ Yes (except last N turns) | The tool invocation metadata |
| role: "tool" result msgs | ✅ Yes (matching stripped calls) | The heavy tool output content |
| reasoning field | ✅ Yes (except last N turns) | Thinking/reasoning text |
| reasoning_details | ✅ Yes (except last N turns) | Opaque provider reasoning state |
| reasoning_content | ✅ Yes (except last N turns) | API-format reasoning |
| User messages | ❌ Never | All user text preserved |
| Assistant content text | ❌ Never | All assistant text preserved |
| System messages | ❌ Never | System prompt untouched |

Pros & Cons

Pros

  • Instant and free — No LLM call, no API cost, sub-second execution
  • Lossless for conversation — Every actual user/assistant text message preserved
  • Massive space recovery — Tool results are typically 60-80% of context. Thinking blocks 10-20%. Combined: 70-90% recovery.
  • Dead simple — One command, one function, ~50 lines of core logic
  • Universal — Works with any model/provider
  • Interpretable — Single command name, obvious behavior, clear output

Cons / Risks

  • Prompt cache invalidation — Modifying the message array destroys cached prefixes. Same tradeoff as /compress.
  • Loss of tool context — Model loses knowledge of old tool results. May re-run tools. Mitigated by keeping last N turns.
  • Reasoning continuity — Stripping reasoning_details may break multi-turn reasoning chains on providers using opaque reasoning state. Mitigated by keeping last N turns' reasoning.

Open Questions

  1. Default keep count — Keep last 3 turns? 5? Claude Code defaults to 3.
  2. Should /compress auto-microcompact first? Before LLM summarization, strip artifacts for free. Relates to #513 (Two-Phase Context Management: Prune Tool Outputs Before Full Compaction, inspired by Kilocode).
  3. Token counting — Show exact counts or rough estimates (4 chars/token)?
