Inspired by @vicnaum's reverse-engineering of Claude Code to add surgical context management, this proposes a new /microcompact command for Hermes Agent. It surgically removes heavy context artifacts — tool call/result pairs AND reasoning/thinking blocks — from conversation history without LLM summarization. Instant, free, and lossless for actual conversational content.
The core insight from vicnaum's work: when context fills up, the current options (/compress for LLM summarization, /clear for nuclear reset) are too coarse. Context is often 70-90% tool call/result pairs and thinking blocks — stripping these selectively can recover enormous context space while preserving every actual user/assistant message exchange intact. As one commenter put it: "/compact is a grenade. This is a scalpel."
This complements the existing automatic compression system (#513, #499, #415) by giving users a manual, surgical option when they want precise control.
Research Findings
How Claude Code's Context Management Works
Claude Code uses a three-layer system, revealed through vicnaum's reverse engineering:
Layer 1: microcompact (silent, every turn). Runs before each API call and replaces old tool_result content with placeholder text like [Previous: used {tool_name}]. Only targets results longer than 100 chars. Keeps a "hot tail" of the N most recent tool results intact. Never removes the tool_use blocks themselves.
Layer 2: auto-compact (threshold triggered). At ~75-95% context capacity, runs full LLM summarization. Saves the full transcript to .transcripts/ before replacing it. A structured summary replaces the entire history.
Layer 3: /compact (user triggered). Manual LLM summarization with optional focus hints.
Vicnaum identified the gap: even after Layer 1 runs, tons of tool artifacts remain. Thinking blocks inside messages with tool_use survive all cleanup. So he built /microcompact and /clear-thinking — instant commands with no LLM calls that surgically strip these artifacts.
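To make the Layer 1 behaviour concrete, here is a minimal sketch of placeholder replacement with a hot tail. It is not Claude Code's actual implementation: it assumes OpenAI-style messages where each tool result is a `role: "tool"` dict with a `name` field, and the function name and parameters (`hot_tail`, `min_chars`) are hypothetical.

```python
def placeholder_old_tool_results(messages, hot_tail=3, min_chars=100):
    """Return a copy of `messages` where older, large tool results
    are replaced by a short placeholder string."""
    tool_indices = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    # The most recent `hot_tail` tool results are left untouched.
    protected = set(tool_indices[-hot_tail:]) if hot_tail > 0 else set()
    out = []
    for i, m in enumerate(messages):
        content = m.get("content") or ""
        if (m.get("role") == "tool"
                and i not in protected
                and isinstance(content, str)
                and len(content) > min_chars):
            # Copy the message so the caller's list is not mutated.
            m = {**m, "content": f"[Previous: used {m.get('name', 'tool')}]"}
        out.append(m)
    return out
```

Note that the tool result messages themselves stay in place, so every tool call still has a matching result. Only the heavy content is swapped out.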
Key Design Decisions
Surgical over summarization — Stripping artifacts is lossless for conversational content. LLM summarization always loses detail and costs an API call.
User-controlled scope — A picker UI lets users choose how far back to strip, rather than all-or-nothing. This preserves recent tool results that may still be relevant.
Prompt caching consideration — Stripping elements from the message array destroys prompt cache prefixes. This is a real tradeoff: you save context space but may increase costs on the next API call due to cache miss. However, when you're at 90%+ context, the alternative (full compaction) destroys the cache anyway.
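The cache tradeoff can be illustrated by measuring how much of the message array is still an unchanged prefix after stripping. The sketch below assumes the provider caches on an exact message prefix, which is a simplification, and `shared_prefix_len` is a hypothetical helper rather than anything in Hermes Agent.

```python
def shared_prefix_len(before, after):
    """Count the leading messages that are identical in both lists.
    Everything after this point would miss a prefix-based cache."""
    n = 0
    for old, new in zip(before, after):
        if old != new:
            break
        n += 1
    return n


before = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "List the files."},
    {"role": "assistant", "content": "", "tool_calls": [{"id": "call_abc"}]},
    {"role": "tool", "content": "a.txt\nb.txt", "tool_call_id": "call_abc"},
    {"role": "assistant", "content": "There are two files."},
]
# Same history with the tool call/result pair stripped.
after = [before[0], before[1], before[4]]

reusable = shared_prefix_len(before, after)
```

Here only the first two messages remain a shared prefix. The earlier in the history a stripped artifact sits, the less of the cached prefix survives.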
Anthropic's Server-Side Context Editing API
Anthropic has released a server-side Context Editing API (beta) that handles this at the API level. See #526 for integration details. The server-side approach preserves prompt cache (edits applied after cache lookup). Anthropic reports 29-39% performance improvement. That issue covers the Anthropic-specific server-side approach; this issue covers the universal client-side approach that works with ALL models.
Current State in Hermes Agent
What We Have
/compress command (cli.py, gateway/run.py) — LLM-based summarization. Protects first 3 + last 4 messages, summarizes middle turns using auxiliary model (Gemini Flash). Costs an API call, loses detail.
Automatic compression (run_agent.py) — Triggers at 85% context capacity. Same LLM summarization as /compress. Also triggers on 413 context-length errors.
100K char hard cap on tool results (run_agent.py L2606) — Only caps individual results at insertion time.
Reasoning storage — Reasoning/thinking text stored in msg["reasoning"] field. reasoning_details preserved for multi-turn continuity. When building API messages, reasoning is converted to reasoning_content for API compatibility.
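As a rough illustration of the reasoning storage described above, the conversion from the stored `reasoning` field to `reasoning_content` might look like the sketch below. `build_api_message` is a hypothetical name, and the real code in run_agent.py may handle more fields than this.

```python
def build_api_message(stored_msg):
    """Copy a stored message, exposing 'reasoning' as 'reasoning_content'."""
    # Copy every field except the internal 'reasoning' key.
    api_msg = {k: v for k, v in stored_msg.items() if k != "reasoning"}
    reasoning = stored_msg.get("reasoning")
    if reasoning:
        api_msg["reasoning_content"] = reasoning
    return api_msg
```

This is why a stripping pass has to clear both `reasoning` and `reasoning_content`: either one may be present depending on where in the pipeline the message was captured.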
What's Missing (the Gap)
No selective stripping — Can't remove specific message types (tool artifacts, thinking) without full summarization
No instant cleanup option — Every cleanup path requires an LLM call or nuclear reset
No thinking block management — Thinking blocks accumulate across turns with no cleanup mechanism
Implementation Plan
Skill vs. Tool Classification
This should be a core codebase change, not a skill or tool. Reasons:
It modifies the conversation message array directly, requiring access to internal state (conversation_history, session transcripts)
It needs integration with the CLI command system and gateway command dispatch
It must coordinate with the existing compression system
It's a fundamental context management capability, same layer as /compress
The Command: /microcompact
A single command that strips BOTH tool artifacts AND thinking/reasoning blocks. One command, one action — no unnecessary complexity.
Usage:
/microcompact # Strip all tool artifacts + thinking, keep last 3 turns intact
/microcompact 5 # Keep last 5 turns intact
/microcompact 0 # Strip everything (aggressive)
What We'd Need
microcompact() function in run_agent.py or a new context_stripper.py:
def microcompact(messages: list[dict], keep_last_n: int = 3) -> list[dict]:
    """Surgically strip tool call/result pairs and thinking blocks from all
    but the last N assistant turns. Preserves all actual user/assistant
    text content. Note: stripped messages are modified in place."""
    # Find assistant messages with tool_calls (these are the "turns" to count)
    tool_turns = [(i, m) for i, m in enumerate(messages)
                  if m.get("role") == "assistant" and m.get("tool_calls")]
    # Determine which turns to strip (all except last N)
    turns_to_strip = tool_turns[:-keep_last_n] if keep_last_n else tool_turns
    # Collect tool_call_ids to remove
    strip_call_ids = set()
    for _idx, msg in turns_to_strip:
        for tc in msg.get("tool_calls", []):
            strip_call_ids.add(tc["id"])
        # Remove tool_calls from the assistant message
        del msg["tool_calls"]
    # Remove corresponding tool result messages
    messages = [m for m in messages
                if not (m.get("role") == "tool"
                        and m.get("tool_call_id") in strip_call_ids)]
    # Remove empty assistant messages (had only tool_calls, no text content).
    # `or ""` guards against content being None, which .strip() would reject.
    messages = [m for m in messages
                if not (m.get("role") == "assistant"
                        and not (m.get("content") or "").strip()
                        and not m.get("tool_calls"))]
    # Strip thinking/reasoning from all but last N assistant messages
    assistant_msgs = [m for m in messages if m.get("role") == "assistant"]
    for msg in (assistant_msgs[:-keep_last_n] if keep_last_n else assistant_msgs):
        msg.pop("reasoning", None)
        msg.pop("reasoning_content", None)
        msg.pop("reasoning_details", None)
        msg.pop("codex_reasoning_items", None)
    return messages
CLI command in cli.py:
Register /microcompact in the COMMANDS dict
Handler: parse optional N argument, call microcompact(), report savings
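The handler steps above (parse N, strip, report savings) could be sketched as follows. This is an illustration only: `parse_keep_last_n`, `approx_chars` and `handle_microcompact` are hypothetical names, `strip_fn` stands in for the `microcompact()` function, and character counts are used as a crude proxy for tokens.

```python
import json


def parse_keep_last_n(arg_text, default=3):
    """Parse the optional N argument of /microcompact."""
    arg_text = arg_text.strip()
    if not arg_text:
        return default
    try:
        n = int(arg_text)
    except ValueError:
        raise ValueError(f"expected a non-negative integer, got {arg_text!r}")
    if n < 0:
        raise ValueError(f"expected a non-negative integer, got {n}")
    return n


def approx_chars(messages):
    """Rough size of the history: serialized length in characters."""
    return sum(len(json.dumps(m, default=str)) for m in messages)


def handle_microcompact(arg_text, messages, strip_fn):
    """Run the stripping function and build a one-line savings report."""
    keep = parse_keep_last_n(arg_text)
    # Measure before calling strip_fn, since it may mutate messages in place.
    count_before, chars_before = len(messages), approx_chars(messages)
    stripped = strip_fn(messages, keep)
    freed = chars_before - approx_chars(stripped)
    report = (f"microcompact: kept last {keep} turns, "
              f"{count_before} -> {len(stripped)} messages, ~{freed} chars freed")
    return stripped, report
```

Taking the size measurement before the call matters because the stripping function modifies some messages in place.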
Technical Details
Message Structure Reference
# Assistant message with tool call + thinking
{"role": "assistant", "content": "", "reasoning": "...(thinking)...",
"reasoning_details": [...],
"tool_calls": [{"id": "call_abc", "function": {"name": "terminal", "arguments": "{...}"}}],
"finish_reason": "tool_calls"}
# Tool result message
{"role": "tool", "content": "...(potentially huge output)...", "tool_call_id": "call_abc"}
# Assistant message with actual text content
{"role": "assistant", "content": "Here's what I found...", "reasoning": "...",
"finish_reason": "stop"}
Orphan Prevention
When stripping tool_calls from an assistant message:
If the message also has content text → keep the message, only remove tool_calls
If the message has NO content (purely a tool-calling turn) → remove the entire message
Always remove the corresponding role: "tool" result messages
This prevents orphaned tool_call_id references
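One way to check the orphan rules after a stripping pass is a small validator that compares the tool call IDs still referenced by assistant messages with the IDs of the remaining tool results. `find_orphans` is a hypothetical helper, sketched under the message layout shown in the reference above.

```python
def find_orphans(messages):
    """Return tool_call_ids that have a call without a result, or a result
    without a call. Both sets are empty for a consistent history."""
    called, answered = set(), set()
    for m in messages:
        if m.get("role") == "assistant":
            for tc in m.get("tool_calls") or []:
                called.add(tc["id"])
        elif m.get("role") == "tool":
            answered.add(m.get("tool_call_id"))
    return {
        "calls_without_result": called - answered,
        "results_without_call": answered - called,
    }
```

Running this in a test after every stripping pass would catch regressions where only one half of a call/result pair is removed.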
What Gets Stripped vs Preserved
| Component | Stripped? | Notes |
| --- | --- | --- |
| tool_calls on assistant msgs | ✅ Yes (except last N turns) | The tool invocation metadata |
| role: "tool" result msgs | ✅ Yes (matching stripped calls) | The heavy tool output content |
| reasoning field | ✅ Yes (except last N turns) | Thinking/reasoning text |
| reasoning_details | ✅ Yes (except last N turns) | Opaque provider reasoning state |
| reasoning_content | ✅ Yes (except last N turns) | API-format reasoning |
| User messages | ❌ Never | All user text preserved |
| Assistant content text | ❌ Never | All assistant text preserved |
| System messages | ❌ Never | System prompt untouched |
Pros & Cons
Pros
Instant and free — No LLM call, no API cost, sub-second execution
Lossless for conversation — Every actual user/assistant text message preserved
Massive space recovery — Tool results are typically 60-80% of context. Thinking blocks 10-20%. Combined: 70-90% recovery.
Dead simple — One command, one function, ~50 lines of core logic
Universal — Works with any model/provider
Interpretable — Single command name, obvious behavior, clear output
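The space-recovery estimate is easy to check on a real session. The sketch below measures how a history splits between tool artifacts, reasoning, and plain text. It counts characters rather than tokens, which is only an approximation, and `context_breakdown` and `shares` are hypothetical helpers.

```python
import json


def context_breakdown(messages):
    """Character totals per category: tool artifacts, reasoning, plain text."""
    totals = {"tool": 0, "reasoning": 0, "text": 0}
    for m in messages:
        content_len = len(str(m.get("content") or ""))
        if m.get("role") == "tool":
            totals["tool"] += content_len
            continue
        totals["text"] += content_len
        if m.get("tool_calls"):
            totals["tool"] += len(json.dumps(m["tool_calls"], default=str))
        if m.get("reasoning"):
            totals["reasoning"] += len(str(m["reasoning"]))
        if m.get("reasoning_details"):
            totals["reasoning"] += len(json.dumps(m["reasoning_details"], default=str))
    return totals


def shares(totals):
    """Convert absolute totals to fractions of the whole."""
    total = sum(totals.values()) or 1
    return {k: v / total for k, v in totals.items()}
```

The `tool` and `reasoning` shares together give an upper bound on what a stripping pass with `keep_last_n=0` could free.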
Cons / Risks
Prompt cache invalidation — Modifying the message array destroys cached prefixes. Same tradeoff as /compress.
Loss of tool context — Model loses knowledge of old tool results. May re-run tools. Mitigated by keeping last N turns.
Reasoning continuity — Stripping reasoning_details may break multi-turn reasoning chains on providers using opaque reasoning state. Mitigated by keeping last N turns' reasoning.
Open Questions
Default keep count — Keep last 3 turns? 5? Claude Code defaults to 3.
Should /compress auto-microcompact first? Before LLM summarization, strip artifacts for free. Relates to #513 (Two-Phase Context Management: prune tool outputs before full compaction, inspired by Kilocode).
In addition to the microcompact() function and the CLI command listed under What We'd Need, the plan also covers:
Gateway command in gateway/run.py:
[…] microcompact to known commands
[…] /compress pattern: load transcript, strip, rewrite
Session transcript rewrite: After stripping, rewrite using rewrite_transcript() (same as /compress uses)
Phased Rollout
Phase 1: The Command
[…] microcompact() stripping function
[…] /microcompact [N] to CLI and gateway
Phase 2: Automatic Integration
[…] microcompact() as Phase 1 pruner before LLM compaction
[…] /compress)
microcompact.keep_last_n: 3 default
Phase 3: Smart Defaults
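The idea of running microcompact as a free first pass before LLM summarization could look roughly like the sketch below. It is a sketch under stated assumptions, not the planned implementation: `two_phase_compact` is a hypothetical name, the token counter, stripping function and summarizer are passed in as stand-ins, and the 0.85 threshold mirrors the 85% trigger of the existing automatic compression.

```python
def two_phase_compact(messages, context_limit, count_tokens,
                      strip_fn, summarize_fn, threshold=0.85):
    """Try free stripping first; fall back to LLM summarization only if the
    history is still over the threshold. Returns (messages, action)."""
    budget = threshold * context_limit
    if count_tokens(messages) < budget:
        return messages, "none"
    stripped = strip_fn(messages)
    if count_tokens(stripped) < budget:
        # Stripping alone was enough: no API call needed.
        return stripped, "microcompact"
    # Still too large: summarize the already-stripped history.
    return summarize_fn(stripped), "summarize"
```

Summarizing the stripped history rather than the original also keeps the summarization call itself smaller.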