Feature: Anthropic Context Editing API Integration — Server-Side, Cache-Friendly Tool/Thinking Cleanup for Claude Models #526

@teknium1

Overview

Anthropic has released a server-side Context Editing API (beta header anthropic-beta: context-management-2025-06-27) that automatically manages context by clearing old tool use/result pairs and thinking blocks at the API level, before token counting and after prompt cache lookup. Context is therefore managed without destroying prompt cache prefixes, a significant advantage over client-side stripping.

This was discovered while researching @vicnaum's reverse-engineering of Claude Code, which revealed that Claude Code already uses internal microcompact logic. Anthropic has now exposed similar functionality as a public API parameter. Anthropic reports a 29% performance improvement over baseline with context editing alone, and 39% with context editing plus the memory tool.

Since Hermes Agent already supports Anthropic/Claude models (via OpenRouter and direct API), integrating this API would provide zero-cost, cache-friendly context management for Claude users with minimal implementation effort.


Research Findings

How the Context Editing API Works

The API adds a context_management parameter to the Messages API request body, containing an ordered list of edits:

{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 8096,
  "context_management": {
    "edits": [
      {
        "type": "clear_thinking_20251015",
        "keep": {"type": "thinking_turns", "value": 2}
      },
      {
        "type": "clear_tool_uses_20250919",
        "trigger": {"type": "input_tokens", "value": 50000},
        "keep": {"type": "tool_uses", "value": 5},
        "clear_at_least": {"type": "input_tokens", "value": 5000},
        "exclude_tools": ["web_search"]
      }
    ]
  },
  "messages": [...]
}

Edit Type 1: clear_tool_uses_20250919

  • Clears oldest tool_use/tool_result pairs when input tokens exceed a trigger threshold
  • Configurable parameters:
    • trigger — Input token count to activate (default ~100K)
    • keep — Number of recent tool pairs to preserve (default 3)
    • clear_at_least — Minimum tokens to clear per activation
    • exclude_tools — Tool names never cleared (e.g., important tools like memory, skill)
    • clear_tool_inputs — Also clear tool input params, not just results (default false)

Edit Type 2: clear_thinking_20251015

  • Manages extended thinking blocks
  • Default: keeps only the last assistant turn's thinking
  • Config: keep: {"type": "thinking_turns", "value": N} or keep: "all" to maximize cache hits
  • Must be listed FIRST in the edits array
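
The two edit types above can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named build_context_management that is not part of Hermes or any SDK; the field names are taken from the request example above, and the helper only enforces the ordering rule (thinking edit first).

```python
from typing import Any, Dict, List, Optional


def build_context_management(
    keep_thinking_turns: Optional[int] = 2,
    trigger_tokens: int = 50_000,
    keep_tool_uses: int = 5,
    clear_at_least: Optional[int] = None,
    exclude_tools: Optional[List[str]] = None,
    clear_tool_inputs: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: build the context_management payload as a dict."""
    edits: List[Dict[str, Any]] = []

    # The thinking edit has to come first in the edits array.
    # Pass keep_thinking_turns=None to omit it entirely.
    if keep_thinking_turns is not None:
        edits.append({
            "type": "clear_thinking_20251015",
            "keep": {"type": "thinking_turns", "value": keep_thinking_turns},
        })

    tool_edit: Dict[str, Any] = {
        "type": "clear_tool_uses_20250919",
        "trigger": {"type": "input_tokens", "value": trigger_tokens},
        "keep": {"type": "tool_uses", "value": keep_tool_uses},
    }
    # Optional fields are only sent when explicitly configured.
    if clear_at_least is not None:
        tool_edit["clear_at_least"] = {"type": "input_tokens", "value": clear_at_least}
    if exclude_tools:
        tool_edit["exclude_tools"] = list(exclude_tools)
    if clear_tool_inputs:
        tool_edit["clear_tool_inputs"] = True
    edits.append(tool_edit)

    return {"edits": edits}
```

Calling build_context_management(clear_at_least=5000, exclude_tools=["web_search"]) should reproduce the context_management object from the request example above.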

Key Advantages Over Client-Side Stripping

  1. Prompt cache preservation — Edits are applied server-side AFTER cache lookup, so existing cached prefixes are reused. Client-side modifications to the message array invalidate the cache.
  2. Zero implementation complexity — Just add parameters to the API request. No message manipulation logic needed.
  3. Anthropic-optimized — Anthropic knows the exact tokenization, can make optimal clearing decisions.
  4. Automatic — No user intervention needed. Works transparently on every API call.

Limitations

  • Claude-only — Only works with Anthropic's Messages API. Not available for OpenAI, Gemini, or other providers.
  • Beta status — Requires beta header. May change before GA.
  • OpenRouter compatibility unknown — Need to verify if OpenRouter passes through context_management and beta headers to Anthropic.
  • Less user control — Automatic only, no manual trigger. For manual control, see the companion client-side stripping issue.

Current State in Hermes Agent

API Call Path

Hermes builds API kwargs in _build_api_kwargs() (run_agent.py L2061-2149):

api_kwargs = {
    "model": self.model,
    "messages": api_messages,
    "tools": self.tools if self.tools else None,
    "timeout": 900.0,
}
extra_body = {}
# ... provider preferences, reasoning config ...
if extra_body:
    api_kwargs["extra_body"] = extra_body

The extra_body mechanism already exists and is used for OpenRouter provider preferences and reasoning config. The context_management parameter could be injected here for Anthropic models.
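
One possible shape for that injection is sketched below. It assumes a hypothetical standalone function, inject_context_management, rather than the real _build_api_kwargs() body, and it assumes the upstream endpoint accepts context_management as a top-level body field delivered via extra_body, which still needs to be verified (especially for OpenRouter).

```python
from typing import Any, Dict


def inject_context_management(
    api_kwargs: Dict[str, Any],
    context_management: Dict[str, Any],
) -> Dict[str, Any]:
    """Return a copy of api_kwargs with context_management added to extra_body.

    Hypothetical helper: existing extra_body entries (provider preferences,
    reasoning config) are preserved, and the input dict is not mutated.
    """
    extra_body = dict(api_kwargs.get("extra_body") or {})
    extra_body["context_management"] = context_management
    return {**api_kwargs, "extra_body": extra_body}
```

Returning a new dict instead of mutating in place keeps the change easy to gate behind a provider check and easy to undo if the server rejects the parameter.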

Provider Detection

Hermes already detects OpenRouter ("openrouter" in self.base_url.lower()) and Nous Portal ("nousresearch" in self.base_url.lower()). It would need to also detect direct Anthropic API usage and determine model provider when going through OpenRouter (e.g., anthropic/claude-* model strings).
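
A rough sketch of that detection follows. The function name is_anthropic_target and the force override are assumptions for illustration; the substring checks mirror the style of the existing OpenRouter and Nous Portal checks described above.

```python
from typing import Optional


def is_anthropic_target(
    base_url: Optional[str],
    model: str,
    force: Optional[bool] = None,
) -> bool:
    """Guess whether a request will be served by an Anthropic model.

    Hypothetical helper. `force` stands in for an explicit config flag and
    wins over any heuristic when it is set.
    """
    if force is not None:
        return force
    url = (base_url or "").lower()
    if "anthropic" in url:
        # Direct Anthropic API access.
        return True
    if "openrouter" in url:
        # OpenRouter model strings carry the provider as a prefix.
        return model.lower().startswith("anthropic/")
    return False
```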

No Existing Support

  • No context_management parameter is currently sent
  • No Anthropic beta headers are set
  • No model-provider-specific API features beyond reasoning config

Implementation Plan

Skill vs. Tool Classification

This should be a core codebase change to run_agent.py (specifically _build_api_kwargs()). Reasons:

  • Requires modification of API request parameters at the lowest level
  • Needs provider detection logic already in the codebase
  • Must integrate with existing context management (coordinate with compression thresholds)
  • Not expressible as instructions + shell commands (skill) or a standalone callable (tool)

What We'd Need

  1. Provider detection — Determine if the target model is Claude/Anthropic:
    • Direct Anthropic API: base_url contains anthropic
    • OpenRouter: model string starts with anthropic/
    • Config flag: explicit context_editing: true in model config
  2. Context management injection — In _build_api_kwargs(), conditionally add context_management to extra_body for Anthropic models
  3. Configuration — User-configurable parameters:
    • Enable/disable context editing
    • Tool use trigger threshold
    • Keep count for tool uses and thinking turns
    • Excluded tools list
  4. Beta header — For direct Anthropic API access, pass the beta header. For OpenRouter, verify passthrough behavior.
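
For item 4, the header could be merged into whatever per-request headers the client already sends. This is a sketch only: beta_headers is a hypothetical helper, and it assumes the HTTP client accepts a per-request headers mapping. It appends to an existing anthropic-beta value rather than overwriting it, since that header can carry a comma-separated list of betas.

```python
from typing import Dict, Optional

CONTEXT_MANAGEMENT_BETA = "context-management-2025-06-27"


def beta_headers(
    base_url: str,
    existing: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Return request headers with the context-management beta enabled.

    Hypothetical helper: only direct Anthropic endpoints get the header,
    because OpenRouter passthrough has not been verified.
    """
    headers = dict(existing or {})
    if "anthropic" not in base_url.lower():
        return headers
    current = headers.get("anthropic-beta", "")
    values = [v.strip() for v in current.split(",") if v.strip()]
    if CONTEXT_MANAGEMENT_BETA not in values:
        values.append(CONTEXT_MANAGEMENT_BETA)
    headers["anthropic-beta"] = ",".join(values)
    return headers
```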

Phased Rollout

Phase 1: Basic Integration

  • Add context_management to extra_body for Anthropic models
  • Default conservative settings: trigger at 60% of context window, keep last 5 tool uses, keep last 2 thinking turns
  • Config option to enable/disable: context_editing.enabled: true
  • Exclude critical tools by default: memory, skill_manage, todo
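
The "60% of context window" default can be computed from the model's context length. A small sketch, assuming a hypothetical default_trigger_tokens helper and that the context window size is already known to the caller; integer arithmetic avoids floating-point rounding surprises.

```python
def default_trigger_tokens(context_window: int, percent: int = 60) -> int:
    """Return the input-token count at which tool-use clearing should start.

    Hypothetical helper: `percent` is the share of the context window
    (60 matches the conservative Phase 1 default).
    """
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return context_window * percent // 100
```

For a 200,000-token window this gives a trigger of 120,000 input tokens.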

Phase 2: Configuration & Tuning

  • Expose all parameters in config.yaml:
    context_editing:
      enabled: true
      trigger_tokens: 100000
      keep_tool_uses: 5
      keep_thinking_turns: 2
      exclude_tools: [memory, skill_manage, todo]
      clear_tool_inputs: false
  • Auto-scale trigger threshold based on model context window
  • Coordinate with existing compression: if context editing is active, raise compression threshold (since context editing handles the first layer of cleanup)
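
A sketch of how the config block and the compression coordination might fit together is shown below. Everything here is illustrative: the function names are hypothetical, the opt-in default (enabled: False) follows the recommendation under Open Questions, and the 85% and 95% thresholds are the example figures from that section rather than confirmed Hermes values.

```python
from typing import Any, Dict

# Defaults mirror the config.yaml block above, except `enabled`,
# which is assumed to be opt-in for Phase 1.
DEFAULT_CONTEXT_EDITING: Dict[str, Any] = {
    "enabled": False,
    "trigger_tokens": 100_000,
    "keep_tool_uses": 5,
    "keep_thinking_turns": 2,
    "exclude_tools": ["memory", "skill_manage", "todo"],
    "clear_tool_inputs": False,
}


def load_context_editing(config: Dict[str, Any]) -> Dict[str, Any]:
    """Merge the user's context_editing section over the defaults."""
    user = config.get("context_editing") or {}
    unknown = set(user) - set(DEFAULT_CONTEXT_EDITING)
    if unknown:
        # Fail loudly on typos instead of silently ignoring them.
        raise ValueError(f"unknown context_editing keys: {sorted(unknown)}")
    merged = {**DEFAULT_CONTEXT_EDITING, **user}
    merged["exclude_tools"] = list(merged["exclude_tools"])  # avoid sharing the default list
    return merged


def compression_threshold(
    settings: Dict[str, Any],
    base: float = 0.85,
    raised: float = 0.95,
) -> float:
    """Raise the client-side compression threshold while context editing is active."""
    return max(base, raised) if settings["enabled"] else base
```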

Phase 3: Smart Defaults & Monitoring

  • Auto-detect Anthropic models and enable context editing by default
  • Log context editing activity (how many tokens cleared, how often it triggers)
  • Surface context editing stats in /usage command
  • Investigate OpenRouter passthrough behavior and document compatibility
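
For the logging and /usage items, the stats could be derived from the API response. The sketch below assumes the response body reports applied edits under context_management.applied_edits, with per-edit cleared_input_tokens and cleared_tool_uses counters; that shape is an assumption to check against the current beta documentation, and summarize_applied_edits is a hypothetical helper.

```python
from typing import Any, Dict


def summarize_applied_edits(response_body: Dict[str, Any]) -> Dict[str, int]:
    """Aggregate context-editing activity from one response (assumed shape)."""
    section = response_body.get("context_management") or {}
    applied = section.get("applied_edits") or []
    return {
        "edits_applied": len(applied),
        "cleared_input_tokens": sum(e.get("cleared_input_tokens", 0) for e in applied),
        "cleared_tool_uses": sum(e.get("cleared_tool_uses", 0) for e in applied),
    }
```

Missing or empty sections produce all-zero stats, so the same call is safe for providers that do not return this field.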

Pros & Cons

Pros

  • Prompt cache friendly — The #1 advantage. Server-side editing preserves cached prefixes, reducing costs.
  • Zero client complexity — Just add a dict to the API request. ~20 lines of code.
  • Proven at scale — Claude Code already uses similar internal microcompact logic, and Anthropic reports a 29-39% performance improvement.
  • Automatic — No user action needed. Works transparently.
  • Composable — Works alongside client-side stripping and LLM compaction for defense-in-depth.

Cons / Risks

  • Anthropic-only — Does nothing for OpenAI, Gemini, open-source models. Client-side stripping (companion issue) covers those.
  • Beta API — May change. Need to handle gracefully if the server rejects the parameter.
  • OpenRouter uncertainty — OpenRouter may not pass through context_management or beta headers. Needs testing.
  • Reduced visibility — Server-side clearing is invisible to the client. The model may behave differently than expected because old tool results are silently cleared. Need to log when this happens.
  • Coordination complexity — Must coordinate with existing compression thresholds. If context editing clears 40K tokens server-side, the client's token counting won't reflect this, potentially causing premature compression triggers.
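
The "handle gracefully" point under Beta API could be addressed with a single retry that drops the parameter. This is a sketch under stated assumptions: call_with_fallback, send, and is_rejection are hypothetical names, and how a rejection is recognized (status code, error message) depends on the client library and is left to the caller.

```python
from typing import Any, Callable, Dict


def call_with_fallback(
    send: Callable[..., Any],
    api_kwargs: Dict[str, Any],
    is_rejection: Callable[[Exception], bool],
) -> Any:
    """Call send(**api_kwargs); retry once without context_management if rejected."""
    try:
        return send(**api_kwargs)
    except Exception as exc:
        extra_body = api_kwargs.get("extra_body") or {}
        # Only retry when we actually sent the parameter and the error is about it.
        if "context_management" not in extra_body or not is_rejection(exc):
            raise
        stripped = {k: v for k, v in extra_body.items() if k != "context_management"}
        retry_kwargs = {k: v for k, v in api_kwargs.items() if k != "extra_body"}
        if stripped:
            retry_kwargs["extra_body"] = stripped
        return send(**retry_kwargs)
```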

Open Questions

  1. OpenRouter passthrough — Does OpenRouter forward context_management and Anthropic beta headers to the Anthropic backend? This is critical since most Hermes users go through OpenRouter.
  2. Token counting sync — After server-side editing, the client's token estimate diverges from reality. How do we handle this? Could use the API response's usage field to recalibrate.
  3. Coordination with compression — If context editing is active, should we increase the compression threshold? E.g., from 85% to 95%, since context editing provides a buffer?
  4. Should this be opt-in or opt-out? — Conservative: opt-in. Aggressive: auto-enable for Anthropic models. Recommended: opt-in in Phase 1, auto-enable in Phase 3 after validation.
  5. Other providers — Will OpenAI or Google offer similar server-side context editing APIs? Should we design the abstraction to be provider-agnostic from the start?
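
For question 2, one option is to rescale client-side estimates by the last observed ratio between the server-reported input tokens and the client's own estimate. The class below is a hypothetical sketch, not existing Hermes code; it assumes the reported count comes from the response's usage field and uses integer arithmetic throughout.

```python
class TokenCalibrator:
    """Rescale client token estimates using the latest server-reported count."""

    def __init__(self) -> None:
        # Until the first observation, estimates pass through unchanged.
        self._estimated = 1
        self._reported = 1

    def update(self, estimated: int, reported: int) -> None:
        """Record the client's estimate and the server's count for one request."""
        if estimated > 0 and reported >= 0:
            self._estimated = estimated
            self._reported = reported

    def adjust(self, estimated: int) -> int:
        """Scale a new estimate by the last observed reported/estimated ratio."""
        return estimated * self._reported // self._estimated
```

If the client estimated 100,000 tokens and the server reported 60,000 after clearing, a later estimate of 50,000 is adjusted to 30,000, which would delay a compression trigger that raw client counting would fire too early.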
