Skip to content

feat(caching): multi-block system prompt with tiered TTLs (v2)#5713

Open
Deland78 wants to merge 2 commits into
NousResearch:mainfrom
Deland78:feat/prompt-caching-v2
Open

feat(caching): multi-block system prompt with tiered TTLs (v2)#5713
Deland78 wants to merge 2 commits into
NousResearch:mainfrom
Deland78:feat/prompt-caching-v2

Conversation

@Deland78

@Deland78 Deland78 commented Apr 7, 2026

Copy link
Copy Markdown

Summary

Refactor Anthropic prompt caching to use a structured multi-block system prompt with per-block cache_control markers instead of a single monolithic system message. This maximizes cache hits by isolating volatile content (timestamps, platform hints) from stable content (identity, skills, memory).

Architecture

The system prompt is now assembled as three SystemPromptBlock instances with different cache TTLs:

Block TTL Contents
static 1h Soul.md / default identity, tool-aware guidance (memory, session_search, skills), Nous subscription prompt, tool-use enforcement, model-specific operational guidance (Google/OpenAI), skills system prompt
session 5m Custom system_message, memory store blocks (memory + user), external memory provider block, context files (AGENTS.md/CLAUDE.md/etc.)
ephemeral none Timestamp + session/model/provider line, Alibaba identity workaround, platform hints

At API call time, blocks are converted to Anthropic content block format (`[{type: text, text: ..., cache_control: ...}, ...]`) and sent as the system message. Non-caching models fall through to the flat-string path unchanged.

New public API in `agent/prompt_caching.py`

  • `SystemPromptBlock`, `CacheMetrics`, `AggregatedCacheMetrics` dataclasses
  • `build_system_content_blocks(blocks)` — convert blocks to Anthropic format
  • `apply_anthropic_cache_control_v2(messages, tools, cache_ttl, native_anthropic)` — multi-block + tool caching with budget management (max 4 breakpoints across tools + system + messages)
  • `extract_cache_metrics(usage, api_mode)` — per-call cache extraction supporting both native Anthropic (`cache_read_input_tokens`, `cache_creation_input_tokens`) and OpenRouter (`prompt_tokens_details.cached_tokens`) response formats
  • `aggregate_cache_metrics(metrics_list)` — cross-turn aggregation

The v1 `apply_anthropic_cache_control` function and `_apply_cache_marker` helper are preserved unchanged for backward compatibility.

Integration in `run_agent.py`

  • New `_build_system_prompt_blocks()` method assembles the three tiered blocks and caches them on `self._cached_system_blocks`
  • The existing `_build_system_prompt()` method still returns a flat string (for backward compatibility with code paths that expect one) but now delegates to the block builder
  • Cached blocks are invalidated on context compression (`_cached_system_blocks = None` alongside `_cached_system_prompt = None`)
  • At API call time, when `_use_prompt_caching` is enabled and `_cached_system_blocks` is populated, a multi-block path builds `{role: system, content: [...]}` with cache_control markers already set per block
  • Plugin turn context (`_plugin_turn_context`) remains reserved for future system-level plugin instructions; plugin context from pre_llm_call hooks still goes into user messages (unchanged)
  • Fallback flat-string path handles non-caching models and pre-structured content correctly

Test coverage

  • `tests/agent/test_prompt_caching.py` — 46 unit tests covering v1 (preserved) and v2 functions: data structures, cache markers, content block conversion, pre-structured detection, breakpoint budgeting, metrics extraction and aggregation
  • `tests/agent/test_prompt_caching_v2.py` — 38 additional integration tests for v2 behavior (tool caching interaction with system blocks, budget with pre-structured content, backward compatibility with v1 code paths)
  • `tests/test_prompt_caching_integration.py` — 10 integration tests against `run_agent.py` block assembly (three-block structure, tier TTLs, timestamp in ephemeral block only, cache invalidation, backward-compat string return, non-caching models unaffected)

Verified: 317 tests passing (all of the above plus `tests/test_run_agent.py` regression suite).

Test plan

  • All new v2 unit tests pass (`pytest tests/agent/test_prompt_caching.py tests/agent/test_prompt_caching_v2.py`)
  • Integration tests against `run_agent.py` block assembly pass (`pytest tests/test_prompt_caching_integration.py`)
  • Full run_agent.py regression suite passes (`pytest tests/test_run_agent.py`)
  • `run_agent` imports cleanly
  • Manual: verify cache hit rate improves on a multi-turn conversation with stable context files (reviewer action)
  • Manual: verify non-caching models (e.g. local Ollama) still work via flat-string fallback (reviewer action)

Platforms tested

Linux (WSL2, Ubuntu 22.04), Python 3.11

🤖 Generated with Claude Code

Deland78 and others added 2 commits April 6, 2026 21:50
Refactor prompt caching to use structured SystemPromptBlocks with
per-block cache_control markers instead of a single monolithic system
prompt. This maximizes Anthropic prompt cache hits by isolating volatile
content (timestamps, platform hints) from stable content (identity,
skills, memory).

Architecture:
  - static block  (1h TTL): identity, tool guidance, skills, model-specific
                            guidance — cross-session stable
  - session block (5m TTL): memory, context files, custom system_message —
                            session-stable
  - ephemeral block (none): timestamp, platform hints, alibaba workaround —
                            changes per-turn

New public API in agent/prompt_caching.py:
  - SystemPromptBlock, CacheMetrics, AggregatedCacheMetrics dataclasses
  - build_system_content_blocks() — convert blocks to Anthropic format
  - apply_anthropic_cache_control_v2() — multi-block + tool caching
  - extract_cache_metrics() — per-call cache extraction (native + OpenRouter)
  - aggregate_cache_metrics() — cross-turn aggregation

In run_agent.py:
  - _build_system_prompt_blocks() assembles the three tiered blocks and
    caches them on self._cached_system_blocks
  - At API call time, blocks are converted to content blocks with
    cache_control markers and sent as the system message
  - Falls back to flat-string path for non-caching models
  - Plugin context stays in user messages (unchanged from v1)

Test coverage:
  - tests/agent/test_prompt_caching.py — 46 unit tests covering all v2
    functions (data structures, marker building, content block conversion,
    pre-structured detection, breakpoint budgeting, metrics)
  - tests/agent/test_prompt_caching_v2.py — 38 additional tests for v2
    integration (tool caching, budget interaction, backward compat)
  - tests/test_prompt_caching_integration.py — 10 integration tests against
    run_agent.py block assembly (tier structure, cache invalidation,
    backward compat with v1 code paths)

Verified: 317 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@markojak

Copy link
Copy Markdown

Linking this into the #17459 direction.

The overall cache architecture here may still be useful, but please keep it aligned with the simpler rule from #17459/#17476: stable cached prompt/cacheable prefix, volatile current time in ephemeral runtime/user-message/tool context.

This PR should not be required as a prerequisite for fixing the immediate duplicate-tool cache bug (#17335), and it should not introduce hidden quiet-hours/control-plane policy.

@alt-glitch alt-glitch added type/perf Performance improvement or optimization P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder provider/anthropic Anthropic native Messages API labels Apr 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists provider/anthropic Anthropic native Messages API type/perf Performance improvement or optimization

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants