fix(memory): review fork rebuilds system prompt, breaking prefix cache (~26% per-cycle / ~92% per-fork savings) by WorldWriter · Pull Request #17089 · NousResearch/hermes-agent

WorldWriter · 2026-04-28T16:42:17Z

~26% reduction on full cycle cost (10 main turns + 1 review fork) and ~92% on the review fork itself, in a typical 10-turn session against Sonnet 4.5. One-line fix: review fork now inherits parent's _cached_system_prompt instead of rebuilding it with a drifted timestamp that invalidates the entire prefix cache.

_spawn_background_review rebuilds the review fork's system prompt via _build_system_prompt, which calls hermes_time.now() at minute precision. The 1-minute drift between session start and review fire-time invalidates the Anthropic prefix cache for the entire messages_snapshot. This silently violates the project's own caching policy (see AGENTS.md):

"Hermes-Agent ensures caching remains valid throughout a conversation. Do NOT implement changes that would [...] reload memories or rebuild system prompts mid-conversation. Cache-breaking forces dramatically higher costs."

Empirical cache-mechanism verification (real Anthropic API + real hermes flow)

Driving AIAgent.run_conversation() against the live Anthropic API (Sonnet 4.5), captured via Messages.stream interception. One main turn primes the cache, then review fires with a 10-turn messages_snapshot. This isolates the system-prompt-drift effect: the only variable across the two review calls is whether the fix is applied.

Call	input	`cache_create`	`cache_read`	output
Main agent (1 turn, primes cache)	3	14,849	0	4
Review WITHOUT fix	3	17,386	0	7
Review WITH fix	3	2,864	14,522	7

Pricing (Sonnet 4.5, per 1M tokens): input $3, output $15, cache_write $3.75, cache_read $0.30.

The cache_read jumping from 0 to 14,522 confirms the fix restores prefix-cache hits exactly as predicted; the 14,522 cached tokens correspond to the system prompt that main has already cached. (This anchor is single-turn-prime, so main hadn't cached the messages yet — review still pays for those.)

Projected to a realistic 10-turn cycle

In real sessions main runs ~10 turns before the review fires (default nudge_interval=10), so the messages also enter main's cache. With the fix, the review then hits the whole prefix, not just the system prompt. The per-turn ledger below is projected from the empirical anchor, assuming each main turn adds 200 input + 500 output tokens of substantive work:

Stage	input	cache_create	cache_read	output	$
Main 10 turns (cumulative)	2,000	20,822	155,898	5,000	$0.2057
Review WITHOUT fix	0	21,572	0	10	$0.0810
Review WITH fix	0	50	21,522	10	$0.0068

	Main	Review	Cycle total
WITHOUT fix	$0.2057	$0.0810	$0.2867
WITH fix	$0.2057	$0.0068	$0.2125
Saved	—	$0.0742	$0.0742 (~26%)

Per-fork savings reach ~92% because by turn 10 main has pre-cached the system prompt plus the entire message history, so a review with the fix pays only for REVIEW_PROMPT itself. Cycle savings sit at ~26%.
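The review rows of the ledger can be re-derived from the token counts and the quoted Sonnet 4.5 prices. The following is a minimal sketch of that arithmetic; the `call_cost` helper and the dictionary layout are illustrative assumptions, not code from the PR.

```python
# Sonnet 4.5 prices as quoted in this PR, in USD per 1M tokens.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00, "cache_create": 3.75, "cache_read": 0.30}

def call_cost(usage):
    """Return the USD cost of one call given its token count per category."""
    return sum(PRICE_PER_MTOK[kind] * tokens for kind, tokens in usage.items()) / 1_000_000

# Token counts copied from the projected 10-turn ledger above.
review_without_fix = call_cost({"input": 0, "cache_create": 21_572, "cache_read": 0, "output": 10})
review_with_fix = call_cost({"input": 0, "cache_create": 50, "cache_read": 21_522, "output": 10})

# Fraction of the per-fork cost that the fix removes.
per_fork_saving = 1 - review_with_fix / review_without_fix

print(f"review without fix: ${review_without_fix:.4f}")
print(f"review with fix:    ${review_with_fix:.4f}")
print(f"per-fork saving:    {per_fork_saving:.1%}")
```

Almost all of the difference comes from moving tokens from the cache_create price to the cache_read price, which is 12.5 times cheaper per token at these rates.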

Sensitivity to per-turn output verbosity:

Per-turn output	Main 10-turn $	Cycle savings
100 tok	$0.146	~33%
500 tok (baseline above)	$0.206	~26%
1,000 tok	$0.281	~21%
2,000 tok	$0.431	~15%

Triggers fire every 10/20/30/... turns; review cost scales O(N²) with conversation length while main scales O(N), so cycle savings ratio rises rather than dilutes over long sessions.

The divergence is one character

Two AIAgent instances 65 s apart calling _build_system_prompt():

-Conversation started: Wednesday, April 29, 2026 12:57 AM
+Conversation started: Wednesday, April 29, 2026 12:58 AM

That single character is the entire bug surface. The fix makes review's _cached_system_prompt byte-identical to main's, restoring prefix cache hits.
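The drift can be reproduced without any API call. The sketch below formats two timestamps 65 seconds apart at minute precision and lists the positions where the resulting lines differ. The `strftime` pattern is an assumption inferred from the quoted diff, not the actual `hermes_time` or `_build_system_prompt` code, and the session start time is hypothetical. It also assumes the default English locale for day and month names.

```python
from datetime import datetime, timedelta

def started_line(moment):
    """Format a minute-precision line shaped like the one in the diff above."""
    return "Conversation started: " + moment.strftime("%A, %B %d, %Y %I:%M %p")

def differing_positions(a, b):
    """Return the indices at which two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

session_start = datetime(2026, 4, 29, 0, 57, 30)      # hypothetical start time
review_fires = session_start + timedelta(seconds=65)  # crosses a minute boundary

main_line = started_line(session_start)
review_line = started_line(review_fires)
diff = differing_positions(main_line, review_line)

print(main_line)
print(review_line)
print("differing positions:", diff)
```

Because the prefix cache needs the prompt bytes to match exactly, one differing position early in the prompt is enough to miss on everything after it.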

Why safe

- _cached_system_prompt is a str, so there is no shared mutable state
- Sharing the parent's "Conversation started" timestamp is more correct semantically (it timestamps the conversation under review, not when the review fires)
- Mirrors the existing pattern: _memory_store, _memory_enabled, _user_profile_enabled are already inherited at the same site

End-to-end repro script (~$0.05 to run)

"""End-to-end hermes test: run_conversation against real Anthropic API."""
import os, sys, time, threading
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

import anthropic
records = []
records_lock = threading.Lock()
current_label = ["unknown"]
_orig_stream = anthropic.resources.messages.messages.Messages.stream

def patched_stream(self, **kw):
    cm = _orig_stream(self, **kw)
    label = current_label[0]
    class _W:
        def __enter__(_i):
            _i._s = cm.__enter__(); return _i._s
        def __exit__(_i, *exc):
            try:
                u = _i._s.get_final_message().usage
                with records_lock:
                    records.append({"label": label, "input": u.input_tokens,
                        "cache_create": getattr(u, "cache_creation_input_tokens", 0) or 0,
                        "cache_read": getattr(u, "cache_read_input_tokens", 0) or 0})
            except Exception: pass
            return cm.__exit__(*exc)
    return _W()
anthropic.resources.messages.messages.Messages.stream = patched_stream

from run_agent import AIAgent

main = AIAgent(model="claude-sonnet-4-5-20250929",
               provider="anthropic", api_mode="anthropic_messages",
               base_url="https://api.anthropic.com",
               api_key=os.environ["ANTHROPIC_API_KEY"],
               quiet_mode=True, max_iterations=1)

current_label[0] = "main_turn_1"
main.run_conversation(user_message="Reply 'ok' only.", conversation_history=[])
time.sleep(65)  # force minute-precision timestamp drift

fake_history = []
for i in range(10):
    fake_history.append({"role": "user", "content": (f"Q{i}: " + ("blah " * 60)).strip()})
    fake_history.append({"role": "assistant", "content": (f"A{i}: " + ("yada " * 60)).strip()})
REVIEW_PROMPT = "Review the conversation above. Just say 'Nothing to save.' and stop."

def make_review(apply_fix):
    rev = AIAgent(model=main.model, provider=main.provider, api_mode=main.api_mode,
                  base_url=main.base_url, api_key=os.environ["ANTHROPIC_API_KEY"],
                  quiet_mode=True, max_iterations=1, parent_session_id=main.session_id)
    rev._memory_store = main._memory_store
    rev._memory_enabled = main._memory_enabled
    rev._user_profile_enabled = main._user_profile_enabled
    rev._memory_nudge_interval = 0
    rev._skill_nudge_interval = 0
    if apply_fix:
        rev._cached_system_prompt = main._cached_system_prompt   # THE FIX
    return rev

current_label[0] = "review_NO_fix"
try: make_review(False).run_conversation(user_message=REVIEW_PROMPT, conversation_history=fake_history)
except Exception: pass

current_label[0] = "review_WITH_fix"
try: make_review(True).run_conversation(user_message=REVIEW_PROMPT, conversation_history=fake_history)
except Exception: pass

for r in records:
    print(f"{r['label']:20s}  input={r['input']:5d}  cache_create={r['cache_create']:6d}  cache_read={r['cache_read']:6d}")

Tested on macOS 14 (Darwin 24.6), Python 3.12.

The forked review agent currently rebuilds its system prompt from scratch, producing a different 'Conversation started: ...' minute-precision timestamp than the parent's cached prompt. This invalidates the Anthropic prefix cache for the entire messages_snapshot, causing each background review to re-pay the full input-token cost.

Empirically (Sonnet 4.5, ~4300-token prefix):
- Without this fix: cache_create=4316, cache_read=0
- With this fix: cache_create=14, cache_read=4302

~92% per-fork input-token cost reduction; savings scale O(N^2) with conversation length (each fork rereads cumulative history).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

WorldWriter · 2026-04-29T03:14:13Z

Self-closing — the fix doesn't actually deliver cache hits in the real spawn path.

End-to-end on this branch with _spawn_background_review firing naturally: review fork still gets cache_create≈88k, cache_read=0, same as without the fix.

Root cause I missed: review agent is built with enabled_toolsets=["memory","skills"] (4 tools) vs main's 16. Anthropic's cache key includes tools (before system in the hierarchy), so identical system bytes still miss when tools differ.
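Why a tools mismatch defeats an identical system prompt can be shown with a toy model of prefix matching. This is only a sketch under the assumption that the prefix is compared in the order tools, then system, then messages, and that matching stops at the first difference. It is not Anthropic's implementation, and the tool names and segment granularity are placeholders.

```python
import json

def prefix_segments(tools, system, messages):
    """Serialize a request into ordered segments: tools, then system, then messages."""
    segments = [json.dumps(tools, sort_keys=True), system]
    segments += [json.dumps(message, sort_keys=True) for message in messages]
    return segments

def shared_prefix_len(cached, incoming):
    """Count leading segments that match exactly; the first mismatch ends the match."""
    count = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        count += 1
    return count

system = "Conversation started: Wednesday, April 29, 2026 12:57 AM"
history = [{"role": "user", "content": "Q0"}, {"role": "assistant", "content": "A0"}]
review_turn = [{"role": "user", "content": "Review the conversation above."}]

main_tools = [{"name": f"tool_{i}"} for i in range(16)]  # placeholder for main's full set
review_tools = [{"name": "memory"}, {"name": "skills"}]  # placeholder for the narrowed set

cached = prefix_segments(main_tools, system, history)
fork_same_tools = prefix_segments(main_tools, system, history + review_turn)
fork_narrowed = prefix_segments(review_tools, system, history + review_turn)

hits_same_tools = shared_prefix_len(cached, fork_same_tools)
hits_narrowed = shared_prefix_len(cached, fork_narrowed)
print("matching segments with main's tools:", hits_same_tools)
print("matching segments with narrowed tools:", hits_narrowed)
```

In this model the narrowed fork matches nothing even though its system prompt and history are byte-identical to main's, which is consistent with the cache_read=0 observed in the real spawn path.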

My earlier "E2E" scripts manually built the review fork without passing enabled_toolsets — fork accidentally inherited main's full toolset, cache hit, false positive. The numbers in the PR body don't hold.

If I find a clean way to address the tools dimension, I'll open a fresh PR.

WorldWriter · 2026-04-29T05:06:41Z

Follow-up filed: #17276 — addresses both the system-prompt drift (this PR's original scope) and the tools-schema mismatch I missed here. Real E2E shows cache_read 0 → 94,404 on the review fork (~89% per-call, ~26% per-run cost reduction).

Background review fork is supposed to hit Anthropic's prefix cache on the parent's messages_snapshot, but currently doesn't (cache_read=0 on every fork). Two root causes, fixed in this commit:

1. System prompt is rebuilt at fork time. _cached_system_prompt starts as None, so run_conversation calls _build_system_prompt, which embeds a minute-precision "Conversation started: ..." timestamp. Reviews fire 10+ turns after session start, so the minute differs from main's, producing a 1-character diff that invalidates the byte-exact cache key. Fix: inherit the parent's _cached_system_prompt directly (same idea as #17089, which was self-closed for only fixing this half).

2. Tools schema was narrowed via enabled_toolsets=["memory","skills"] for safety. Anthropic's cache key includes `tools`, which sits before `system` in the cache hierarchy, so even byte-identical `system` won't hit when `tools` differs from main's full set. Fix: drop the schema-level restriction so `tools` matches main, and deny non-whitelisted tools at runtime via the existing get_pre_tool_call_block_message gate (hermes_cli/plugins.py:1085, already called at all three dispatch sites). Install/clear a thread-local whitelist (added in the previous commit) on the daemon thread. Append a soft constraint to the review prompt so the model knows.

Real E2E on Sonnet 4.5 (12-tool task + auto-triggered review):
- Per review-call cost: $0.331 → $0.035 (~89% reduction)
- End-to-end per run: $0.848 → $0.629 (~26% reduction)
- Review fork cache_create / cache_read: 88,385 / 0 → 1,234 / 94,404

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
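The runtime-gating idea in point 2 can be sketched with `threading.local`. Everything named below (`install_whitelist`, `clear_whitelist`, `block_message`, the "terminal" tool) is hypothetical and only illustrates the pattern of a per-thread whitelist checked before dispatch. It is not the code in hermes_cli/plugins.py or the signature of get_pre_tool_call_block_message.

```python
import threading

_review_state = threading.local()  # hypothetical per-thread storage

def install_whitelist(names):
    """Restrict the current thread to the given tool names."""
    _review_state.allowed = frozenset(names)

def clear_whitelist():
    """Remove any restriction from the current thread."""
    _review_state.allowed = None

def block_message(tool_name):
    """Return a denial message if this thread's whitelist excludes tool_name, else None."""
    allowed = getattr(_review_state, "allowed", None)
    if allowed is not None and tool_name not in allowed:
        return f"Tool '{tool_name}' is not available during background review."
    return None

def review_worker(results):
    """Simulate the review thread: install the whitelist, probe two tools, clean up."""
    install_whitelist({"memory", "skills"})
    try:
        results["terminal"] = block_message("terminal")
        results["memory"] = block_message("memory")
    finally:
        clear_whitelist()

results = {}
worker = threading.Thread(target=review_worker, args=(results,), daemon=True)
worker.start()
worker.join()

# The main thread never installed a whitelist, so nothing is blocked there.
main_thread_verdict = block_message("terminal")
print(results, main_thread_verdict)
```

Keeping the restriction in thread-local state leaves the `tools` schema sent to the API identical to main's, while the parent agent's own tool calls on other threads are unaffected.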

alt-glitch added labels type/perf, P2, comp/agent, tool/memory on Apr 28, 2026

WorldWriter changed the title ~~fix(memory): share cached system prompt with review fork (~92% token savings)~~ fix(memory): review fork rebuilds system prompt, invalidating prefix cache (~77% token savings) Apr 29, 2026

WorldWriter changed the title ~~fix(memory): review fork rebuilds system prompt, invalidating prefix cache (~77% token savings)~~ fix(memory): review fork rebuilds system prompt, breaking prefix cache (~26% per-cycle / ~92% per-fork savings) Apr 29, 2026

WorldWriter closed this Apr 29, 2026

WorldWriter mentioned this pull request Apr 29, 2026

fix(memory): restore prefix cache hits in background review fork (~26% token saving per run) #17276

Closed

WorldWriter deleted the fix/review-cache-share-system-prompt branch May 8, 2026 23:29

WorldWriter wants to merge 1 commit into NousResearch:main from WorldWriter:fix/review-cache-share-system-prompt
