fix(memory): review fork rebuilds system prompt, breaking prefix cache (~26% per-cycle / ~92% per-fork savings) #17089

Closed
WorldWriter wants to merge 1 commit into NousResearch:main from WorldWriter:fix/review-cache-share-system-prompt

@WorldWriter (Contributor) commented on Apr 28, 2026

~26% reduction on full cycle cost (10 main turns + 1 review fork) and ~92% on the review fork itself, in a typical 10-turn session against Sonnet 4.5. One-line fix: review fork now inherits parent's _cached_system_prompt instead of rebuilding it with a drifted timestamp that invalidates the entire prefix cache.

_spawn_background_review rebuilds the review fork's system prompt via _build_system_prompt, which calls hermes_time.now() at minute precision. The 1-minute drift between session start and review fire-time invalidates the Anthropic prefix cache for the entire messages_snapshot. This silently violates the project's own caching policy (see AGENTS.md):

"Hermes-Agent ensures caching remains valid throughout a conversation. Do NOT implement changes that would [...] reload memories or rebuild system prompts mid-conversation. Cache-breaking forces dramatically higher costs."

Empirical cache-mechanism verification (real Anthropic API + real hermes flow)

Driving AIAgent.run_conversation() against the live Anthropic API (Sonnet 4.5), captured via Messages.stream interception. One main turn primes the cache, then review fires with a 10-turn messages_snapshot. This isolates the system-prompt-drift effect: the only variable across the two review calls is whether the fix is applied.

| Call | input | cache_create | cache_read | output |
|---|---:|---:|---:|---:|
| Main agent (1 turn, primes cache) | 3 | 14,849 | 0 | 4 |
| Review WITHOUT fix | 3 | 17,386 | 0 | 7 |
| Review WITH fix | 3 | 2,864 | 14,522 | 7 |

Pricing (Sonnet 4.5, per 1M tokens): input $3, output $15, cache_write $3.75, cache_read $0.30.

The cache_read jumping from 0 to 14,522 confirms the fix restores prefix-cache hits exactly as predicted; the 14,522 cached tokens correspond to the system prompt that main has already cached. (This anchor is single-turn-prime, so main hadn't cached the messages yet — review still pays for those.)

Projected to a realistic 10-turn cycle

In real sessions main runs ~10 turns before the review fires (default nudge_interval=10), so the messages also enter main's cache. Review WITH fix then hits the whole prefix, not just system. Building the per-turn ledger from the empirical anchor — assuming each main turn adds 200 input + 500 output tokens of substantive work:

| Stage | input | cache_create | cache_read | output | $ |
|---|---:|---:|---:|---:|---:|
| Main 10 turns (cumulative) | 2,000 | 20,822 | 155,898 | 5,000 | $0.2057 |
| Review WITHOUT fix | 0 | 21,572 | 0 | 10 | $0.0810 |
| Review WITH fix | 0 | 50 | 21,522 | 10 | $0.0068 |

| | Main | Review | Cycle total |
|---|---:|---:|---:|
| WITHOUT fix | $0.2057 | $0.0810 | $0.2867 |
| WITH fix | $0.2057 | $0.0068 | $0.2125 |
| Saved | | $0.0742 | $0.0742 (~26%) |

Per-fork savings reach ~92% because by turn 10 main has pre-cached the system prompt plus the entire message history; with the fix, review pays only for REVIEW_PROMPT itself. Cycle savings sit at ~26%.
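The percentages above follow directly from the quoted prices and the projected token counts. The sketch below redoes that arithmetic; the helper name `call_cost` and the dictionary layout are illustrative choices rather than anything in Hermes, and the main-agent cost is taken as reported in the table instead of being recomputed.

```python
# Sonnet 4.5 prices in USD per 1M tokens, as quoted in this PR.
PRICE_PER_MTOK = {
    "input": 3.00,
    "output": 15.00,
    "cache_create": 3.75,
    "cache_read": 0.30,
}


def call_cost(tokens: dict) -> float:
    """Cost in USD of one API call, given token counts per billing category."""
    return sum(PRICE_PER_MTOK[kind] * count for kind, count in tokens.items()) / 1e6


# Projected review-fork token counts from the 10-turn table above.
review_without = call_cost({"input": 0, "cache_create": 21_572, "cache_read": 0, "output": 10})
review_with = call_cost({"input": 0, "cache_create": 50, "cache_read": 21_522, "output": 10})

MAIN_COST = 0.2057  # main agent's 10-turn cost, as reported in the table

fork_savings = 1 - review_with / review_without
cycle_savings = (review_without - review_with) / (MAIN_COST + review_without)

print(f"review without fix: ${review_without:.4f}")
print(f"review with fix:    ${review_with:.4f}")
print(f"per-fork savings:   {fork_savings:.0%}")
print(f"per-cycle savings:  {cycle_savings:.0%}")
```

Because the main agent's cost is identical in both scenarios, it dilutes the per-fork saving into the smaller per-cycle figure.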

Sensitivity to per-turn output verbosity:

| Per-turn output | Main 10-turn $ | Cycle savings |
|---|---:|---:|
| 100 tok | $0.146 | ~33% |
| 500 tok (baseline above) | $0.206 | ~26% |
| 1,000 tok | $0.281 | ~21% |
| 2,000 tok | $0.431 | ~15% |

Triggers fire every 10/20/30/... turns; review cost scales O(N²) with conversation length while main scales O(N), so cycle savings ratio rises rather than dilutes over long sessions.

The divergence is one character

Two AIAgent instances 65 s apart calling _build_system_prompt():

-Conversation started: Wednesday, April 29, 2026 12:57 AM
+Conversation started: Wednesday, April 29, 2026 12:58 AM

That single character is the entire bug surface. The fix makes review's _cached_system_prompt byte-identical to main's, restoring prefix cache hits.
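To make the failure mode concrete, here is a minimal sketch of a lazily built, minute-precision prompt. `ToyAgent`, the body of its `_build_system_prompt`, and the `strftime` format are assumptions for illustration, not the Hermes implementation; only the `_cached_system_prompt` attribute name and the "Conversation started" line come from this PR.

```python
from datetime import datetime, timedelta


class ToyAgent:
    """Hypothetical stand-in for AIAgent: builds its prompt lazily, once."""

    def __init__(self, now: datetime):
        self._now = now
        self._cached_system_prompt = None

    def _build_system_prompt(self) -> str:
        # Assumed format: minute precision, matching the diff above.
        stamp = self._now.strftime("%A, %B %d, %Y %I:%M %p")
        return f"You are a helpful agent.\nConversation started: {stamp}"

    def system_prompt(self) -> str:
        # Build only when nothing has been cached or inherited yet.
        if self._cached_system_prompt is None:
            self._cached_system_prompt = self._build_system_prompt()
        return self._cached_system_prompt


def count_diff_chars(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


start = datetime(2026, 4, 29, 0, 57, 30)
main = ToyAgent(start)
main_prompt = main.system_prompt()

# A fork created 65 s later that rebuilds its own prompt.
rebuilt = ToyAgent(start + timedelta(seconds=65))
rebuilt_prompt = rebuilt.system_prompt()

# A fork created 65 s later that inherits the parent's cached string.
inherited = ToyAgent(start + timedelta(seconds=65))
inherited._cached_system_prompt = main._cached_system_prompt
inherited_prompt = inherited.system_prompt()

print(count_diff_chars(main_prompt, rebuilt_prompt))  # differing characters
print(inherited_prompt == main_prompt)                # byte-identical or not
```

Since the cached prompt is an immutable `str`, assigning it to the fork shares no mutable state with the parent.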

Why safe

  • _cached_system_prompt is a str — no shared mutable state
  • Sharing parent's "Conversation started" timestamp is more correct semantically (it timestamps the conversation under review, not when review fires)
  • Mirrors existing pattern: _memory_store, _memory_enabled, _user_profile_enabled already inherited at the same site

End-to-end repro script (~$0.05 to run)
"""End-to-end hermes test: run_conversation against real Anthropic API."""
import os, sys, time, threading
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

import anthropic
records = []
records_lock = threading.Lock()
current_label = ["unknown"]
_orig_stream = anthropic.resources.messages.messages.Messages.stream

def patched_stream(self, **kw):
    # Wrap the SDK's stream context manager so token usage is recorded on exit.
    cm = _orig_stream(self, **kw)
    label = current_label[0]  # capture the label active when the call starts
    class _W:
        def __enter__(_i):
            _i._s = cm.__enter__(); return _i._s
        def __exit__(_i, *exc):
            try:
                u = _i._s.get_final_message().usage
                with records_lock:
                    records.append({"label": label, "input": u.input_tokens,
                        "cache_create": getattr(u, "cache_creation_input_tokens", 0) or 0,
                        "cache_read": getattr(u, "cache_read_input_tokens", 0) or 0})
            except Exception: pass  # usage capture is best-effort; never break the agent call
            return cm.__exit__(*exc)
    return _W()
anthropic.resources.messages.messages.Messages.stream = patched_stream

from run_agent import AIAgent

main = AIAgent(model="claude-sonnet-4-5-20250929",
               provider="anthropic", api_mode="anthropic_messages",
               base_url="https://api.anthropic.com",
               api_key=os.environ["ANTHROPIC_API_KEY"],
               quiet_mode=True, max_iterations=1)

current_label[0] = "main_turn_1"
main.run_conversation(user_message="Reply 'ok' only.", conversation_history=[])
time.sleep(65)  # force minute-precision timestamp drift

fake_history = []
for i in range(10):
    fake_history.append({"role": "user", "content": (f"Q{i}: " + ("blah " * 60)).strip()})
    fake_history.append({"role": "assistant", "content": (f"A{i}: " + ("yada " * 60)).strip()})
REVIEW_PROMPT = "Review the conversation above. Just say 'Nothing to save.' and stop."

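def make_review(apply_fix):
    # Note: enabled_toolsets is not passed here, so this fork gets main's full
    # toolset rather than the narrowed one used by the real spawn path
    # (see the closing comment on this PR).
    rev = AIAgent(model=main.model, provider=main.provider, api_mode=main.api_mode,
                  base_url=main.base_url, api_key=os.environ["ANTHROPIC_API_KEY"],
                  quiet_mode=True, max_iterations=1, parent_session_id=main.session_id)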
    rev._memory_store = main._memory_store
    rev._memory_enabled = main._memory_enabled
    rev._user_profile_enabled = main._user_profile_enabled
    rev._memory_nudge_interval = 0
    rev._skill_nudge_interval = 0
    if apply_fix:
        rev._cached_system_prompt = main._cached_system_prompt   # THE FIX
    return rev

current_label[0] = "review_NO_fix"
try: make_review(False).run_conversation(user_message=REVIEW_PROMPT, conversation_history=fake_history)
except Exception: pass

current_label[0] = "review_WITH_fix"
try: make_review(True).run_conversation(user_message=REVIEW_PROMPT, conversation_history=fake_history)
except Exception: pass

for r in records:
    print(f"{r['label']:20s}  input={r['input']:5d}  cache_create={r['cache_create']:6d}  cache_read={r['cache_read']:6d}")

Tested on macOS 14 (Darwin 24.6), Python 3.12.

The forked review agent currently rebuilds its system prompt from scratch,
producing a different 'Conversation started: ...' minute-precision
timestamp than the parent's cached prompt. This invalidates the Anthropic
prefix cache for the entire messages_snapshot, causing each background
review to re-pay the full input-token cost.

Empirically (Sonnet 4.5, ~4300-token prefix):
  - Without this fix: cache_create=4316, cache_read=0
  - With this fix:    cache_create=14,   cache_read=4302

~92% per-fork input-token cost reduction; savings scale O(N^2) with
conversation length (each fork rereads cumulative history).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alt-glitch added labels on Apr 28, 2026: type/perf (performance improvement or optimization), P2 (medium: degraded but workaround exists), comp/agent (core agent loop, run_agent.py, prompt builder), tool/memory (memory tool and memory providers)
@WorldWriter changed the title on Apr 29, 2026, from "fix(memory): share cached system prompt with review fork (~92% token savings)" to "fix(memory): review fork rebuilds system prompt, invalidating prefix cache (~77% token savings)"
@WorldWriter changed the title on Apr 29, 2026, from "fix(memory): review fork rebuilds system prompt, invalidating prefix cache (~77% token savings)" to "fix(memory): review fork rebuilds system prompt, breaking prefix cache (~26% per-cycle / ~92% per-fork savings)"
@WorldWriter (Contributor, Author) commented:

Self-closing — the fix doesn't actually deliver cache hits in the real spawn path.

End-to-end on this branch with _spawn_background_review firing naturally: review fork still gets cache_create≈88k, cache_read=0, same as without the fix.

Root cause I missed: review agent is built with enabled_toolsets=["memory","skills"] (4 tools) vs main's 16. Anthropic's cache key includes tools (before system in the hierarchy), so identical system bytes still miss when tools differ.
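The ordering argument can be illustrated with a toy model of prefix matching. This is a sketch under stated assumptions, not Anthropic's implementation: `prefix_keys` and `shared_prefix_segments` are hypothetical helpers, the tool lists are placeholders, and the real service matches at cache breakpoints rather than per segment. The only property carried over is that the cached prefix is ordered tools, then system, then messages.

```python
import hashlib
import json


def prefix_keys(tools, system, messages):
    """One cumulative hash per segment, in tools -> system -> messages order."""
    keys, running = [], hashlib.sha256()
    for segment in (tools, system, *messages):
        running.update(json.dumps(segment, sort_keys=True).encode("utf-8"))
        keys.append(running.copy().hexdigest())
    return keys


def shared_prefix_segments(a, b):
    """How many leading segments two requests have in common."""
    count = 0
    for key_a, key_b in zip(a, b):
        if key_a != key_b:
            break
        count += 1
    return count


# Placeholder data: 16 tools for main, a narrowed set of 4 for the review fork.
main_tools = [{"name": f"tool_{i}"} for i in range(16)]
review_tools = main_tools[:4]
system = "Conversation started: Wednesday, April 29, 2026 12:57 AM"
history = [{"role": "user", "content": "Q0"}, {"role": "assistant", "content": "A0"}]
review_msgs = history + [{"role": "user", "content": "Review the conversation above."}]

main_keys = prefix_keys(main_tools, system, history)

# Same system bytes, narrowed tools: the very first segment already differs.
narrowed = shared_prefix_segments(main_keys, prefix_keys(review_tools, system, review_msgs))

# Same tools, drifted system prompt: only the tools segment matches.
drifted = shared_prefix_segments(
    main_keys, prefix_keys(main_tools, system.replace("12:57", "12:58"), review_msgs)
)

# Same tools and same system: tools, system, and the whole shared history match.
aligned = shared_prefix_segments(main_keys, prefix_keys(main_tools, system, review_msgs))

print(narrowed, drifted, aligned)
```

In this model a mismatch in an earlier segment hides every later match, which is why fixing the system prompt alone cannot produce cache reads while the tool list still differs.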

My earlier "E2E" scripts manually built the review fork without passing enabled_toolsets — fork accidentally inherited main's full toolset, cache hit, false positive. The numbers in the PR body don't hold.

If I find a clean way to address the tools dimension, I'll open a fresh PR.

@WorldWriter (Contributor, Author) commented:

Follow-up filed: #17276 — addresses both the system-prompt drift (this PR's original scope) and the tools-schema mismatch I missed here. Real E2E shows cache_read 0 → 94,404 on the review fork (~89% per-call, ~26% per-run cost reduction).

@WorldWriter WorldWriter deleted the fix/review-cache-share-system-prompt branch May 8, 2026 23:29
teknium1 pushed a commit that referenced this pull request May 14, 2026
Background review fork is supposed to hit Anthropic's prefix cache on the
parent's messages_snapshot, but currently doesn't (cache_read=0 on every
fork). Two root causes, fixed in this commit:

1. System prompt is rebuilt at fork time. _cached_system_prompt starts as
   None, so run_conversation calls _build_system_prompt, which embeds a
   minute-precision "Conversation started: ..." timestamp. Reviews fire
   10+ turns after session start, so the minute differs from main's,
   producing a 1-character diff that invalidates the byte-exact cache key.
   Fix: inherit the parent's _cached_system_prompt directly (same idea as
   #17089, which was self-closed for only fixing this half).

2. Tools schema was narrowed via enabled_toolsets=["memory","skills"] for
   safety. Anthropic's cache key includes `tools`, which sits before
   `system` in the cache hierarchy, so even byte-identical `system` won't
   hit when `tools` differs from main's full set.
   Fix: drop the schema-level restriction so `tools` matches main, and
   deny non-whitelisted tools at runtime via the existing
   get_pre_tool_call_block_message gate (hermes_cli/plugins.py:1085,
   already called at all three dispatch sites). Install/clear a thread-
   local whitelist (added in the previous commit) on the daemon thread.
   Append a soft constraint to the review prompt so the model knows.

Real E2E on Sonnet 4.5 (12-tool task + auto-triggered review):
- Per review-call cost: $0.331 → $0.035 (~89% reduction)
- End-to-end per run:   $0.848 → $0.629 (~26% reduction)
- Review fork cache_create / cache_read: 88,385 / 0  →  1,234 / 94,404

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
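
The runtime gate described in item 2 of the commit message above can be sketched roughly as follows. Everything here is hypothetical: `set_tool_whitelist`, `clear_tool_whitelist`, `block_message`, and the tool names are invented for illustration and are not the signatures of `get_pre_tool_call_block_message` or any other Hermes function. The sketch only shows the general pattern of a thread-local allow-list that is installed on the review thread, consulted before each tool dispatch, and cleared afterwards.

```python
import threading
from typing import Iterable, Optional

_review_state = threading.local()  # hypothetical per-thread storage


def set_tool_whitelist(names: Iterable[str]) -> None:
    """Install an allow-list that applies only to the calling thread."""
    _review_state.allowed = frozenset(names)


def clear_tool_whitelist() -> None:
    """Remove the calling thread's allow-list."""
    _review_state.allowed = None


def block_message(tool_name: str) -> Optional[str]:
    """Return a denial message if the tool is blocked on this thread, else None."""
    allowed = getattr(_review_state, "allowed", None)
    if allowed is None or tool_name in allowed:
        return None
    return f"Tool '{tool_name}' is not permitted in the review fork."


def run_review(results: dict) -> None:
    # The full tool schema is still sent to the model; only dispatch is gated.
    set_tool_whitelist({"memory", "skill_manage"})
    try:
        results["memory"] = block_message("memory")
        results["terminal"] = block_message("terminal")
    finally:
        clear_tool_whitelist()


results = {}
worker = threading.Thread(target=run_review, args=(results,), daemon=True)
worker.start()
worker.join()

# The main thread never installed a whitelist, so nothing is blocked there.
main_thread_result = block_message("terminal")
print(results, main_thread_result)
```

Keeping the restriction out of the request payload is what lets `tools` stay byte-identical to the parent's, at the cost of the model occasionally attempting a call that is then refused.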
sunJose pushed a commit to sunJose/hermes-agent that referenced this pull request May 14, 2026
jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
AlexFoxD pushed a commit to AlexFoxD/hermes-agent that referenced this pull request May 21, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026