fix(prompt): stabilize system prompt prefix for KV cache reuse by DaMoot · Pull Request #18547 · NousResearch/hermes-agent

DaMoot · 2026-05-01T20:06:25Z

fix(prompt): stabilize system prompt prefix for KV cache reuse

Summary

Two small changes to system prompt construction eliminate spurious KV cache invalidation between turns. The volatile content being removed carries no semantic value for the model, yet it forced an unnecessary cold re-prefill on every turn in which it changed.

The effect is most pronounced on llama.cpp servers running hybrid attention models (e.g., the Qwen3.5 family with Gated DeltaNet), where --cache-reuse is unavailable and cache restoration relies on byte-identical prefix matching against stored checkpoints.

Root Cause

llama.cpp's KV cache checkpoint restoration matches the new prompt's prefix against existing checkpoints. When the prefix mutates — even by a single character — the matched span shrinks and downstream tokens have to be re-prefilled from the divergence point.
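To make the cost concrete, here is a minimal sketch of prefix matching over plain strings. It is not llama.cpp's implementation, which works on token sequences and stored checkpoints; `common_prefix_len` and the sample prompts are illustrative assumptions.

```python
def common_prefix_len(old: str, new: str) -> int:
    """Return the number of leading characters shared by both strings."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n


# Hypothetical prompts: only the minute in the leading timestamp differs.
tail = "\n<long tool instructions...>" * 100
old_prompt = "Conversation started: 03:41 PM" + tail
new_prompt = "Conversation started: 03:42 PM" + tail

kept = common_prefix_len(old_prompt, new_prompt)
reusable_fraction = kept / len(new_prompt)
print(f"shared prefix: {kept} chars, reusable fraction: {reusable_fraction:.3f}")
```

Because the mutation sits near the front of the prompt, almost nothing after it can be matched, even though the remaining text is identical.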

Two pieces of system prompt content were mutating between turns despite no semantic change:

1. tools/memory_tool.py:_render_block() included character-count usage indicators in section headers:

MEMORY (your personal notes) [49% — 1,085/2,200 chars]

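Because the header embeds the exact character count, any memory write that changed the content length altered the header bytes, whether or not it also bumped the percentage by one point or shifted the comma position, and so invalidated all downstream cache.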

2. run_agent.py:_build_system_prompt() regenerated the conversation timestamp via now() on every API call, despite the value semantically representing session start (which doesn't change within a session). Each minute rollover invalidated the cache.
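The first source of volatility can be reproduced in isolation. The sketch below mirrors the header format string removed by this PR; the function names are hypothetical, and `\u2014` stands in for the dash that appears in the original header.

```python
def header_with_usage(current: int, limit: int) -> str:
    """Pre-fix style header: embeds the percentage and exact character counts."""
    pct = min(100, int((current / limit) * 100)) if limit > 0 else 0
    return f"MEMORY (your personal notes) [{pct}% \u2014 {current:,}/{limit:,} chars]"


def header_stable() -> str:
    """Post-fix style header: constant for the lifetime of the process."""
    return "MEMORY (your personal notes)"


before = header_with_usage(1085, 2200)
after = header_with_usage(1086, 2200)  # a write that adds one character

print(before)
print(after)
print("volatile header changed:", before != after)
print("stable header changed:", header_stable() != header_stable())
```

In this sketch a one-character write leaves the percentage at 49% but still changes the header, since the raw count is part of the string.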

Changes

`run_agent.py`

Use self.session_start for the timestamp instead of now():

-        from hermes_time import now as _hermes_now
-        now = _hermes_now()
-        timestamp_line = f"Conversation started: {now.strftime(...)}"
+        # Use session_start timestamp for stability across turns within the same session.
+        # Regenerating the timestamp on every API call would invalidate the KV cache
+        # prefix even though the session hasn't actually started at a different time.
+        timestamp_line = f"Conversation started: {self.session_start.strftime(...)}"

session_start is what the line semantically claims to display anyway — this is also a correctness fix.
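A minimal sketch of the difference, assuming a stand-in class rather than the real agent. `PromptBuilder` and the format string are hypothetical; the diff elides the actual `strftime` format, so a placeholder with minute precision is used here.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"  # placeholder format, not the one used in run_agent.py


class PromptBuilder:
    """Hypothetical stand-in for the agent; only the timestamp logic is modelled."""

    def __init__(self, session_start: datetime) -> None:
        self.session_start = session_start  # captured once per session

    def timestamp_line_volatile(self, now: datetime) -> str:
        # Pre-fix behaviour: depends on the time of each API call.
        return f"Conversation started: {now.strftime(FMT)}"

    def timestamp_line_stable(self) -> str:
        # Post-fix behaviour: depends only on the session start.
        return f"Conversation started: {self.session_start.strftime(FMT)}"


builder = PromptBuilder(datetime(2026, 5, 1, 20, 6))
turn_1 = datetime(2026, 5, 1, 20, 6, 30)
turn_2 = turn_1 + timedelta(minutes=1)

print(builder.timestamp_line_volatile(turn_1))
print(builder.timestamp_line_volatile(turn_2))  # differs after a minute rollover
print(builder.timestamp_line_stable())          # identical on every turn
```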

`tools/memory_tool.py`

Remove volatile usage indicators from memory block headers:

-        limit = self._char_limit(target)
         content = ENTRY_DELIMITER.join(entries)
-        current = len(content)
-        pct = min(100, int((current / limit) * 100)) if limit > 0 else 0
-
         if target == "user":
-            header = f"USER PROFILE (who the user is) [{pct}% — {current:,}/{limit:,} chars]"
+            header = "USER PROFILE (who the user is)"
         else:
-            header = f"MEMORY (your personal notes) [{pct}% — {current:,}/{limit:,} chars]"
+            header = "MEMORY (your personal notes)"

Usage statistics remain available in the memory tool's response payload via _success_response() — they're just no longer surfaced in the system prompt where they cost cache stability.
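The split between a static header and per-call statistics might look roughly like the following. This is a sketch only: `render_block`, `usage_payload`, the payload keys, and the delimiter value are assumptions for illustration, not the signatures or fields used in tools/memory_tool.py.

```python
from typing import Dict, List

ENTRY_DELIMITER = "\n---\n"  # assumption: the real delimiter is not shown in the PR


def render_block(target: str, entries: List[str]) -> str:
    """Render a memory block for the system prompt with a constant header."""
    content = ENTRY_DELIMITER.join(entries)
    if target == "user":
        header = "USER PROFILE (who the user is)"
    else:
        header = "MEMORY (your personal notes)"
    return f"{header}\n{content}"


def usage_payload(entries: List[str], limit: int) -> Dict[str, int]:
    """Compute usage statistics for a tool response (hypothetical field names)."""
    current = len(ENTRY_DELIMITER.join(entries))
    pct = min(100, int((current / limit) * 100)) if limit > 0 else 0
    return {"chars_used": current, "char_limit": limit, "percent_used": pct}


entries = ["a" * 10, "b" * 10]
print(render_block("memory", entries).splitlines()[0])
print(usage_payload(entries, 100))
```

The statistics still reach the model when it calls the tool, but they arrive in the conversation tail rather than in the cached prefix.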

Evidence

Tested on a llama.cpp b8683-d0a6dfeb2 server running Qwen3.5-27B-UD-Q5_K_XL on a Tesla V100-SXM2-32GB.

Before: llama-server logs during multi-turn conversations showed prefix similarity collapsing whenever a memory write or minute rollover shifted bytes in the prompt, despite no semantic change in the conversation:

srv  load: looking for better prompt, base f_keep = 0.000, sim = 0.012
slot update_slots: n_tokens = 0, memory_seq_rm [0, end)

That's a full cold prefill — for a 22K-token conversation prefix on the V100, roughly 24-30 seconds of wasted work per turn.

After: Successive turns in agentic tool-call chains maintain near-perfect prefix similarity:

sim_best = 0.987, f_keep = 0.987    ← turn 2
sim_best = 0.998, f_keep = 1.000    ← turn 3
sim_best = 0.999, f_keep = 1.000    ← turn 4
sim_best = 0.992, f_keep = 1.000    ← turn 5
sim_best = 0.994, f_keep = 1.000    ← turn 6
sim_best = 0.997, f_keep = 1.000    ← turn 7

Partial invalidation still occurs at semantically meaningful boundaries — e.g., a tool-category switch from email search to OCR dropped similarity to 0.683, which is correct behavior because the recent context genuinely changed.

Scope and Related Concerns

This PR addresses prompt-construction sources of cache invalidation only.

A related but architecturally distinct issue exists in agent/auxiliary_client.py: auxiliary tasks (title generation, summarization) default to provider: "auto", which routes them through the main model endpoint. On local llama.cpp deployments this destroys the main conversation's checkpoint state on every call (~1.2% prefix similarity between the title generation prompt and a typical conversation context). That's a separate concern that warrants its own PR — the right fix is probably to default auxiliary.* tasks to provider: none or to a separate endpoint when one is configured, with a logged warning if they're routed to the main endpoint.

In combined real-world testing on the setup described above, applying both these prompt fixes and the workaround auxiliary.title_generation.provider: none reduced multi-turn tool-call chain runtime from roughly 7-9 minutes to 3-4 minutes on equivalent workloads. The contribution of each fix has not been isolated, but both contribute meaningfully and both should land.

Backwards Compatibility

No breaking changes:

Memory headers remain present and human-readable, just without inline usage stats
Timestamp still displays (using session_start, which is what the line says it represents)
Tool response payloads unchanged
Memory usage stats remain available via the memory tool's response

Testing

Manual verification confirms the fixes are correctly applied:

# Confirm volatile counters removed from memory headers
grep -A15 "def _render_block" tools/memory_tool.py

# Confirm timestamp uses session_start
grep -B3 "timestamp_line = f\"Conversation started" run_agent.py

End-to-end validation via llama-server log analysis on multi-turn tool-call chains over a 26K-token context. Post-fix, cache invalidation occurs only at legitimate context boundaries (tool category change, context truncation), not on semantically inert prompt mutations.

To observe cache behaviour live on a running llama-server instance:

tail -f /var/log/llama-server.log | grep -E "memory_seq_rm|found better|f_keep|sim =|erased invalidated"

Healthy indicators: sim_best > 0.95 on consecutive turns, memory_seq_rm positions advancing rather than resetting to [0, end). Pathological indicator: sim = 0.0xx or memory_seq_rm [0, end) on turns where no semantic context shift occurred.
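The same triage can be scripted. The sketch below classifies individual log lines using the 0.95 threshold mentioned above; it assumes the log lines look like the excerpts quoted in this PR, and `classify` is a hypothetical helper, not part of llama.cpp or hermes-agent.

```python
import re
from typing import Optional

# Matches "sim = 0.012" as well as "sim_best = 0.987".
SIM_RE = re.compile(r"sim(?:_best)?\s*=\s*(\d+(?:\.\d+)?)")
FULL_RESET = "memory_seq_rm [0, end)"


def classify(line: str, healthy: float = 0.95) -> Optional[str]:
    """Label a llama-server log line, or return None if it is not relevant."""
    if FULL_RESET in line:
        return "cold-prefill"
    match = SIM_RE.search(line)
    if match is None:
        return None
    return "healthy" if float(match.group(1)) > healthy else "low-similarity"


samples = [
    "srv  load: looking for better prompt, base f_keep = 0.000, sim = 0.012",
    "slot update_slots: n_tokens = 0, memory_seq_rm [0, end)",
    "sim_best = 0.987, f_keep = 0.987",
]
for line in samples:
    print(classify(line), "|", line)
```

A low-similarity label is only a problem on turns where the context did not genuinely change, so the output still needs to be read alongside the conversation.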

Files Changed

tools/memory_tool.py — 6 lines removed, 5 lines added (header simplification + docstring)
run_agent.py — 3 lines removed, 4 lines added (timestamp source + comment)

Two pieces of system prompt content were mutating between turns despite no semantic change, invalidating the llama.cpp KV cache prefix and forcing unnecessary re-prefill on every turn:

1. Memory block headers (tools/memory_tool.py:_render_block) included character-count percentage indicators that shifted bytes whenever memory size changed enough to alter the percentage or comma placement.
2. The "Conversation started:" timestamp (run_agent.py:_build_system_prompt) was regenerated via now() on every API call, despite semantically representing session start.

This change uses self.session_start for the timestamp and removes the inline usage indicators from memory headers. Memory usage stats remain available in tool responses via _success_response().

Particularly impactful on hybrid attention models (e.g., Qwen3.5 family with Gated DeltaNet) where llama.cpp's --cache-reuse flag is unavailable and the cache depends on byte-identical prefix matching for checkpoint restoration.

On local sessions, this can save seconds per turn and several minutes over a multi-step task. When a tiny prompt detail changes near the front, llama.cpp may have to reread and recompute much of the prompt instead of reusing the cache. The main cost is repeated prefill work, with some extra memory traffic from rebuilding the KV cache.

Validated on Qwen3.5-27B on llama.cpp with log analysis showing post-fix turn-to-turn prefix similarity of 0.987-0.999 vs. pre-fix similarity collapsing to <0.05 on memory writes or minute rollovers.

Files modified:
- tools/memory_tool.py
- run_agent.py

alt-glitch · 2026-05-01T20:19:25Z

Likely duplicate of #8689 — same fix: use session_start instead of now() for system prompt timestamp stability. This PR also addresses the memory char-count header mutation which #8689 may not cover, but the timestamp fix is identical.

alt-glitch · 2026-05-01T20:19:56Z

Likely duplicate of #8689 — same fix: use session_start instead of now() for system prompt timestamp stability.

liuhao1024

The timestamp stabilization (self.session_start instead of _hermes_now()) is the correct and effective fix for KV cache reuse — nice catch.

However, the memory header percentage removal in _render_block does not contribute to KV cache stability. The _system_prompt_snapshot is frozen at load_from_disk() time (line 112: "frozen at load time, used for system prompt injection") and the percentage is a deterministic function of content length and char_limit — both of which are fixed once the snapshot is created. The percentage cannot change between turns within a session, so it was already stable.

Removing it is a net-negative change: it strips useful diagnostic information (memory usage pressure visible to the agent in the system prompt) with zero cache benefit. I would recommend reverting the _render_block changes in memory_tool.py and keeping only the timestamp fix in run_agent.py.

liuhao1024

Confirmed: self.session_start is initialized at run_agent.py:1600 (self.session_start = datetime.now()), so the attribute lookup is safe. The hermes_time.now import on the current main branch is at line 4949 inside _build_system_prompt — removing it is a clean local-only change.

Both fixes are correct:

Timestamp — "Conversation started: {now()}" was semantically wrong (it claimed to show when the conversation started but actually showed the current time on every turn). Using self.session_start is both a correctness fix and a cache stability win.
Memory headers — the [49% — 1,085/2,200 chars] indicators mutated on every memory write (even adding a single character could shift the percentage by 1 point or move a comma), invalidating the entire downstream KV prefix. Removing them from the system prompt while keeping them in _success_response() tool payloads is the right trade-off.

This also benefits inference servers beyond llama.cpp that do prefix caching (vLLM's automatic prefix caching, TGI's --prefix-caching, SGLang's RadixAttention) — any byte-identical prefix matching benefits from eliminating these mutations.

Community PRs applied:
- NousResearch#18596: Enable secret redaction by default (SECURITY)
- NousResearch#18650: Sanitize malformed tool messages + auto-recover on API 400
- NousResearch#18607: Emergency compression before max_iterations exhaustion
- NousResearch#18603: Compression fallback to main model on 413 rate limit
- NousResearch#18638: Pass threshold_percent on model switch
- NousResearch#18663: Strip extra_content from tool_calls for strict APIs
- NousResearch#18618: Forward explicit_api_key to OpenRouter
- NousResearch#18632: Show cache tokens in /insights breakdown
- NousResearch#18614: Add idempotency guard for patch duplicate loops
- NousResearch#18600: Raise ValueError when HERMES_HOME unset in profile mode
- NousResearch#18616: Allow ZWJ emoji in context files
- NousResearch#18582: Reload .env on /restart
- NousResearch#18547: Stabilize system prompt prefix for KV cache reuse
- NousResearch#18692: Strip FTS5 operators from session search truncation terms

Fix: Add order_by_last_active=True to list_sessions_rich call (pre-existing commit 142b4bf code sync)

@iamfoz

…ogging

The system prompt's 'Conversation started:' line carried minute precision (%I:%M %p), making it byte-unstable across every rebuild path. Within a CLI session the in-memory cache held, but on the gateway path (fresh AIAgent per turn → restore from session DB), any silent failure in the read or write path dropped the cache stem and forced a full re-prefill on every subsequent turn. Local prefix-caching backends (llama.cpp / vLLM) saw this as KV-cache invalidation; remote prefix-caching providers saw it as an Anthropic-style cache miss.

Three changes:

1. Date-only timestamp ('Sunday, May 17, 2026' instead of '... 03:42 PM'). System prompt now byte-stable for the full day. The model can still query exact time via tools when it actually needs it. Credit: @iamfoz (PR #20451).
2. Loud logging on session DB write failures. The update_system_prompt call used to log at DEBUG, hiding disk-full / locked-database / schema drift behind a silent fall-through that forced fresh rebuilds on every subsequent turn. Now WARN with the session id and exception so persistent issues show up in agent.log without verbose mode.
3. Three-way stored-state distinction on read. The previous 'session_row.get("system_prompt") or None' collapsed three states into one (missing row / null column / empty string). Now we tell them apart and WARN when a continuing session lands on null/empty (which means the previous turn's write never persisted; every subsequent turn rebuilds and the prefix cache misses every time).

The restore block is extracted into _restore_or_build_system_prompt() so the prefix-cache path can be unit-tested in isolation.

E2E proof: fresh AIAgent constructed for turn 2 across a minute-boundary sleep restores byte-identical bytes from the session DB. NULL stored prompt fires the new warning. Date-only timestamp survives the rebuild path. All on real SessionDB, no mocks.

Tests:
- tests/agent/test_system_prompt_restore.py (10 new tests)
- tests/run_agent/test_run_agent.py::TestBuildSystemPrompt::test_datetime_is_date_only_not_minute_precision

Closes #20451 (date-only), #18547 (prefix stabilization), #8689 (stabilize timestamp across compression), #15866 (timestamp caching question), #8687 (compression timestamp), #27339 (claim #3: live timestamp in cached system prompt).

Co-authored-by: Martyn Forryan <9133432+iamfoz@users.noreply.github.com>
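The three-way distinction described in that commit message could be sketched as follows. The body of `_restore_or_build_system_prompt()` is not shown in this thread, so `stored_prompt_state`, `restore_or_build`, the state labels, and the log wording are all assumptions made for illustration.

```python
import logging
from typing import Any, Callable, Mapping, Optional

logger = logging.getLogger("prompt_restore_sketch")


def stored_prompt_state(session_row: Optional[Mapping[str, Any]]) -> str:
    """Tell apart the states that `row.get("system_prompt") or None` collapses."""
    if session_row is None:
        return "missing_row"
    value = session_row.get("system_prompt")
    if value is None:
        return "null"
    if value == "":
        return "empty"
    return "present"


def restore_or_build(
    session_row: Optional[Mapping[str, Any]],
    session_id: str,
    build: Callable[[], str],
) -> str:
    """Reuse the stored prompt byte-for-byte when possible, else rebuild."""
    state = stored_prompt_state(session_row)
    if state == "present":
        return session_row["system_prompt"]
    if state in ("null", "empty"):
        # A continuing session with no stored prompt means an earlier write was lost.
        logger.warning(
            "session %s: stored system prompt is %s; rebuilding, prefix cache will miss",
            session_id,
            state,
        )
    return build()


print(restore_or_build({"system_prompt": "cached prompt"}, "s1", lambda: "fresh prompt"))
print(restore_or_build({"system_prompt": None}, "s1", lambda: "fresh prompt"))
```

A missing row is treated here as a normal first turn and rebuilt silently, while null and empty values are surfaced at WARN level because they indicate a failed write.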

teknium1 · 2026-05-18T06:21:08Z

Superseded by PR #27675 (merged commit 4a3f13b), which makes the system prompt byte-stable for the full day via a date-only line. Your analysis of llama.cpp's KV-cache restoration depending on byte-identical prefix matching was on the money — the date-only timestamp closes the minute-by-minute volatility source you identified. Thanks for the detailed root-cause writeup.

teknium1 · 2026-05-18T06:21:29Z

Superseded by PR #27675 (merged commit 4a3f13b), which makes the system prompt byte-stable for the full day via a date-only Conversation started: line. Your analysis of llama.cpp's KV-cache restoration depending on byte-identical prefix matching was on the money — the date-only timestamp closes the minute-by-minute volatility source you identified. Thanks for the detailed root-cause writeup.
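For reference, a date-only line of the kind described above can be sketched like this. The format string is an assumption chosen to reproduce the 'Sunday, May 17, 2026' example quoted in the commit message; the format actually used in PR #27675 is not shown in this thread.

```python
from datetime import datetime


def date_only_line(session_start: datetime) -> str:
    """Build a timestamp line with day precision only.

    Day and month names come from the process locale; this sketch assumes
    the default C/English locale.
    """
    return f"Conversation started: {session_start.strftime('%A, %B %d, %Y')}"


morning = datetime(2026, 5, 17, 9, 0)
afternoon = datetime(2026, 5, 17, 15, 42)
next_day = datetime(2026, 5, 18, 0, 1)

print(date_only_line(morning))
print("same within the day:", date_only_line(morning) == date_only_line(afternoon))
print("changes at midnight:", date_only_line(afternoon) != date_only_line(next_day))
```

The line stays byte-identical for every rebuild within a calendar day, so it no longer depends on the in-memory or stored prompt surviving between turns.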


alt-glitch added the labels type/perf (Performance improvement or optimization), P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder), and tool/memory (Memory tool and memory providers) on May 1, 2026

liuhao1024 reviewed May 1, 2026

liuhao1024 approved these changes May 1, 2026

This was referenced May 17, 2026

[Bug]: Prompt Cache / KV Cache Invalidation on Follow-Up Messages Due to Dynamic Tool Shuffling #27339

Closed

perf(prompt-cache): date-only timestamp + loud gateway-DB roundtrip logging #27675

Merged

teknium1 closed this May 18, 2026

This was referenced May 19, 2026

perf(dashboard): gzip static files + long-term cache headers + plugin cache-bust sea-monsters/hermes-agent#1

Merged

perf(dashboard): gzip static files + long-term cache headers + plugin cache-bust #28543

Open

DaMoot wants to merge 1 commit into NousResearch:main from DaMoot:fix/prompt-prefix-stability
