fix(prompt): stabilize system prompt prefix for KV cache reuse#18547

Closed
DaMoot wants to merge 1 commit into
NousResearch:mainfrom
DaMoot:fix/prompt-prefix-stability

DaMoot commented May 1, 2026

fix(prompt): stabilize system prompt prefix for KV cache reuse

Summary

Two small changes to system prompt construction eliminate spurious KV cache invalidation between turns. The volatile content being removed has no semantic value to the model and was forcing unnecessary cold re-prefill on every turn it changed.

The effect is largest on llama.cpp servers running hybrid attention models (e.g., Qwen3.5 family with Gated DeltaNet), where --cache-reuse is unavailable and cache restoration relies on byte-identical prefix matching against stored checkpoints.

Root Cause

llama.cpp's KV cache checkpoint restoration matches the new prompt's prefix against existing checkpoints. When the prefix mutates — even by a single character — the matched span shrinks and downstream tokens have to be re-prefilled from the divergence point.

Two pieces of system prompt content were mutating between turns despite no semantic change:

1. tools/memory_tool.py:_render_block() included character-count usage indicators in section headers:

MEMORY (your personal notes) [49% — 1,085/2,200 chars]

Any memory write that bumped the percentage by one point, or shifted the comma position in the character count, invalidated all downstream cache.

2. run_agent.py:_build_system_prompt() regenerated the conversation timestamp via now() on every API call, despite the value semantically representing session start (which doesn't change within a session). Each minute rollover invalidated the cache.
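
To make the first source concrete, the sketch below re-creates the old header format from the removed lines in tools/memory_tool.py and measures how much of a prompt survives as a shared prefix after a small memory write. The names `render_old_header` and `common_prefix_len`, the preamble text, and the character counts are illustrative and not taken from the repository. llama.cpp compares token sequences rather than characters, so this is only a character-level approximation.

```python
def render_old_header(current: int, limit: int) -> str:
    """Re-create the pre-fix MEMORY header, including the usage indicator."""
    pct = min(100, int((current / limit) * 100)) if limit > 0 else 0
    # \u2014 is the dash character that appears in the original header text.
    return f"MEMORY (your personal notes) [{pct}% \u2014 {current:,}/{limit:,} chars]"


def common_prefix_len(a: str, b: str) -> int:
    """Count the leading characters shared by both strings."""
    n = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        n += 1
    return n


# Hypothetical prompt layout: fixed preamble, memory header, then a long tail.
preamble = "You are a helpful agent.\n"
tail = "\n" + "conversation history ... " * 200

before = preamble + render_old_header(1085, 2200) + tail
after = preamble + render_old_header(1107, 2200) + tail  # 22 more characters in memory

shared = common_prefix_len(before, after)
print(f"shared prefix: {shared} of {len(after)} characters")
```

Everything after the first differing character in the header, including the unchanged tail, falls outside the shared prefix.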

Changes

run_agent.py

Use self.session_start for the timestamp instead of now():

-        from hermes_time import now as _hermes_now
-        now = _hermes_now()
-        timestamp_line = f"Conversation started: {now.strftime(...)}"
+        # Use session_start timestamp for stability across turns within the same session.
+        # Regenerating the timestamp on every API call would invalidate the KV cache
+        # prefix even though the session hasn't actually started at a different time.
+        timestamp_line = f"Conversation started: {self.session_start.strftime(...)}"

session_start is what the line semantically claims to display anyway — this is also a correctness fix.
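
A minimal sketch of the difference, assuming a builder that captures `session_start` once at construction. `PromptBuilder`, `FakeClock`, and `STAMP_FORMAT` are hypothetical stand-ins; the real strftime pattern is elided in the diff above, so the format used here is an assumption.

```python
from datetime import datetime, timedelta

# Assumed format with minute precision; the actual pattern is not shown in the diff.
STAMP_FORMAT = "%Y-%m-%d %I:%M %p"


class FakeClock:
    """Deterministic clock so the minute rollover can be reproduced."""

    def __init__(self, start: datetime):
        self.now = start

    def __call__(self) -> datetime:
        return self.now

    def advance(self, seconds: int) -> None:
        self.now += timedelta(seconds=seconds)


class PromptBuilder:
    """Illustrative stand-in for the agent's prompt builder."""

    def __init__(self, clock):
        self._clock = clock
        self.session_start = clock()  # captured once per session

    def timestamp_line_old(self) -> str:
        # Pre-fix behaviour: re-read the clock on every build.
        return f"Conversation started: {self._clock().strftime(STAMP_FORMAT)}"

    def timestamp_line_new(self) -> str:
        # Post-fix behaviour: reuse the value captured at session start.
        return f"Conversation started: {self.session_start.strftime(STAMP_FORMAT)}"


clock = FakeClock(datetime(2026, 5, 1, 9, 30, 50))
builder = PromptBuilder(clock)
old_first, new_first = builder.timestamp_line_old(), builder.timestamp_line_new()

clock.advance(20)  # crosses the 09:31 minute boundary
old_second, new_second = builder.timestamp_line_old(), builder.timestamp_line_new()

print("old line stable:", old_first == old_second)
print("new line stable:", new_first == new_second)
```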

tools/memory_tool.py

Remove volatile usage indicators from memory block headers:

-        limit = self._char_limit(target)
         content = ENTRY_DELIMITER.join(entries)
-        current = len(content)
-        pct = min(100, int((current / limit) * 100)) if limit > 0 else 0
-
         if target == "user":
-            header = f"USER PROFILE (who the user is) [{pct}% — {current:,}/{limit:,} chars]"
+            header = "USER PROFILE (who the user is)"
         else:
-            header = f"MEMORY (your personal notes) [{pct}% — {current:,}/{limit:,} chars]"
+            header = "MEMORY (your personal notes)"

Usage statistics remain available in the memory tool's response payload via _success_response() — they're just no longer surfaced in the system prompt where they cost cache stability.

Evidence

Tested on a llama.cpp b8683-d0a6dfeb2 server running Qwen3.5-27B-UD-Q5_K_XL on a Tesla V100-SXM2-32GB.

Before: llama-server logs during multi-turn conversations showed prefix similarity collapsing whenever a memory write or minute rollover shifted bytes in the prompt, despite no semantic change in the conversation:

srv  load: looking for better prompt, base f_keep = 0.000, sim = 0.012
slot update_slots: n_tokens = 0, memory_seq_rm [0, end)

That's a full cold prefill — for a 22K-token conversation prefix on the V100, roughly 24-30 seconds of wasted work per turn.
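
The figures above imply a prefill throughput that can be worked out directly, which helps estimate what a given similarity score costs. This sketch assumes recompute time scales linearly with the number of tokens past the divergence point, and that `sim` approximates the shared fraction of the new prompt. Both are simplifying assumptions rather than measured properties of llama.cpp, and the helper names are hypothetical.

```python
def prefill_rate(tokens: int, seconds: float) -> float:
    """Tokens processed per second during prefill."""
    return tokens / seconds


def reprefill_seconds(prompt_tokens: int, sim: float, tokens_per_second: float) -> float:
    """Estimated time to recompute the part of the prompt that is not shared."""
    recompute_tokens = prompt_tokens * (1.0 - sim)
    return recompute_tokens / tokens_per_second


prompt_tokens = 22_000
slow = prefill_rate(prompt_tokens, 30.0)  # about 733 tokens/s
fast = prefill_rate(prompt_tokens, 24.0)  # about 917 tokens/s

for sim in (0.012, 0.987):
    best = reprefill_seconds(prompt_tokens, sim, fast)
    worst = reprefill_seconds(prompt_tokens, sim, slow)
    print(f"sim={sim:.3f}: roughly {best:.1f}-{worst:.1f} s of prefill")
```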

After: Successive turns in agentic tool-call chains maintain near-perfect prefix similarity:

sim_best = 0.987, f_keep = 0.987    ← turn 2
sim_best = 0.998, f_keep = 1.000    ← turn 3
sim_best = 0.999, f_keep = 1.000    ← turn 4
sim_best = 0.992, f_keep = 1.000    ← turn 5
sim_best = 0.994, f_keep = 1.000    ← turn 6
sim_best = 0.997, f_keep = 1.000    ← turn 7

Partial invalidation still occurs at semantically meaningful boundaries — e.g., a tool-category switch from email search to OCR dropped similarity to 0.683, which is correct behavior because the recent context genuinely changed.

Scope and Related Concerns

This PR addresses prompt-construction sources of cache invalidation only.

A related but architecturally distinct issue exists in agent/auxiliary_client.py: auxiliary tasks (title generation, summarization) default to provider: "auto", which routes them through the main model endpoint. On local llama.cpp deployments this destroys the main conversation's checkpoint state on every call (~1.2% prefix similarity between the title generation prompt and a typical conversation context). That's a separate concern that warrants its own PR — the right fix is probably to default auxiliary.* tasks to provider: none or to a separate endpoint when one is configured, with a logged warning if they're routed to the main endpoint.
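
The routing policy suggested above could look roughly like the following. This is a hypothetical sketch, not the logic in agent/auxiliary_client.py: the function `resolve_aux_endpoint`, its parameters, and the logger name are all invented for illustration.

```python
import logging
from typing import Optional

logger = logging.getLogger("auxiliary_routing_sketch")


def resolve_aux_endpoint(
    task: str,
    provider: str,
    main_endpoint: str,
    aux_endpoint: Optional[str] = None,
) -> Optional[str]:
    """Pick an endpoint for an auxiliary task, or None to skip the task."""
    if provider == "none":
        # Task disabled: nothing is sent, so the main cache is untouched.
        return None
    if aux_endpoint is not None:
        # A separate endpoint keeps auxiliary prompts out of the main slot.
        return aux_endpoint
    # Fallback: share the main endpoint, but make the cost visible.
    logger.warning(
        "auxiliary task %r routed to main endpoint %s; "
        "this may evict the main conversation's cached prefix",
        task,
        main_endpoint,
    )
    return main_endpoint
```

For example, `resolve_aux_endpoint("title_generation", "none", "http://localhost:8080")` returns `None`, so the title-generation call is skipped entirely.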

In combined testing on the setup described above, with both of these prompt fixes applied alongside the workaround auxiliary.title_generation.provider: none, multi-turn tool-call chain runtime dropped from roughly 7-9 minutes to 3-4 minutes on equivalent workloads. The split between the two fixes hasn't been isolated, but both contribute meaningfully and both should land.

Backwards Compatibility

No breaking changes:

  • Memory headers remain present and human-readable, just without inline usage stats
  • Timestamp still displays (using session_start, which is what the line says it represents)
  • Tool response payloads unchanged
  • Memory usage stats remain available via the memory tool's response

Testing

Manual verification confirms the fixes are correctly applied:

# Confirm volatile counters removed from memory headers
grep -A15 "def _render_block" tools/memory_tool.py

# Confirm timestamp uses session_start
grep -B3 "timestamp_line = f\"Conversation started" run_agent.py

End-to-end validation was done via llama-server log analysis on multi-turn tool-call chains over a 26K-token context. Post-fix, cache invalidation occurs only at legitimate context boundaries (tool category change, context truncation), not on semantically inert prompt mutations.

To observe cache behaviour live on a running llama-server instance:

tail -f /var/log/llama-server.log | grep -E "memory_seq_rm|found better|f_keep|sim =|erased invalidated"

Healthy indicators: sim_best > 0.95 on consecutive turns, memory_seq_rm positions advancing rather than resetting to [0, end). Pathological indicator: sim = 0.0xx or memory_seq_rm [0, end) on turns where no semantic context shift occurred.
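
The indicators above can also be checked mechanically. The sketch below classifies log lines using only the patterns quoted in this description (`sim = …`, `sim_best = …`, and `memory_seq_rm [0, end)`). The labels `cold`, `healthy`, and `partial` are illustrative, and the exact log wording may differ between llama.cpp builds.

```python
import re
from typing import Optional

SIM_RE = re.compile(r"\bsim(?:_best)? = (\d+\.\d+)")
FULL_RESET = "memory_seq_rm [0, end)"
HEALTHY_SIM = 0.95  # threshold taken from the healthy indicator above


def classify(line: str) -> Optional[str]:
    """Return 'cold', 'healthy', 'partial', or None for unrelated lines."""
    if FULL_RESET in line:
        return "cold"  # whole cache dropped, full re-prefill follows
    match = SIM_RE.search(line)
    if match is None:
        return None
    sim = float(match.group(1))
    return "healthy" if sim > HEALTHY_SIM else "partial"


sample = [
    "srv  load: looking for better prompt, base f_keep = 0.000, sim = 0.012",
    "slot update_slots: n_tokens = 0, memory_seq_rm [0, end)",
    "sim_best = 0.987, f_keep = 0.987",
]
for line in sample:
    print(classify(line), "|", line)
```

A `partial` result is not necessarily a problem: as noted under Evidence, a genuine context change also lowers similarity.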

Files Changed

  • tools/memory_tool.py — 6 lines removed, 5 lines added (header simplification + docstring)
  • run_agent.py — 3 lines removed, 4 lines added (timestamp source + comment)

Two pieces of system prompt content were mutating between turns despite no
semantic change, invalidating the llama.cpp KV cache prefix and forcing
unnecessary re-prefill on every turn:

1. Memory block headers (tools/memory_tool.py:_render_block) included
   character-count percentage indicators that shifted bytes whenever memory
   size changed enough to alter the percentage or comma placement.

2. The "Conversation started:" timestamp (run_agent.py:_build_system_prompt)
   was regenerated via now() on every API call, despite semantically
   representing session start.

This change uses self.session_start for the timestamp and removes the inline
usage indicators from memory headers. Memory usage stats remain available in
tool responses via _success_response().

Particularly impactful on hybrid attention models (e.g., Qwen3.5 family with
Gated DeltaNet) where llama.cpp's --cache-reuse flag is unavailable and the
cache depends on byte-identical prefix matching for checkpoint restoration.

On local sessions, this can save seconds per turn and several minutes over
a multi-step task. When a tiny prompt detail changes near the front, llama.cpp
may have to reread and recompute much of the prompt instead of reusing the
cache. The main cost is repeated prefill work, with some extra memory traffic
from rebuilding the KV cache.

Validated on Qwen3.5-27B on llama.cpp with log analysis showing post-fix
turn-to-turn prefix similarity of 0.987-0.999 vs. pre-fix similarity
collapsing to <0.05 on memory writes or minute rollovers.

Files modified:
- tools/memory_tool.py
- run_agent.py

alt-glitch added the type/perf (Performance improvement or optimization), P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder), and tool/memory (Memory tool and memory providers) labels May 1, 2026

alt-glitch (Collaborator) commented

Likely duplicate of #8689 — same fix: use session_start instead of now() for system prompt timestamp stability. This PR also addresses the memory char-count header mutation which #8689 may not cover, but the timestamp fix is identical.

liuhao1024 (Contributor) left a comment

The timestamp stabilization (self.session_start instead of _hermes_now()) is the correct and effective fix for KV cache reuse — nice catch.

However, the memory header percentage removal in _render_block does not contribute to KV cache stability. The _system_prompt_snapshot is frozen at load_from_disk() time (line 112: "frozen at load time, used for system prompt injection") and the percentage is a deterministic function of content length and char_limit — both of which are fixed once the snapshot is created. The percentage cannot change between turns within a session, so it was already stable.

Removing it is a net-negative change: it strips useful diagnostic information (memory usage pressure visible to the agent in the system prompt) with zero cache benefit. I would recommend reverting the _render_block changes in memory_tool.py and keeping only the timestamp fix in run_agent.py.

liuhao1024 (Contributor) left a comment

Confirmed: self.session_start is initialized at run_agent.py:1600 (self.session_start = datetime.now()), so the attribute lookup is safe. The hermes_time.now import on the current main branch is at line 4949 inside _build_system_prompt — removing it is a clean local-only change.

Both fixes are correct:

  1. Timestamp: "Conversation started: {now()}" was semantically wrong (it claimed to show when the conversation started but actually showed the current time on every turn). Using self.session_start is both a correctness fix and a cache stability win.

  2. Memory headers — the [49% — 1,085/2,200 chars] indicators mutated on every memory write (even adding a single character could shift the percentage by 1 point or move a comma), invalidating the entire downstream KV prefix. Removing them from the system prompt while keeping them in _success_response() tool payloads is the right trade-off.

This also benefits inference servers beyond llama.cpp that do prefix caching (vLLM's automatic prefix caching, TGI's --prefix-caching, SGLang's RadixAttention) — any byte-identical prefix matching benefits from eliminating these mutations.

Cyrene963 pushed a commit to Cyrene963/hermes-agent that referenced this pull request May 3, 2026
Community PRs applied:
- NousResearch#18596: Enable secret redaction by default (SECURITY)
- NousResearch#18650: Sanitize malformed tool messages + auto-recover on API 400
- NousResearch#18607: Emergency compression before max_iterations exhaustion
- NousResearch#18603: Compression fallback to main model on 413 rate limit
- NousResearch#18638: Pass threshold_percent on model switch
- NousResearch#18663: Strip extra_content from tool_calls for strict APIs
- NousResearch#18618: Forward explicit_api_key to OpenRouter
- NousResearch#18632: Show cache tokens in /insights breakdown
- NousResearch#18614: Add idempotency guard for patch duplicate loops
- NousResearch#18600: Raise ValueError when HERMES_HOME unset in profile mode
- NousResearch#18616: Allow ZWJ emoji in context files
- NousResearch#18582: Reload .env on /restart
- NousResearch#18547: Stabilize system prompt prefix for KV cache reuse
- NousResearch#18692: Strip FTS5 operators from session search truncation terms

Fix: Add order_by_last_active=True to list_sessions_rich call
(pre-existing commit 142b4bf code sync)
teknium1 added a commit that referenced this pull request May 18, 2026
…ogging

The system prompt's 'Conversation started:' line carried minute precision
(%I:%M %p), making it byte-unstable across every rebuild path. Within a
CLI session the in-memory cache held, but on the gateway path (fresh
AIAgent per turn → restore from session DB), any silent failure in the
read or write path dropped the cache stem and forced a full re-prefill
on every subsequent turn. Local prefix-caching backends (llama.cpp /
vLLM) saw this as KV-cache invalidation; remote prefix-caching providers
saw it as an Anthropic-style cache miss.

Three changes:

1. Date-only timestamp ('Sunday, May 17, 2026' instead of '... 03:42 PM').
   System prompt now byte-stable for the full day. The model can still
   query exact time via tools when it actually needs it. Credit:
   @iamfoz (PR #20451).

2. Loud logging on session DB write failures. The update_system_prompt
   call used to log at DEBUG, hiding disk-full / locked-database / schema
   drift behind a silent fall-through that forced fresh rebuilds on
   every subsequent turn. Now WARN with the session id and exception so
   persistent issues show up in agent.log without verbose mode.

3. Three-way stored-state distinction on read. The previous
   'session_row.get("system_prompt") or None' collapsed three states
   into one (missing row / null column / empty string). Now we tell them
   apart and WARN when a continuing session lands on null/empty (which
   means the previous turn's write never persisted — every subsequent
   turn rebuilds and the prefix cache misses every time).

The restore block is extracted into _restore_or_build_system_prompt()
so the prefix-cache path can be unit-tested in isolation.
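
The three-way distinction in item 3 can be sketched as follows. This is an illustrative reconstruction, not the body of _restore_or_build_system_prompt(): the `StoredPrompt` enum and `classify_stored_prompt` function are hypothetical names, and the row is modelled as a plain mapping.

```python
from enum import Enum
from typing import Mapping, Optional


class StoredPrompt(Enum):
    MISSING_ROW = "missing_row"  # no session row at all (new session)
    NULL = "null"                # row exists, column never written
    EMPTY = "empty"              # row exists, column holds ""
    PRESENT = "present"          # usable stored prompt


def classify_stored_prompt(session_row: Optional[Mapping[str, object]]) -> StoredPrompt:
    """Tell apart the states that `row.get("system_prompt") or None` collapses."""
    if session_row is None:
        return StoredPrompt.MISSING_ROW
    value = session_row.get("system_prompt")
    if value is None:
        return StoredPrompt.NULL
    if value == "":
        return StoredPrompt.EMPTY
    return StoredPrompt.PRESENT


def should_warn(state: StoredPrompt, continuing_session: bool) -> bool:
    """A continuing session with a null or empty prompt means a write was lost."""
    return continuing_session and state in (StoredPrompt.NULL, StoredPrompt.EMPTY)
```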

E2E proof: fresh AIAgent constructed for turn 2 across a minute-boundary
sleep restores byte-identical bytes from the session DB. NULL stored
prompt fires the new warning. Date-only timestamp survives the rebuild
path. All on real SessionDB, no mocks.
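
A small sketch of why the date-only line is stable across a minute boundary. The date-only pattern below reproduces the 'Sunday, May 17, 2026' example from item 1; the minute-precision pattern is an assumption built around the quoted `%I:%M %p` fragment, and both constant names are hypothetical.

```python
from datetime import datetime

DATE_ONLY = "%A, %B %d, %Y"
MINUTE_PRECISION = "%A, %B %d, %Y %I:%M %p"  # assumed full pattern

# Two prompt rebuilds a few seconds apart, straddling 03:42 PM.
first_build = datetime(2026, 5, 17, 15, 41, 58)
second_build = datetime(2026, 5, 17, 15, 42, 3)

old_lines = [t.strftime(MINUTE_PRECISION) for t in (first_build, second_build)]
new_lines = [t.strftime(DATE_ONLY) for t in (first_build, second_build)]

print("minute precision stable:", old_lines[0] == old_lines[1])
print("date only stable:", new_lines[0] == new_lines[1])
```

The date-only line still changes once at midnight, which matches the "byte-stable for the full day" wording above.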

Tests:
  - tests/agent/test_system_prompt_restore.py (10 new tests)
  - tests/run_agent/test_run_agent.py::TestBuildSystemPrompt::
        test_datetime_is_date_only_not_minute_precision

Closes #20451 (date-only), #18547 (prefix stabilization),
#8689 (stabilize timestamp across compression), #15866 (timestamp
caching question), #8687 (compression timestamp), #27339
(claim #3: live timestamp in cached system prompt).

Co-authored-by: Martyn Forryan <9133432+iamfoz@users.noreply.github.com>

teknium1 (Contributor) commented

Superseded by PR #27675 (merged commit 4a3f13b), which makes the system prompt byte-stable for the full day via a date-only line. Your analysis of llama.cpp's KV-cache restoration depending on byte-identical prefix matching was on the money — the date-only timestamp closes the minute-by-minute volatility source you identified. Thanks for the detailed root-cause writeup.

@teknium1 teknium1 closed this May 18, 2026
Lillard01 pushed a commit to Lillard01/hermes-agent that referenced this pull request May 21, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026