fix(prompt): stabilize system prompt prefix for KV cache reuse #18547
DaMoot wants to merge 1 commit
Conversation
Two pieces of system prompt content were mutating between turns despite no semantic change, invalidating the llama.cpp KV cache prefix and forcing unnecessary re-prefill on every turn:
1. Memory block headers (tools/memory_tool.py:_render_block) included character-count percentage indicators that shifted bytes whenever memory size changed enough to alter the percentage or comma placement.
2. The "Conversation started:" timestamp (run_agent.py:_build_system_prompt) was regenerated via now() on every API call, despite semantically representing session start.
This change uses self.session_start for the timestamp and removes the inline usage indicators from memory headers. Memory usage stats remain available in tool responses via _success_response().
Particularly impactful on hybrid attention models (e.g., Qwen3.5 family with Gated DeltaNet) where llama.cpp's --cache-reuse flag is unavailable and the cache depends on byte-identical prefix matching for checkpoint restoration. On local sessions, this can save seconds per turn and several minutes over a multi-step task.
When a tiny prompt detail changes near the front, llama.cpp may have to reread and recompute much of the prompt instead of reusing the cache. The main cost is repeated prefill work, with some extra memory traffic from rebuilding the KV cache.
Validated on Qwen3.5-27B on llama.cpp with log analysis showing post-fix turn-to-turn prefix similarity of 0.987-0.999 vs. pre-fix similarity collapsing to <0.05 on memory writes or minute rollovers.
Files modified:
- tools/memory_tool.py
- run_agent.py
Likely duplicate of #8689: same fix, using session_start instead of now() for system prompt timestamp stability.
liuhao1024
left a comment
The timestamp stabilization (self.session_start instead of _hermes_now()) is the correct and effective fix for KV cache reuse — nice catch.
However, the memory header percentage removal in _render_block does not contribute to KV cache stability. The _system_prompt_snapshot is frozen at load_from_disk() time (line 112: "frozen at load time, used for system prompt injection") and the percentage is a deterministic function of content length and char_limit — both of which are fixed once the snapshot is created. The percentage cannot change between turns within a session, so it was already stable.
Removing it is a net-negative change: it strips useful diagnostic information (memory usage pressure visible to the agent in the system prompt) with zero cache benefit. I would recommend reverting the _render_block changes in memory_tool.py and keeping only the timestamp fix in run_agent.py.
liuhao1024
left a comment
Confirmed: self.session_start is initialized at run_agent.py:1600 (self.session_start = datetime.now()), so the attribute lookup is safe. The hermes_time.now import on the current main branch is at line 4949 inside _build_system_prompt — removing it is a clean local-only change.
Both fixes are correct:
- Timestamp: "Conversation started: {now()}" was semantically wrong (it claimed to show when the conversation started but actually showed the current time on every turn). Using self.session_start is both a correctness fix and a cache stability win.
- Memory headers: the [49% — 1,085/2,200 chars] indicators mutated on every memory write (even adding a single character could shift the percentage by 1 point or move a comma), invalidating the entire downstream KV prefix. Removing them from the system prompt while keeping them in _success_response() tool payloads is the right trade-off.
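As a rough illustration of the memory-header point, the sketch below rebuilds an indicator in the quoted format and shows how small writes change its bytes. The function name render_header and the floor rounding are assumptions made for this example; they are not taken from tools/memory_tool.py.

```python
def render_header(used_chars: int, char_limit: int) -> str:
    """Build a usage indicator in the format quoted above (hypothetical helper)."""
    pct = used_chars * 100 // char_limit  # assumption: the percentage is floored
    # \u2014 is the dash used in the quoted header
    return f"[{pct}% \u2014 {used_chars:,}/{char_limit:,} chars]"


baseline = render_header(1085, 2200)
one_more_char = render_header(1086, 2200)    # character count changes
next_percent = render_header(1100, 2200)     # percentage ticks over
comma_shift = (render_header(999, 2200), render_header(1000, 2200))

for label, text in [("baseline", baseline),
                    ("+1 char", one_more_char),
                    ("50% mark", next_percent),
                    ("999", comma_shift[0]),
                    ("1000", comma_shift[1])]:
    print(f"{label:>9}: {text}")
```

Every variant differs from the baseline in at least one byte, which is enough to end a byte-identical prefix match at the header.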
This also benefits inference servers beyond llama.cpp that do prefix caching (vLLM's automatic prefix caching, TGI's --prefix-caching, SGLang's RadixAttention) — any byte-identical prefix matching benefits from eliminating these mutations.
Community PRs applied:
- NousResearch#18596: Enable secret redaction by default (SECURITY)
- NousResearch#18650: Sanitize malformed tool messages + auto-recover on API 400
- NousResearch#18607: Emergency compression before max_iterations exhaustion
- NousResearch#18603: Compression fallback to main model on 413 rate limit
- NousResearch#18638: Pass threshold_percent on model switch
- NousResearch#18663: Strip extra_content from tool_calls for strict APIs
- NousResearch#18618: Forward explicit_api_key to OpenRouter
- NousResearch#18632: Show cache tokens in /insights breakdown
- NousResearch#18614: Add idempotency guard for patch duplicate loops
- NousResearch#18600: Raise ValueError when HERMES_HOME unset in profile mode
- NousResearch#18616: Allow ZWJ emoji in context files
- NousResearch#18582: Reload .env on /restart
- NousResearch#18547: Stabilize system prompt prefix for KV cache reuse
- NousResearch#18692: Strip FTS5 operators from session search truncation terms
Fix: Add order_by_last_active=True to list_sessions_rich call (pre-existing commit 142b4bf code sync)
…ogging
The system prompt's 'Conversation started:' line carried minute precision
(%I:%M %p), making it byte-unstable across every rebuild path. Within a
CLI session the in-memory cache held, but on the gateway path (fresh
AIAgent per turn → restore from session DB), any silent failure in the
read or write path dropped the cache stem and forced a full re-prefill
on every subsequent turn. Local prefix-caching backends (llama.cpp /
vLLM) saw this as KV-cache invalidation; remote prefix-caching providers
saw it as an Anthropic-style cache miss.
Three changes:
1. Date-only timestamp ('Sunday, May 17, 2026' instead of '... 03:42 PM').
System prompt now byte-stable for the full day. The model can still
query exact time via tools when it actually needs it. Credit:
@iamfoz (PR #20451).
2. Loud logging on session DB write failures. The update_system_prompt
call used to log at DEBUG, hiding disk-full / locked-database / schema
drift behind a silent fall-through that forced fresh rebuilds on
every subsequent turn. Now WARN with the session id and exception so
persistent issues show up in agent.log without verbose mode.
3. Three-way stored-state distinction on read. The previous
'session_row.get("system_prompt") or None' collapsed three states
into one (missing row / null column / empty string). Now we tell them
apart and WARN when a continuing session lands on null/empty (which
means the previous turn's write never persisted — every subsequent
turn rebuilds and the prefix cache misses every time).
The restore block is extracted into _restore_or_build_system_prompt()
so the prefix-cache path can be unit-tested in isolation.
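A minimal sketch of the three-way read described in item 3, assuming the session row arrives as a dict or None. The function restore_or_build and its parameters are hypothetical stand-ins for illustration, not the actual _restore_or_build_system_prompt() code.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("agent")


def restore_or_build(session_row: Optional[dict], session_id: str,
                     build: Callable[[], str]) -> str:
    """Return the stored system prompt, or rebuild it and say why."""
    if session_row is None:
        # Missing row: a new session, so building a fresh prompt is expected.
        return build()
    stored = session_row.get("system_prompt")
    if stored is None:
        # Null column on a continuing session: the earlier write never landed.
        logger.warning("session %s: stored system prompt is NULL, rebuilding",
                       session_id)
        return build()
    if stored == "":
        # Empty string: also a persistence problem, reported separately.
        logger.warning("session %s: stored system prompt is empty, rebuilding",
                       session_id)
        return build()
    return stored  # byte-identical restore keeps the prefix cache warm


def fresh() -> str:
    return "freshly built prompt"


print(restore_or_build(None, "s1", fresh))
print(restore_or_build({"system_prompt": None}, "s1", fresh))
print(restore_or_build({"system_prompt": "stored prompt"}, "s1", fresh))
```

By contrast, an expression of the form row.get("system_prompt") or None yields None for both the null and the empty case, so the caller cannot tell which failure occurred.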
E2E proof: fresh AIAgent constructed for turn 2 across a minute-boundary
sleep restores byte-identical bytes from the session DB. NULL stored
prompt fires the new warning. Date-only timestamp survives the rebuild
path. All on real SessionDB, no mocks.
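To make the date-only claim concrete, here is a small sketch comparing the two precisions across a minute boundary. The exact format strings are assumptions inferred from the example in item 1 and the quoted %I:%M %p suffix; output assumes Python's default C locale.

```python
from datetime import datetime, timedelta

# Assumed formats: the date part matches the 'Sunday, May 17, 2026' example.
DATE_ONLY = "%A, %B %d, %Y"
WITH_MINUTES = DATE_ONLY + " %I:%M %p"

turn_1 = datetime(2026, 5, 17, 15, 42)
turn_2 = turn_1 + timedelta(minutes=1)  # next turn, one minute later

date_only_stable = turn_1.strftime(DATE_ONLY) == turn_2.strftime(DATE_ONLY)
minute_stable = turn_1.strftime(WITH_MINUTES) == turn_2.strftime(WITH_MINUTES)

print(turn_1.strftime(DATE_ONLY))
print("date-only stable across the minute boundary:", date_only_stable)
print("minute precision stable across the minute boundary:", minute_stable)
```

The date-only line still changes once at midnight, which matches the "byte-stable for the full day" wording rather than a stronger guarantee.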
Tests:
- tests/agent/test_system_prompt_restore.py (10 new tests)
- tests/run_agent/test_run_agent.py::TestBuildSystemPrompt::
test_datetime_is_date_only_not_minute_precision
Closes #20451 (date-only), #18547 (prefix stabilization),
#8689 (stabilize timestamp across compression), #15866 (timestamp
caching question), #8687 (compression timestamp), #27339
(claim #3: live timestamp in cached system prompt).
Co-authored-by: Martyn Forryan <9133432+iamfoz@users.noreply.github.com>
Superseded by PR #27675 (merged commit 4a3f13b), which makes the system prompt byte-stable for the full day via a date-only line. Your analysis of llama.cpp's KV-cache restoration depending on byte-identical prefix matching was on the money: the date-only timestamp closes the minute-by-minute volatility source you identified. Thanks for the detailed root-cause writeup.
fix(prompt): stabilize system prompt prefix for KV cache reuse
Summary
Two small changes to system prompt construction eliminate spurious KV cache invalidation between turns. The volatile content being removed has no semantic value to the model and was forcing unnecessary cold re-prefill on every turn it changed.
Particularly impactful on llama.cpp servers running hybrid attention models (e.g., Qwen3.5 family with Gated DeltaNet) where --cache-reuse is unavailable and cache restoration relies on byte-identical prefix matching against stored checkpoints.
Root Cause
llama.cpp's KV cache checkpoint restoration matches the new prompt's prefix against existing checkpoints. When the prefix mutates — even by a single character — the matched span shrinks and downstream tokens have to be re-prefilled from the divergence point.
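The effect of an early divergence can be sketched with plain strings. This is a simplified model: real servers compare token IDs rather than characters, and the prompt text below is invented for the example.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters the two strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Invented prompts: identical except for one digit near the front.
body = "You are a helpful assistant. " * 200
old_prompt = "Conversation started: 03:42 PM\n" + body
new_prompt = "Conversation started: 03:43 PM\n" + body

shared = common_prefix_len(old_prompt, new_prompt)
reusable_fraction = shared / len(new_prompt)

print(f"shared prefix: {shared} of {len(new_prompt)} characters")
print(f"reusable fraction: {reusable_fraction:.4f}")
```

Although the two prompts differ by a single character, almost none of the second prompt sits before the divergence point, so almost none of it can be matched as a prefix.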
Two pieces of system prompt content were mutating between turns despite no semantic change:
1. tools/memory_tool.py:_render_block() included character-count usage indicators in section headers. Any memory write that bumped the percentage by one point, or shifted the comma position in the character count, invalidated all downstream cache.
2. run_agent.py:_build_system_prompt() regenerated the conversation timestamp via now() on every API call, despite the value semantically representing session start (which doesn't change within a session). Each minute rollover invalidated the cache.
Changes
run_agent.py
Use self.session_start for the timestamp instead of now(). session_start is what the line semantically claims to display anyway, so this is also a correctness fix.
tools/memory_tool.py
Remove volatile usage indicators from memory block headers. Usage statistics remain available in the memory tool's response payload via _success_response(); they're just no longer surfaced in the system prompt where they cost cache stability.
Evidence
Tested on a llama.cpp b8683-d0a6dfeb2 server running Qwen3.5-27B-UD-Q5_K_XL on a Tesla V100-SXM2-32GB.
Before: llama-server logs during multi-turn conversations showed prefix similarity collapsing (to <0.05) whenever a memory write or minute rollover shifted bytes in the prompt, despite no semantic change in the conversation. Each such collapse is a full cold prefill: for a 22K-token conversation prefix on the V100, roughly 24-30 seconds of wasted work per turn.
After: Successive turns in agentic tool-call chains maintain near-perfect prefix similarity (0.987-0.999 turn to turn).
Partial invalidation still occurs at semantically meaningful boundaries — e.g., a tool-category switch from email search to OCR dropped similarity to 0.683, which is correct behavior because the recent context genuinely changed.
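For a back-of-the-envelope feel for these numbers, the sketch below scales the reported 24-30 second cold prefill by the share of the prompt that is not matched. It assumes that similarity approximates the reused fraction of the prompt and that prefill cost is linear in recomputed tokens; both are simplifications, so treat the output as an order-of-magnitude estimate only.

```python
# Reported cold-prefill range for the 22K-token prefix on the test machine.
COLD_PREFILL_SECONDS = (24.0, 30.0)


def reprefill_seconds(similarity: float, cold_seconds: float) -> float:
    """Estimated prefill time if only the unmatched share is recomputed."""
    return (1.0 - similarity) * cold_seconds


# Similarity values quoted in this PR: pre-fix collapse, tool-category
# switch, and the post-fix range.
for sim in (0.05, 0.683, 0.987, 0.999):
    low, high = (reprefill_seconds(sim, s) for s in COLD_PREFILL_SECONDS)
    print(f"sim={sim:.3f}: roughly {low:.1f}-{high:.1f} s of prefill")
```

Under these assumptions, the post-fix similarity range corresponds to well under a second of prefill per turn, against tens of seconds for a collapsed prefix.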
Scope and Related Concerns
This PR addresses prompt-construction sources of cache invalidation only.
A related but architecturally distinct issue exists in agent/auxiliary_client.py: auxiliary tasks (title generation, summarization) default to provider: "auto", which routes them through the main model endpoint. On local llama.cpp deployments this destroys the main conversation's checkpoint state on every call (~1.2% prefix similarity between the title generation prompt and a typical conversation context). That's a separate concern that warrants its own PR. The right fix is probably to default auxiliary.* tasks to provider: none or to a separate endpoint when one is configured, with a logged warning if they're routed to the main endpoint.
In real-world combined testing on the test setup (applying both these prompt fixes and the workaround auxiliary.title_generation.provider: none), multi-turn tool-call chain runtime dropped from roughly 7-9 minutes to 3-4 minutes on equivalent workloads. The split between the two fixes hasn't been isolated, but both contribute meaningfully and both should land.
Backwards Compatibility
No breaking changes:
session_start, which is what the line says it represents)
Testing
Manual verification confirms the fixes are correctly applied.
End-to-end validation was done via llama-server log analysis on multi-turn tool-call chains over a 26K-token context. Post-fix, cache invalidation occurs only at legitimate context boundaries (tool category change, context truncation), not on semantically inert prompt mutations.
To observe cache behaviour live on a running llama-server instance, watch its logs for the following.
Healthy indicators: sim_best > 0.95 on consecutive turns, memory_seq_rm positions advancing rather than resetting to [0, end).
Pathological indicator: sim = 0.0xx or memory_seq_rm [0, end) on turns where no semantic context shift occurred.
Files Changed
- tools/memory_tool.py: 6 lines removed, 5 lines added (header simplification + docstring)
- run_agent.py: 3 lines removed, 4 lines added (timestamp source + comment)