Problem
Users running local MoE models (Qwen3.5 35B, Mixtral, etc.) via Ollama/vLLM report severe performance degradation during long sessions. Every context compression cycle invalidates the KV cache by rebuilding the system prompt, forcing the model to reprocess the full context from scratch on the next turn.
On a 35B MoE model, reprocessing 30k+ tokens of context is slow: users report multi-minute pauses after compression.
Current architecture (mostly correct)
The system prompt is already designed for prefix cache stability:
- _build_system_prompt() (line 2529) runs once per session, and its result is cached on self._cached_system_prompt
- The docstring explicitly notes this maximizes prefix cache hits (lines 2533-2535)
- Subsequent turns reuse the stored prompt for Anthropic cache prefix matching (line 6174)
- Memory flushes write to disk but do NOT invalidate the cache
The only invalidation point is _invalidate_system_prompt() (line 2878), called after compression at line 5192.
What happens during compression
- Pre-compression memory flush writes new memories to disk (line 5184)
- _invalidate_system_prompt() clears the cache and reloads memory from disk (lines 2885-2887)
- _build_system_prompt() rebuilds from scratch with the updated memory (line 5193)
- The new system prompt has different memory content → different token prefix → KV cache miss
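To make the cache miss concrete, here is a minimal sketch that measures how much of the old prompt a prefix cache could still reuse after a memory change. The prompts and the helper name `shared_prefix_len` are hypothetical, and character counts are only a rough proxy for tokens.

```python
import os


def shared_prefix_len(old_prompt: str, new_prompt: str) -> int:
    """Length of the longest common character prefix of two prompts."""
    return len(os.path.commonprefix([old_prompt, new_prompt]))


# Hypothetical prompts: memory sits before the tool text, so a one-line
# memory change ends the shared prefix before the unchanged text after it.
identity = "You are an agent.\n"
tools = "Tools: read_file, write_file\n"
old_prompt = identity + "Memory: user prefers tabs\n" + tools
new_prompt = identity + "Memory: user prefers spaces\n" + tools

reusable = shared_prefix_len(old_prompt, new_prompt)
print(f"reusable prefix: {reusable} of {len(old_prompt)} characters")
```

Everything after the first differing position has to be reprocessed, even though the tool text itself did not change.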
Optimization opportunities
1. Skip rebuild when memory content hasn't changed
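After compression, check whether the memory content is identical to what was already in the cached prompt. If so, keep the cached prompt as-is and only update the compression summary in the conversation history. The simplest version rebuilds the prompt and compares it with the cached one: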
def _invalidate_system_prompt(self):
    old_prompt = self._cached_system_prompt
    self._cached_system_prompt = None
    if self._memory_store:
        self._memory_store.load_from_disk()
    # Rebuild and compare — if identical, restore the old prompt
    new_prompt = self._build_system_prompt()
    if new_prompt == old_prompt:
        self._cached_system_prompt = old_prompt  # preserve KV cache
    else:
        self._cached_system_prompt = new_prompt
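A variant that avoids the rebuild entirely is to fingerprint the memory text and compare digests. The sketch below is illustrative only: `PromptCache`, `load_memory`, and `build_prompt` are hypothetical stand-ins rather than names from the codebase, and it assumes memory is the only prompt input that changes between compressions.

```python
import hashlib


class PromptCache:
    """Hypothetical stand-in for the agent's system prompt caching."""

    def __init__(self, load_memory, build_prompt):
        self._load_memory = load_memory    # callable returning the memory text
        self._build_prompt = build_prompt  # callable: memory text -> prompt
        self._memory_digest = None
        self._cached_system_prompt = None
        self.rebuild_count = 0

    @staticmethod
    def _digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def _rebuild(self, memory: str) -> None:
        self._cached_system_prompt = self._build_prompt(memory)
        self._memory_digest = self._digest(memory)
        self.rebuild_count += 1

    def get_prompt(self) -> str:
        if self._cached_system_prompt is None:
            self._rebuild(self._load_memory())
        return self._cached_system_prompt

    def invalidate_after_compression(self) -> bool:
        """Reload memory and rebuild only if it changed. True means rebuilt."""
        memory = self._load_memory()
        unchanged = (
            self._cached_system_prompt is not None
            and self._digest(memory) == self._memory_digest
        )
        if unchanged:
            return False  # keep the cached prompt, and with it the prefix
        self._rebuild(memory)
        return True


# Usage with an in-memory stand-in for the on-disk memory store.
memory = {"text": "user prefers tabs"}
cache = PromptCache(
    load_memory=lambda: memory["text"],
    build_prompt=lambda m: f"You are an agent.\nMemory: {m}\n",
)
first = cache.get_prompt()
unchanged_result = cache.invalidate_after_compression()
memory["text"] = "user prefers spaces"
changed_result = cache.invalidate_after_compression()
```

If other inputs can change between builds (for example the date/time line), they would need to be folded into the digest as well.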
2. Prefix-stable prompt ordering
Structure the system prompt so volatile sections (memory, compression notes) are at the END:
[STABLE] Agent identity (SOUL.md)
[STABLE] Skills guidance
[STABLE] Context files (AGENTS.md)
[STABLE] Tool definitions
[VOLATILE] Memory snapshot
[VOLATILE] Compression summary note
[VOLATILE] Date/time
This way, even when memory changes, Ollama/vLLM can reuse the KV cache for the stable prefix portion.
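One way to enforce this ordering is to tag each section as stable or volatile and sort at assembly time. This is a sketch under assumptions: the `Section` type, `assemble_prompt`, and `stable_prefix` are hypothetical names, and the section texts are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    name: str
    text: str
    volatile: bool


def assemble_prompt(sections):
    """Join sections with all stable ones first, keeping relative order."""
    # sorted() is stable and False sorts before True, so sections keep
    # their original order within the stable and volatile groups.
    ordered = sorted(sections, key=lambda s: s.volatile)
    return "\n".join(s.text for s in ordered)


def stable_prefix(sections):
    """The part of the assembled prompt that never changes."""
    return "\n".join(s.text for s in sections if not s.volatile)


def make_sections(memory_text):
    # Declared in an arbitrary order; assemble_prompt fixes the layout.
    return [
        Section("identity", "You are an agent.", volatile=False),
        Section("memory", f"Memory: {memory_text}", volatile=True),
        Section("tools", "Tools: read_file, write_file", volatile=False),
    ]


before = assemble_prompt(make_sections("user prefers tabs"))
after = assemble_prompt(make_sections("user prefers spaces"))
prefix = stable_prefix(make_sections("anything"))
```

Both `before` and `after` start with the same `prefix`, so only the volatile tail differs between them.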
3. Config option to disable memory-in-system-prompt for local models
Local model users who prioritize inference speed over memory freshness could opt out:
memory:
  include_in_system_prompt: false  # keep memories on disk, don't embed in prompt
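A minimal sketch of how the prompt builder might honor this flag, assuming the config has already been parsed into a dict. The function names and the default of `True` (which preserves current behavior) are assumptions, not the existing implementation.

```python
def memory_in_prompt(config: dict) -> bool:
    """Read memory.include_in_system_prompt, defaulting to True."""
    # "or {}" covers a "memory:" key that is present but empty (None).
    memory_cfg = config.get("memory") or {}
    return bool(memory_cfg.get("include_in_system_prompt", True))


def build_system_prompt(base_prompt: str, memory_text: str, config: dict) -> str:
    """Append the memory snapshot only when the config allows it."""
    if memory_in_prompt(config) and memory_text:
        return f"{base_prompt}\n\n{memory_text}"
    return base_prompt


opted_out = {"memory": {"include_in_system_prompt": False}}
with_memory = build_system_prompt("You are an agent.", "Memory: tabs", {})
without_memory = build_system_prompt("You are an agent.", "Memory: tabs", opted_out)
```

With the flag off, the prompt no longer depends on memory at all, so a memory flush cannot change the prefix.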
Impact
- Severity: Medium — performance problem, not correctness
- Affected users: Anyone running local models via Ollama/vLLM, especially MoE architectures
- Note: This is NOT an agent harness bug — it's an optimization. The current behavior (rebuild after compression) is correct; it just has a performance cost that's invisible on cloud APIs (which handle caching server-side) and painful on local inference.