[Performance] KV cache invalidation on compression hurts local MoE models — defer unnecessary system prompt rebuilds #4319

@SHL0MS

Problem

Users running local MoE models (Qwen3.5 35B, Mixtral, etc.) via Ollama/vLLM report severe performance degradation during long sessions. Every context compression cycle invalidates the KV cache by rebuilding the system prompt, forcing the model to reprocess the full context from scratch on the next turn.

On a 35B MoE model, reprocessing 30k+ tokens of context is slow: users report multi-minute pauses after each compression.

Current architecture (mostly correct)

The system prompt is already designed for prefix cache stability:

  • _build_system_prompt() (line 2529) runs once per session and its result is cached on self._cached_system_prompt
  • The docstring explicitly notes that this maximizes prefix cache hits (lines 2533-2535)
  • Subsequent turns reuse the stored prompt for Anthropic cache prefix matching (line 6174)
  • Memory flushes write to disk but do NOT invalidate the cache

The only invalidation point is _invalidate_system_prompt() (line 2878), called after compression at line 5192.

What happens during compression

  1. Pre-compression memory flush writes new memories to disk (line 5184)
  2. _invalidate_system_prompt() clears the cache and reloads memory from disk (lines 2885-2887)
  3. _build_system_prompt() rebuilds from scratch with the updated memory (line 5193)
  4. The new system prompt has different memory content → different token prefix → KV cache miss
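
The cache miss in step 4 follows from how prefix caching works: cached KV entries can be reused only up to the first token that differs from the previous request. A minimal sketch of that rule, assuming whitespace-split words as a stand-in for real tokens (actual tokenizers differ) and a hypothetical helper name, shared_prefix_len:

```python
from itertools import takewhile


def shared_prefix_len(old_tokens, new_tokens):
    """Count the leading tokens that are identical in both sequences."""
    pairs = zip(old_tokens, new_tokens)
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], pairs))


# Toy prompts: only the memory line differs between the two builds.
old = "You are an agent. MEMORY: likes tea. TOOLS: read write".split()
new = "You are an agent. MEMORY: likes coffee. TOOLS: read write".split()

reusable = shared_prefix_len(old, new)
# Everything from the first differing token onward must be recomputed,
# including tokens that are unchanged but sit after the difference.
recomputed = len(new) - reusable
```

In this toy case the tool section is identical in both prompts but still has to be reprocessed, because it sits after the changed memory text. In a real session the entire conversation history follows the system prompt, so it is reprocessed as well.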

Optimization opportunities

1. Skip rebuild when memory content hasn't changed

After compression, check if the memory content is identical to what was already in the cached prompt. If so, don't rebuild — just update the compression summary in the conversation history.

def _invalidate_system_prompt(self):
    old_prompt = self._cached_system_prompt
    self._cached_system_prompt = None
    if self._memory_store:
        self._memory_store.load_from_disk()
    # Rebuild and compare — if identical, restore the old prompt
    new_prompt = self._build_system_prompt()
    if new_prompt == old_prompt:
        self._cached_system_prompt = old_prompt  # preserve KV cache
    else:
        self._cached_system_prompt = new_prompt
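
A possible variant compares only the memory content, so that an unchanged memory skips the rebuild entirely instead of rebuilding and then comparing. This is a sketch under stated assumptions, not the project's code: PromptCache, build_prompt, read_memory, and invalidate_if_memory_changed are hypothetical names, and the approach is only valid if memory is the sole prompt input that can change across a compression.

```python
import hashlib


class PromptCache:
    """Hypothetical stand-in for the agent's system prompt caching."""

    def __init__(self, build_prompt, read_memory):
        self._build_prompt = build_prompt  # callable: memory text -> prompt
        self._read_memory = read_memory    # callable: () -> memory text
        self._cached_prompt = None
        self._memory_digest = None
        self.rebuild_count = 0             # exposed for illustration only

    @staticmethod
    def _digest(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def _rebuild(self, memory):
        self._cached_prompt = self._build_prompt(memory)
        self._memory_digest = self._digest(memory)
        self.rebuild_count += 1

    def get_prompt(self):
        if self._cached_prompt is None:
            self._rebuild(self._read_memory())
        return self._cached_prompt

    def invalidate_if_memory_changed(self):
        """Return True if the prompt was rebuilt, False if it was kept."""
        memory = self._read_memory()
        unchanged = self._digest(memory) == self._memory_digest
        if self._cached_prompt is not None and unchanged:
            return False  # same prompt bytes, so the KV prefix stays valid
        self._rebuild(memory)
        return True


# Usage: a dict stands in for the on-disk memory store.
memory = {"text": "user prefers tabs"}
cache = PromptCache(lambda m: f"SYSTEM\n{m}", lambda: memory["text"])
first = cache.get_prompt()
kept = cache.invalidate_if_memory_changed()      # memory unchanged
memory["text"] = "user prefers spaces"
rebuilt = cache.invalidate_if_memory_changed()   # memory changed
```

Returning a boolean lets the caller decide whether any further work is needed after compression. Hashing is optional here; comparing the raw memory strings would work equally well if keeping a copy in memory is acceptable.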

2. Prefix-stable prompt ordering

Structure the system prompt so volatile sections (memory, compression notes) are at the END:

[STABLE] Agent identity (SOUL.md)
[STABLE] Skills guidance
[STABLE] Context files (AGENTS.md)
[STABLE] Tool definitions
[VOLATILE] Memory snapshot
[VOLATILE] Compression summary note
[VOLATILE] Date/time

This way, even when memory changes, Ollama/vLLM can reuse the KV cache for the stable prefix portion.
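
The ordering can be sketched as a small assembly step that tags each section and moves the volatile ones to the end. The names below (STABLE, VOLATILE, assemble_prompt, stable_prefix) and the section texts are illustrative assumptions, not the project's actual prompt builder:

```python
STABLE, VOLATILE = 0, 1


def assemble_prompt(sections):
    """Join sections with all STABLE ones first.

    sorted() is a stable sort, so sections keep their relative order
    within each group.
    """
    ordered = sorted(sections, key=lambda section: section[0])
    return "\n\n".join(text for _, text in ordered)


def stable_prefix(sections):
    """Return the part of the prompt that never changes mid-session."""
    return "\n\n".join(text for kind, text in sections if kind == STABLE)


before = [
    (STABLE, "Agent identity"),
    (VOLATILE, "Memory: likes tea"),
    (STABLE, "Tool definitions"),
    (VOLATILE, "Compression summary: none"),
]
# Same sections after a compression cycle that changed the volatile parts.
after = [
    (STABLE, "Agent identity"),
    (VOLATILE, "Memory: likes coffee"),
    (STABLE, "Tool definitions"),
    (VOLATILE, "Compression summary: 1 cycle"),
]

prompt_before = assemble_prompt(before)
prompt_after = assemble_prompt(after)
```

Both prompts start with the same stable text, so a backend with prefix caching only needs to reprocess the volatile tail and whatever follows it.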

3. Config option to disable memory-in-system-prompt for local models

Local model users who prioritize inference speed over memory freshness could opt out:

memory:
  include_in_system_prompt: false  # keep memories on disk, don't embed in prompt
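
On the Python side, honoring such a flag could look roughly like the sketch below. It assumes the config has already been parsed into a nested dict, and that the option defaults to true so existing behavior is unchanged; build_system_prompt here is a simplified hypothetical function, not the method referenced above.

```python
def build_system_prompt(config, base_sections, memory_text):
    """Assemble the prompt, embedding memory only if the config allows it."""
    memory_cfg = config.get("memory", {})
    # Assumed default: True, which matches the current behavior.
    include_memory = memory_cfg.get("include_in_system_prompt", True)

    parts = list(base_sections)
    if include_memory and memory_text:
        parts.append(memory_text)
    return "\n\n".join(parts)


base = ["Agent identity", "Tool definitions"]
opted_out = {"memory": {"include_in_system_prompt": False}}

# With the flag off, the prompt no longer depends on memory content,
# so a memory change cannot alter the token prefix.
prompt_a = build_system_prompt(opted_out, base, "Memory: likes tea")
prompt_b = build_system_prompt(opted_out, base, "Memory: likes coffee")
default_prompt = build_system_prompt({}, base, "Memory: likes tea")
```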

Impact

  • Severity: Medium — performance problem, not correctness
  • Affected users: Anyone running local models via Ollama/vLLM, especially MoE architectures
  • Note: This is NOT an agent harness bug — it's an optimization. The current behavior (rebuild after compression) is correct; it just has a performance cost that's invisible on cloud APIs (which handle caching server-side) and painful on local inference.
