[Performance] KV cache invalidation on compression hurts local MoE models — defer unnecessary system prompt rebuilds #4319

## Problem

Users running local MoE models (Qwen3.5 35B, Mixtral, etc.) via Ollama/vLLM report severe performance degradation during long sessions. Every context compression cycle invalidates the KV cache by rebuilding the system prompt, forcing the model to reprocess the full context from scratch on the next turn.

On a 35B MoE model, reprocessing 30k+ tokens of context is slow: users report multi-minute pauses after compression.

## Current architecture (mostly correct)

The system prompt is already designed for prefix cache stability:
- `_build_system_prompt()` (line 2529) runs once per session and its result is cached on `self._cached_system_prompt`
- The docstring explicitly notes that this maximizes prefix cache hits (lines 2533-2535)
- Subsequent turns reuse the stored prompt for Anthropic cache prefix matching (line 6174)
- Memory flushes write to disk but do NOT invalidate the cache

The only invalidation point is `_invalidate_system_prompt()` (line 2878), called after compression at line 5192.

## What happens during compression

1. Pre-compression memory flush writes new memories to disk (line 5184)
2. `_invalidate_system_prompt()` clears the cache and reloads memory from disk (lines 2885-2887)
3. `_build_system_prompt()` rebuilds from scratch with the updated memory (line 5193)
4. The new system prompt has different memory content → different token prefix → KV cache miss
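
To make step 4 concrete, here is a minimal sketch of how a prefix-matching KV cache behaves when text near the start of the prompt changes. It assumes whitespace-split words as stand-in tokens, and `reusable_prefix_len` is a hypothetical helper for illustration, not something from the agent, Ollama, or vLLM.

```python
def reusable_prefix_len(old_tokens, new_tokens):
    """Count the leading tokens shared by both sequences.

    A prefix cache can only reuse KV entries up to the first mismatch.
    """
    n = 0
    for old, new in zip(old_tokens, new_tokens):
        if old != new:
            break
        n += 1
    return n


# Toy prompts: whitespace-split words stand in for real tokens.
old = "SYSTEM memory: likes tea TOOLS search USER hello".split()
new = "SYSTEM memory: likes coffee TOOLS search USER hello".split()

shared = reusable_prefix_len(old, new)
print(f"reusable: {shared} of {len(new)} tokens")
```

A single changed memory entry near the top means everything after it, including the whole conversation, has to be reprocessed.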

## Optimization opportunities

### 1. Skip rebuild when memory content hasn't changed

After compression, check whether the memory content is identical to what was already in the cached prompt. If so, keep the cached prompt and only update the compression summary in the conversation history.

```python
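def _invalidate_system_prompt(self):
    """Refresh memory from disk, but keep the cached prompt if nothing changed."""
    old_prompt = self._cached_system_prompt
    self._cached_system_prompt = None
    if self._memory_store:
        self._memory_store.load_from_disk()
    # Rebuild and compare. This only matches if the prompt embeds no
    # per-build values (e.g. date/time); otherwise compare memory content.
    new_prompt = self._build_system_prompt()
    if new_prompt == old_prompt:
        self._cached_system_prompt = old_prompt  # unchanged prefix, KV cache stays valid
    else:
        self._cached_system_prompt = new_prompt  # memory changed, accept the cache miss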
```
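
A variant that avoids the rebuild entirely is to fingerprint the memory content and compare fingerprints. The following is a self-contained sketch under stated assumptions: `PromptCache`, `load_memory`, and `build_prompt` are hypothetical names for illustration and do not correspond to the agent's actual classes or methods.

```python
import hashlib


class PromptCache:
    """Hypothetical stand-in for the agent's prompt caching logic."""

    def __init__(self, load_memory, build_prompt):
        self._load_memory = load_memory    # callable returning memory text
        self._build_prompt = build_prompt  # callable(memory_text) -> prompt
        self._memory_digest = None
        self._cached_prompt = None
        self.rebuild_count = 0

    @staticmethod
    def _digest(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_prompt(self):
        """Build the prompt on first use, then serve the cached copy."""
        if self._cached_prompt is None:
            memory = self._load_memory()
            self._memory_digest = self._digest(memory)
            self._cached_prompt = self._build_prompt(memory)
            self.rebuild_count += 1
        return self._cached_prompt

    def invalidate_if_memory_changed(self):
        """Drop the cached prompt only when the memory content differs."""
        digest = self._digest(self._load_memory())
        if digest == self._memory_digest:
            return False  # same memory: keep the prompt and its KV prefix
        self._cached_prompt = None
        return True


memory = {"text": "user prefers concise answers"}
cache = PromptCache(
    lambda: memory["text"],
    lambda m: f"You are an agent.\n\nMemory:\n{m}",
)

first = cache.get_prompt()
unchanged = cache.invalidate_if_memory_changed()  # memory untouched
second = cache.get_prompt()                       # served from cache

memory["text"] = "user prefers detailed answers"
changed = cache.invalidate_if_memory_changed()    # memory differs
third = cache.get_prompt()                        # rebuilt with new memory
```

Comparing a digest of the memory rather than the full prompt keeps the check independent of any per-build values the prompt may contain.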

### 2. Prefix-stable prompt ordering

Structure the system prompt so volatile sections (memory, compression notes) are at the END:

```
[STABLE] Agent identity (SOUL.md)
[STABLE] Skills guidance
[STABLE] Context files (AGENTS.md)
[STABLE] Tool definitions
[VOLATILE] Memory snapshot
[VOLATILE] Compression summary note
[VOLATILE] Date/time
```

This way, even when memory changes, Ollama/vLLM can reuse the KV cache for the stable prefix portion.
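
As a rough illustration of this ordering, the sketch below assembles a prompt from named sections and checks that a memory change leaves the stable prefix intact. The section keys, the sample text, and `assemble_prompt` are all assumptions made for the example, not the agent's real layout.

```python
import os

# Hypothetical section names, ordered stable-first.
STABLE = ["identity", "skills", "context_files", "tools"]
VOLATILE = ["memory", "compression_note", "datetime"]


def assemble_prompt(sections):
    """Join sections with the stable ones first and the volatile ones last."""
    ordered = STABLE + VOLATILE
    return "\n\n".join(sections[name] for name in ordered if name in sections)


base = {
    "identity": "You are a coding agent.",
    "skills": "Use skills when relevant.",
    "context_files": "Project rules go here.",
    "tools": "Tools: read_file, run_shell",
    "memory": "User prefers tabs.",
    "datetime": "<current date>",  # placeholder value
}
updated = dict(base, memory="User prefers spaces.")

before = assemble_prompt(base)
after = assemble_prompt(updated)

# os.path.commonprefix compares character by character,
# so it works on arbitrary strings, not just paths.
shared = os.path.commonprefix([before, after])
stable_part = "\n\n".join(base[name] for name in STABLE)
print(shared.startswith(stable_part))
```

Note that this only protects the system prompt itself: conversation messages come after the system prompt, so they still sit behind the changed volatile section.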

### 3. Config option to disable memory-in-system-prompt for local models

Local model users who prioritize inference speed over memory freshness could opt out:

```yaml
memory:
  include_in_system_prompt: false  # keep memories on disk, don't embed in prompt
```
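
One way the flag could be honored is sketched below. A plain dict stands in for the parsed YAML, and `build_system_prompt` here is a hypothetical free function written for this example, not the agent's `_build_system_prompt()`.

```python
def build_system_prompt(base_prompt, memory_text, config):
    """Append the memory snapshot unless the config opts out."""
    memory_cfg = config.get("memory", {})
    # Default to True so existing setups keep today's behavior.
    include = memory_cfg.get("include_in_system_prompt", True)
    if include and memory_text:
        return f"{base_prompt}\n\n## Memory\n{memory_text}"
    return base_prompt


# A dict standing in for the parsed YAML config.
local_cfg = {"memory": {"include_in_system_prompt": False}}

p1 = build_system_prompt("You are an agent.", "likes tea", local_cfg)
p2 = build_system_prompt("You are an agent.", "likes coffee", local_cfg)
default_prompt = build_system_prompt("You are an agent.", "likes tea", {})
```

With the flag off, memory changes no longer alter the prompt text, so the prefix stays byte-identical across compression cycles.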

## Impact

- **Severity**: Medium — performance problem, not correctness
- **Affected users**: Anyone running local models via Ollama/vLLM, especially MoE architectures
- **Note**: This is NOT an agent harness bug — it's an optimization. The current behavior (rebuild after compression) is correct; it just has a performance cost that's invisible on cloud APIs (which handle caching server-side) and painful on local inference.
