[Bug]: Honcho auto-injected context rebuilds cached system prompt every N turns, invalidating KV cache on every prefix-caching backend

## Bug Description

`AGENTS.md` states a hard policy on prompt caching:

> ### Prompt Caching Must Not Break
>
> Hermes-Agent ensures caching remains valid throughout a conversation. **Do NOT implement changes that would:**
> - Alter past context mid-conversation
> - Change toolsets mid-conversation
> - **Reload memories or rebuild system prompts mid-conversation**
>
> Cache-breaking forces dramatically higher costs. The ONLY time we alter context is during context compression.

The Honcho memory provider violates this directly.

Per the **Prompt Assembly** architecture doc, Layer 3 of the **cached** system prompt is the "Honcho static block (when active)". Per the **Honcho Memory** integration doc, that same block is:

- **Base context** — session summary, user representation, user peer card, AI self-representation, AI identity card. *"Refreshed on `contextCadence`."*
- **Dialectic supplement** — LLM-synthesized reasoning about the user's current state. *"Refreshed on `dialecticCadence`."*

Every refresh rewrites Layer 3 of the cached prompt **mid-conversation**, invalidating the entire KV cache on every backend that does prefix caching:

- Local: `llama.cpp`, `vLLM --enable-prefix-caching`, MLX with PagedAttention
- Cloud: Anthropic / OpenAI / Bedrock with explicit cache control

Causal attention means mid-prefix mutations invalidate everything after the first divergence — every token downstream of the Honcho block must be reprocessed from scratch. On a ~20K-token session that is a multi-minute reprocess on every cadence tick.

The same Prompt Assembly doc already classifies *"later-turn Honcho recall injected into the current-turn user message"* as an **API-call-time-only layer** (explicitly **not** persisted to the cached system prompt) — exactly the correct pattern. The auto-injected base / dialectic block just hasn't been moved over.

---

## Steps to Reproduce

1. Configure Hermes with `recallMode: \"hybrid\"` (or `\"context\"`) and any finite `contextCadence` / `dialecticCadence`.
2. Point Hermes at a backend with prefix caching enabled (e.g. local `vllm --enable-prefix-caching`, `llama.cpp`, MLX via vLLM-MLX).
3. Send the first user message — first-turn full prefill is expected.
4. Send *N−1* more messages where *N* = `min(contextCadence, dialecticCadence)`. Prefix cache hits as expected.
5. Send one more message. Observe: **full prefill from scratch**, not just the new tail tokens.

**Cross-check:** run the same backend + model under a client that does not mutate the system prompt mid-session (e.g. ` opencode`). Prefix cache hits on every turn, confirming the backend is not at fault.

---

## Expected Behavior

Per the stated policy: the cached system prompt is byte-identical for the duration of a session (compression excepted).

Auto-injected Honcho context rides on the current user message, same path as the existing \`_ext_prefetch_cache\` injection (\`run_agent.py\` ~L6737–6740), and is therefore ephemeral and cache-safe.

---

## Actual Behavior

Every \`contextCadence\` / \`dialecticCadence\` tick rewrites Layer 3 of the cached system prompt.

Observed on Gemma 4 31B 8-bit (MLX) served via a \`vllm-mlx\` fork with PagedAttention prefix caching:

| \`contextCadence\` | Cache-miss rate | Per-turn latency on miss |
| --- | --- | --- |
| \`1\` | 100 % of turns | ~5 min (≈20K tokens full prefill) |
| \`4\` | ~25 % of turns | ~5 min on every 4th turn |

With \`contextCadence: 1\` every user turn is a full prefill. Bumping to 4 reduces the rate to one in four, but the underlying design is still wrong — on any multi-turn session, some fraction of turns will always incur a full prefill regardless of how it is tuned.

---

## Root Cause

\`agent/prompt_builder.py\` emits the Honcho static block inside \`_build_system_prompt()\`, whose result is cached on \`self._cached_system_prompt\`. The Honcho memory provider's \`prefetch_all()\` result populates this block and is deliberately refreshed on cadence, contradicting the cached-prompt contract.

The correct injection site already exists: \`_ext_prefetch_cache\` in \`run_agent.py\` appends per-turn context to the current user message (same file, ~L6737–6740 — also referenced in #5719). Only a subset of Honcho output currently goes through that path.

---

## Proposed Fix

1. Remove the Honcho static block from Layer 3 of \`_build_system_prompt()\` in \`agent/prompt_builder.py\`.
2. Extend \`_ext_prefetch_cache\` (or an equivalent per-turn slot) in \`run_agent.py\` to carry the formatted \`## Honcho Context\` block for the current turn only.
3. Document and enforce the invariant: **the cached system prompt is byte-stable for the session**; any Honcho content that varies turn-to-turn must ride on the current user message.

This is the same architectural move already proposed in #3353 for runtime metadata, and enforces the policy stated in \`AGENTS.md\`. It subsumes the \`contextCadence\` / \`dialecticCadence\` knobs under the same correctness guarantee instead of requiring users to manually trade off memory freshness against prefill cost.

---

## Environment

| Component | Value |
| --- | --- |
| Hermes Agent | **v0.10.0 (2026.4.16)** — 34 commits behind \`main\` |
| Python | 3.11.15 |
| OpenAI SDK | 2.31.0 |
| OS | macOS 26.4.1 (Darwin 25.4.0, \`arm64 T6020\` / M2 Pro) |
| Client host | \`sparta.local\` |
| Inference backend | Separate M2 Ultra (128 GB) running a custom \`mlx-hosting\` glue in front of a \`vllm-mlx\` fork with PagedAttention prefix caching; model Gemma 4 31B 8-bit MLX |
| Honcho | Self-hosted, \`baseUrl: http://127.0.0.1:8000\` |

Relevant Honcho config (\`~/.hermes/honcho.json\`):

\`\`\`json
{
  \"recallMode\": \"hybrid\",
  \"contextCadence\": 4,
  \"dialecticCadence\": 4,
  \"dialecticDepth\": 1,
  \"dialecticReasoningLevel\": \"medium\",
  \"observationMode\": \"directional\",
  \"writeFrequency\": \"async\",
  \"sessionStrategy\": \"per-directory\"
}
\`\`\`

---

## Related Issues

Same architectural class — mid-session mutation of cached prompt state:

- #3353 — \`perf(ttft): move runtime metadata out of the cached system prompt\` — **same principle, different varying field**; this bug is a much larger, more frequently varying instance of the same anti-pattern.
- #8687 — \`perf: system prompt timestamp changes after compression, invalidating prefix cache on local LLM backends\` — same class, timestamp specifically.
- #4319 — \`[Performance] KV cache invalidation on compression hurts local MoE models\` — same class, compression trigger.
- #4555 — \`[Bug]: KV cache invalidation on new user message due to message format differences between agentic loop and session reload\` — byte-stability principle for messages on reload.
- #13442 — \`[Performance] Missing global conversation history state causes 314x slowdown on llama.cpp with Qwen3.5-27B\` — downstream symptom of the same invariant being violated.

Adjacent Honcho issues (for completeness, **not** duplicates of this bug):

- #5719 — Honcho context block leaking into vision auto-analysis output
- #5102 — Thread-safety race in \`HonchoSessionManager._cache\`
- #4751 — Sequential chat dispatch / session-key / peer-card issues
- #2150 — Feature request for cached startup context in \`recallMode=tools\`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: Honcho auto-injected context rebuilds cached system prompt every N turns, invalidating KV cache on every prefix-caching backend #13631

Bug Description

Prompt Caching Must Not Break

Steps to Reproduce

Expected Behavior

Actual Behavior

Root Cause

Proposed Fix

Environment

Related Issues

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

`contextCadence`	Cache-miss rate	Per-turn latency on miss
`1`	100 % of turns	~5 min (≈20K tokens full prefill)
`4`	~25 % of turns	~5 min on every 4th turn

Component	Value
Hermes Agent	v0.10.0 (2026.4.16) — 34 commits behind `main`
Python	3.11.15
OpenAI SDK	2.31.0
OS	macOS 26.4.1 (Darwin 25.4.0, `arm64 T6020` / M2 Pro)
Client host	`sparta.local`
Inference backend	Separate M2 Ultra (128 GB) running a custom `mlx-hosting` glue in front of a `vllm-mlx` fork with PagedAttention prefix caching; model Gemma 4 31B 8-bit MLX
Honcho	Self-hosted, `baseUrl: http://127.0.0.1:8000\`

[Bug]: Honcho auto-injected context rebuilds cached system prompt every N turns, invalidating KV cache on every prefix-caching backend #13631

Description

Bug Description

Prompt Caching Must Not Break

Steps to Reproduce

Expected Behavior

Actual Behavior

Root Cause

Proposed Fix

Environment

Related Issues

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions