Skip to content

[Bug]: Honcho auto-injected context rebuilds cached system prompt every N turns, invalidating KV cache on every prefix-caching backend #13631

@0xAlcibiades

Description

@0xAlcibiades

Bug Description

AGENTS.md states a hard policy on prompt caching:

Prompt Caching Must Not Break

Hermes-Agent ensures caching remains valid throughout a conversation. Do NOT implement changes that would:

  • Alter past context mid-conversation
  • Change toolsets mid-conversation
  • Reload memories or rebuild system prompts mid-conversation

Cache-breaking forces dramatically higher costs. The ONLY time we alter context is during context compression.

The Honcho memory provider violates this directly.

Per the Prompt Assembly architecture doc, Layer 3 of the cached system prompt is the "Honcho static block (when active)". Per the Honcho Memory integration doc, that same block is:

  • Base context — session summary, user representation, user peer card, AI self-representation, AI identity card. "Refreshed on contextCadence."
  • Dialectic supplement — LLM-synthesized reasoning about the user's current state. "Refreshed on dialecticCadence."

Every refresh rewrites Layer 3 of the cached prompt mid-conversation, invalidating the entire KV cache on every backend that does prefix caching:

  • Local: llama.cpp, vLLM --enable-prefix-caching, MLX with PagedAttention
  • Cloud: Anthropic / OpenAI / Bedrock with explicit cache control

Causal attention means mid-prefix mutations invalidate everything after the first divergence — every token downstream of the Honcho block must be reprocessed from scratch. On a ~20K-token session that is a multi-minute reprocess on every cadence tick.

The same Prompt Assembly doc already classifies "later-turn Honcho recall injected into the current-turn user message" as an API-call-time-only layer (explicitly not persisted to the cached system prompt) — exactly the correct pattern. The auto-injected base / dialectic block just hasn't been moved over.


Steps to Reproduce

  1. Configure Hermes with recallMode: \"hybrid\" (or \"context\") and any finite contextCadence / dialecticCadence.
  2. Point Hermes at a backend with prefix caching enabled (e.g. local vllm --enable-prefix-caching, llama.cpp, MLX via vLLM-MLX).
  3. Send the first user message — first-turn full prefill is expected.
  4. Send N−1 more messages where N = min(contextCadence, dialecticCadence). Prefix cache hits as expected.
  5. Send one more message. Observe: full prefill from scratch, not just the new tail tokens.

Cross-check: run the same backend + model under a client that does not mutate the system prompt mid-session (e.g. opencode). Prefix cache hits on every turn, confirming the backend is not at fault.


Expected Behavior

Per the stated policy: the cached system prompt is byte-identical for the duration of a session (compression excepted).

Auto-injected Honcho context rides on the current user message, same path as the existing `_ext_prefetch_cache` injection (`run_agent.py` ~L6737–6740), and is therefore ephemeral and cache-safe.


Actual Behavior

Every `contextCadence` / `dialecticCadence` tick rewrites Layer 3 of the cached system prompt.

Observed on Gemma 4 31B 8-bit (MLX) served via a `vllm-mlx` fork with PagedAttention prefix caching:

`contextCadence` Cache-miss rate Per-turn latency on miss
`1` 100 % of turns ~5 min (≈20K tokens full prefill)
`4` ~25 % of turns ~5 min on every 4th turn

With `contextCadence: 1` every user turn is a full prefill. Bumping to 4 reduces the rate to one in four, but the underlying design is still wrong — on any multi-turn session, some fraction of turns will always incur a full prefill regardless of how it is tuned.


Root Cause

`agent/prompt_builder.py` emits the Honcho static block inside `_build_system_prompt()`, whose result is cached on `self._cached_system_prompt`. The Honcho memory provider's `prefetch_all()` result populates this block and is deliberately refreshed on cadence, contradicting the cached-prompt contract.

The correct injection site already exists: `_ext_prefetch_cache` in `run_agent.py` appends per-turn context to the current user message (same file, ~L6737–6740 — also referenced in #5719). Only a subset of Honcho output currently goes through that path.


Proposed Fix

  1. Remove the Honcho static block from Layer 3 of `_build_system_prompt()` in `agent/prompt_builder.py`.
  2. Extend `_ext_prefetch_cache` (or an equivalent per-turn slot) in `run_agent.py` to carry the formatted `## Honcho Context` block for the current turn only.
  3. Document and enforce the invariant: the cached system prompt is byte-stable for the session; any Honcho content that varies turn-to-turn must ride on the current user message.

This is the same architectural move already proposed in #3353 for runtime metadata, and enforces the policy stated in `AGENTS.md`. It subsumes the `contextCadence` / `dialecticCadence` knobs under the same correctness guarantee instead of requiring users to manually trade off memory freshness against prefill cost.


Environment

Component Value
Hermes Agent v0.10.0 (2026.4.16) — 34 commits behind `main`
Python 3.11.15
OpenAI SDK 2.31.0
OS macOS 26.4.1 (Darwin 25.4.0, `arm64 T6020` / M2 Pro)
Client host `sparta.local`
Inference backend Separate M2 Ultra (128 GB) running a custom `mlx-hosting` glue in front of a `vllm-mlx` fork with PagedAttention prefix caching; model Gemma 4 31B 8-bit MLX
Honcho Self-hosted, `baseUrl: http://127.0.0.1:8000\`

Relevant Honcho config (`~/.hermes/honcho.json`):

```json
{
"recallMode": "hybrid",
"contextCadence": 4,
"dialecticCadence": 4,
"dialecticDepth": 1,
"dialecticReasoningLevel": "medium",
"observationMode": "directional",
"writeFrequency": "async",
"sessionStrategy": "per-directory"
}
```


Related Issues

Same architectural class — mid-session mutation of cached prompt state:

Adjacent Honcho issues (for completeness, not duplicates of this bug):

Metadata

Metadata

Assignees

No one assigned

    Labels

    comp/agentCore agent loop, run_agent.py, prompt buildertype/perfPerformance improvement or optimization

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions